# The Mathematics and Geometry of Gradient Descent

Gradient Descent is one of the most widely used Optimization techniques that has profound use in Machine learning. Although one of the more simpler methods because of its ease of use and less memory requirements it has been widely adopted in the Data Science and Machine Learning Community.

If we have a cost function where the update rule as per Gradient Descent is given by where is the learning rate.

We might have come across in several articles that the gradient of the cost function gives the direction of the maximum increase in the function where as the negative of the gradient gives the direction of maximum decrease of the cost function. What is often not mentioned is that the negative of the gradient of the function gives the maximum decrease of the function assuming that the function is linear in the neighborhood of the point at which gradient is evaluated. As per Taylor’s theorem any function can be approximated in the neighborhood of the point at which the gradient is evaluated linearly as

With gradient descent, of all the possible of a given magnitude we would like to choose a so that is minimum. Mathematically we have a constrained optimization problem

subject to

The constrained optimization problem can be solved by minimizing

where is the Lagrangian Multiplier.

The optimized parameter value is given by

To get the optimal we need to compute the gradient of with respect to and set it to zero. The gradient of is as expressed below

Setting to zero we get

where

So if and are the parameter values at iteration and respectively then as per gradient descent

Now that we have understood the Mathematics behind the gradient descent formulation let's try to interpret the algorithm geometrically. Lets suppose we are trying to optimize a function of two variables through Gradient Descent where

For each value of the function, it can be represented by a contour line in the plane.

For example if we take then for each value the expression would represent an elliptical contour. All the points on the same contour line would correspond to the same value of the cost function.

If we take two points and on the same contour then the difference between the value of the cost function at these two points is zero. Also from the properties of partial derivatives can be expressed as

Now represents the tangential direction to the contour line at the point . Hence we can see that the gradient of the cost function at any point is perpendicular to the tangent to the contour line at that point.

The gradient is denoted by the dotted blue arrow. As we can see the negative of the gradient denoted by blue arrows gives a good direction towards the optimal minimum. By moving in the direction of the negative gradients at each iteration we finally reach the optimal minimum M. At each of the points in the different contour lines the negative of the gradient is perpendicular to the tangent denoted by the red arrow.

Once the algorithm reaches the optimal point, the gradient at the optimal point is technically the zero vector and hence there would be no further update as per the gradient descent update rule. However since there are going to be computational rounding errors,etc. involved gradient descent might not reach the exact zero gradient vector pertaining to the optimal point and hence checks such as magnitude of gradient vector below a threshold can be taken as the criteria to stop the iterative process. Other common method to stop the iterative process is to specify the number of iterations the user thinks is enough to get to the optimal point.

How much to move in the negative direction of gradient is determined by the learning rate. If the learning rate is low it would take more iterations to reach the optimal point. However if the learning rate is too high we will reach near the optimal point fast but at around the optimal point there might be oscillations. Hence the learning rate is to be chosen judiciously. Several intelligent heuristics can be applied for selecting the learning rate which deserves a topic in itself. Just to name one we can start with a high learning rate and in each iteration we can keep on decreasing the learning rate so that we can avoid the oscillations near the optimal point .

Hope this article helps in providing the mathematical and geometrical intuition regarding the gradient descent technique.

## 3 Responses to “The Mathematics and Geometry of Gradient Descent”

## Siddhartha

Very well described useful content, eager to read full book soon..

## jeeban

Nice post !!

## Vikulp

Nice explained Santanu.