The Mathematics and Geometry of Gradient Descent

Gradient Descent is one of the most widely used optimization techniques in Machine Learning. Although it is one of the simpler methods, its ease of use and low memory requirements have led to its wide adoption in the Data Science and Machine Learning community.

If we have a cost function  \displaystyle C(\theta) where  \displaystyle \theta = [\theta_{1}\ \theta_{2}\ \dots\ \theta_{n}]^T, the update rule as per Gradient Descent is given by  \displaystyle \theta^{(t+1)} = \theta^{(t)} - \eta\nabla C(\theta^{(t)}), where  \eta is the learning rate.
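As a quick illustration, here is a minimal sketch of this update rule in Python, assuming the quadratic cost C(\theta) = \theta_1^2/2 + \theta_2^2/3 that is used as an example later in this article; the learning rate, starting point, and iteration count are illustrative choices, not prescriptions.

import numpy as np

def grad_C(theta):
    # Gradient of the example cost C(theta) = theta_1**2 / 2 + theta_2**2 / 3
    return np.array([theta[0], 2 * theta[1] / 3])

eta = 0.1                               # learning rate (illustrative choice)
theta = np.array([4.0, 3.0])            # arbitrary starting point
for t in range(200):
    # theta^(t+1) = theta^(t) - eta * grad C(theta^(t))
    theta = theta - eta * grad_C(theta)
print(theta)                            # approaches the minimum at the origin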

We might have come across, in several articles, the statement that the gradient of the cost function gives the direction of maximum increase of the function, whereas the negative of the gradient gives the direction of maximum decrease. What is often not mentioned is that the negative of the gradient gives the direction of maximum decrease only under the assumption that the function is linear in the neighborhood of the point at which the gradient is evaluated. As per Taylor's theorem, any differentiable function can be approximated linearly in the neighborhood of the point at which the gradient is evaluated as

 

 \displaystyle C(\theta + \Delta\theta) \approx C(\theta) + \Delta\theta^T\nabla C(\theta)

 

With gradient descent, of all the possible  \displaystyle \Delta\theta of a given magnitude  \displaystyle k, we would like to choose the  \displaystyle \Delta\theta for which  \displaystyle C(\theta + \Delta\theta) is minimized. Mathematically, we have a constrained optimization problem:

 

 \displaystyle \arg\min_{\Delta\theta} C(\theta + \Delta\theta) \quad \text{subject to} \quad \lVert\Delta\theta\rVert_{2} = k

 

The constrained optimization problem can be solved by the method of Lagrange multipliers, i.e., by minimizing

 

 \displaystyle F(\Delta\theta) = C(\theta + \Delta\theta) + \lambda(\Delta\theta^T\Delta\theta - k^2), where  \displaystyle \lambda is the Lagrange multiplier.

 

The optimal step  \displaystyle \Delta\theta^{*} is given by

 

 \displaystyle \Delta\theta^{*} = \arg\min_{\Delta\theta} F(\Delta\theta) \approx \arg\min_{\Delta\theta}\left[C(\theta) + \Delta\theta^T\nabla C(\theta) + \lambda(\Delta\theta^T\Delta\theta - k^2)\right]

 

To get the optimal  \displaystyle \Delta\theta^{*}, we need to compute the gradient of  \displaystyle F(\Delta\theta) with respect to  \displaystyle \Delta\theta and set it to zero. The gradient of  \displaystyle F(\Delta\theta) is expressed below:

 

 \displaystyle \nabla F(\Delta\theta) = \nabla C(\theta) + 2\lambda\Delta\theta

 

Setting  \displaystyle \nabla F(\Delta\theta) to zero, we get

 

 \displaystyle \nabla C(\theta) + 2\lambda\Delta\theta = 0

 

 \displaystyle \Rightarrow \Delta\theta = -\frac{1}{2\lambda}\nabla C(\theta)

 

 \displaystyle \Rightarrow \Delta\theta = -\eta\nabla C(\theta), \quad \text{where } \eta = \frac{1}{2\lambda}

 

So if  \displaystyle \theta^{(t+1)} and  \displaystyle \theta^{(t)} are the parameter values at iterations  \displaystyle (t+1) and  \displaystyle t respectively, then as per gradient descent

 

 \displaystyle \theta^{(t+1)} = \theta^{(t)} + \Delta\theta = \theta^{(t)} - \eta\nabla C(\theta^{(t)})
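The derivation can also be checked numerically. The sketch below, assuming the elliptical cost used later in this article and an arbitrary point, verifies through the first-order Taylor term that no step of the same magnitude k gives a more negative \Delta\theta^T\nabla C(\theta) than the step along the negative gradient.

import numpy as np

def grad_C(theta):
    # Gradient of the example cost C(theta) = theta_1**2 / 2 + theta_2**2 / 3
    return np.array([theta[0], 2 * theta[1] / 3])

rng = np.random.default_rng(0)
theta = np.array([2.0, 1.0])               # arbitrary point
k = 1e-3                                   # fixed step magnitude
g = grad_C(theta)

# First-order decrease achieved by the steepest-descent step of magnitude k
best_linear_term = -k * np.linalg.norm(g)

# No random step of the same magnitude gives a more negative linear term
for _ in range(1000):
    d = rng.standard_normal(2)
    d = k * d / np.linalg.norm(d)          # random direction scaled to magnitude k
    assert g @ d >= best_linear_term - 1e-12   # small tolerance for floating point
print("The negative gradient direction gives the largest first-order decrease.")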

 

Now that we have understood the mathematics behind the gradient descent formulation, let's try to interpret the algorithm geometrically. Let's suppose we are trying to optimize a function of two variables  \displaystyle C(\theta) = C(\theta_1,\theta_2) through Gradient Descent, where  \displaystyle \theta = [\theta_1\ \theta_2]^T.

Each value of the function can be represented by a contour line in the  \displaystyle (\theta_1,\theta_2) plane.

For example, if we take  \displaystyle C(\theta_1,\theta_2) = \frac{\theta_1^{2}}{2} + \frac{\theta_2^{2}}{3}, then for each value  \displaystyle c the equation  \displaystyle \frac{\theta_1^{2}}{2} + \frac{\theta_2^{2}}{3} = c represents an elliptical contour. All the points on the same contour line correspond to the same value of the cost function.
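As a small sanity check, the sketch below parametrizes one such elliptical contour as \theta_1 = \sqrt{2c}\cos t, \theta_2 = \sqrt{3c}\sin t (an assumed parametrization, not from the derivation above) and confirms that the cost is constant along it.

import numpy as np

def C(theta1, theta2):
    return theta1 ** 2 / 2 + theta2 ** 2 / 3

c = 4.0                                    # arbitrary contour level
t = np.linspace(0.0, 2.0 * np.pi, 200)
theta1 = np.sqrt(2 * c) * np.cos(t)        # points on the contour C = c
theta2 = np.sqrt(3 * c) * np.sin(t)
print(np.allclose(C(theta1, theta2), c))   # True: constant cost along the contour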

If we take two nearby points  \displaystyle \theta^{(1)} and  \displaystyle \theta^{(2)} on the same contour, then the difference between the values of the cost function at these two points,  \displaystyle \delta C, is zero. Also, from the properties of partial derivatives,  \displaystyle \delta C can be expressed (to first order) as

 

 \displaystyle \delta{C} = \frac{\partial{C}}{\partial{\theta_1}}\delta\theta_1  + \frac{\partial{C}}{\partial{\theta_2}}\delta\theta_2

 

 \displaystyle 0 = \left[\frac{\partial C}{\partial\theta_1}\ \ \frac{\partial C}{\partial\theta_2}\right][\delta\theta_1\ \ \delta\theta_2]^T

 

 \displaystyle 0 = \nabla C(\theta)^T[\delta\theta_1\ \ \delta\theta_2]^T

 

Now  \displaystyle [\delta\theta_1\ \ \delta\theta_2]^T represents the tangential direction to the contour line at the point  \displaystyle \theta^{(1)}. Hence we can see that the gradient of the cost function at any point  \displaystyle \theta^{(1)} is perpendicular to the tangent to the contour line at that point.
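The perpendicularity claim can be illustrated numerically with the same elliptical cost. In the sketch below, the tangent direction is obtained by differentiating the contour parametrization introduced above; the contour level and the point on it are arbitrary choices.

import numpy as np

c, t = 4.0, 0.7                                        # arbitrary contour level and point
theta1 = np.sqrt(2 * c) * np.cos(t)
theta2 = np.sqrt(3 * c) * np.sin(t)
grad = np.array([theta1, 2 * theta2 / 3])              # nabla C at the point
tangent = np.array([-np.sqrt(2 * c) * np.sin(t),
                    np.sqrt(3 * c) * np.cos(t)])       # d(theta)/dt along the contour
print(np.isclose(grad @ tangent, 0.0))                 # True: gradient is perpendicular to the tangent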

[Figure: elliptical contour lines of the cost function, with the negative gradient directions (blue arrows) and the tangents to the contours (red arrows) at successive points, converging to the minimum M.]

The gradient is denoted by the dotted blue arrow. As we can see, the negative of the gradient, denoted by the blue arrows, gives a good direction towards the optimal minimum. By moving in the direction of the negative gradient at each iteration, we finally reach the optimal minimum M. At each point on the different contour lines, the negative of the gradient is perpendicular to the tangent, denoted by the red arrow.

Once the algorithm reaches the optimal point, the gradient at that point is technically the zero vector, and hence there would be no further update as per the gradient descent update rule. However, since computational rounding errors and the like are involved, gradient descent might never reach the exact zero gradient vector at the optimal point. Hence a check such as the magnitude of the gradient vector falling below a threshold can be taken as the criterion to stop the iterative process. Another common method to stop the iterative process is to specify the number of iterations the user thinks is enough to get to the optimal point.
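Below is a minimal sketch of both stopping criteria, assuming the same quadratic example cost as before; the tolerance and the iteration cap are illustrative values, not recommendations.

import numpy as np

def gradient_descent(grad_C, theta0, eta=0.1, tol=1e-6, max_iter=10_000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):                 # stop after a fixed number of iterations
        g = grad_C(theta)
        if np.linalg.norm(g) < tol:           # or stop once the gradient is numerically zero
            break
        theta = theta - eta * g
    return theta

theta_star = gradient_descent(lambda th: np.array([th[0], 2 * th[1] / 3]), [4.0, 3.0])
print(theta_star)                             # close to the minimum at the origin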

How much to move in the negative direction of the gradient is determined by the learning rate. If the learning rate is low, it takes more iterations to reach the optimal point. However, if the learning rate is too high, we will get near the optimal point fast, but around the optimal point there might be oscillations. Hence the learning rate is to be chosen judiciously. Several intelligent heuristics can be applied for selecting the learning rate, which deserve a topic in themselves. Just to name one, we can start with a high learning rate and keep decreasing it in each iteration so that we avoid the oscillations near the optimal point.
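As one concrete (and purely illustrative) instance of this heuristic, the sketch below uses a 1/t-style decay of the learning rate with the same quadratic example; the initial rate and decay constant are assumed values.

import numpy as np

def grad_C(theta):
    return np.array([theta[0], 2 * theta[1] / 3])

theta = np.array([4.0, 3.0])
eta0, decay = 0.5, 0.01                       # illustrative schedule constants
for t in range(1000):
    eta_t = eta0 / (1.0 + decay * t)          # learning rate shrinks with iteration t
    theta = theta - eta_t * grad_C(theta)
print(theta)                                  # the damped steps settle near the minimum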

 

Hope this article helps in providing the mathematical and geometrical intuition regarding the gradient descent technique.

 
