Newton's Method vs Gradient Descent in Tackling Saddle Points in Non-Convex Optimization

For convex functions, Newton's method leads to faster convergence since it is a second-order optimization technique that takes the curvature of the cost surface into account. The curvature information at a given point is provided by the Hessian matrix of the cost function. Although Newton's method works well for convex optimization, where there is a single global minimum, it has its own issues when the cost surface is highly non-convex. The good thing about gradient descent is that it is attracted only by local minima, since it uses the negative of the gradient to iteratively find points with lower cost than the prior point. Newton's method, on the other hand, builds a second-order approximation of the cost function in each iteration and tries to reach a stationary point (having zero gradient) of that second-order approximation. What that essentially means is that the new point might be a maximum, a minimum or a saddle point of the second-order approximation.

Coming to some math, Newton's method starts from the second-order Taylor expansion of a cost function  \displaystyle C(\theta) in the neighbourhood of  \displaystyle \theta as below

 \displaystyle C(\theta + \Delta\theta) \approx C(\theta) + \Delta\theta^{T}\nabla C(\theta) + \frac{1}{2}\Delta\theta^{T}H(\theta)\Delta\theta

where  \displaystyle H(\theta) and  \displaystyle \nabla C(\theta) are the Hessian and the gradient of the cost function at the point  \displaystyle \theta .

 

To reach a stationary point of this second-order approximation, its gradient with respect to  \displaystyle \Delta\theta should be zero. Differentiating the right-hand side with respect to  \displaystyle \Delta\theta (using the symmetry of the Hessian) gives  \displaystyle \nabla C(\theta) + H(\theta)\Delta\theta .

Setting this gradient to zero we have

 \displaystyle \nabla C(\theta + \Delta\theta) = \nabla C(\theta) + H(\theta)\Delta\theta = 0

 

 \displaystyle \Rightarrow \Delta\theta = -H(\theta)^{-1}\nabla C(\theta)

 \displaystyle \Rightarrow \theta + \Delta\theta = \theta - H(\theta)^{-1}\nabla C(\theta)

 

 

The preceding is the celebrated Newton's method update equation, and it can be generalized to each iteration as below

 

 \displaystyle \theta^{(n + 1)} = \theta^{(n)} - H(\theta^{(n)})^{-1}\nabla C(\theta^{(n)})

 

where  \displaystyle \theta^{(n + 1)} and  \displaystyle \theta^{(n)} are the parameter values at iterations  \displaystyle n+1 and  \displaystyle n of Newton's method respectively.

 

Just to refresh our memory, the gradient descent update rule is given by -

 

 \displaystyle \theta^{(n + 1)} = \theta^{(n)} - \eta\nabla C(\theta^{(n)})

 

where  \displaystyle \eta is the learning rate.

 

Coming back to saddle points, they are stationary points with zero gradient but are neither local minima nor local maxima. At a minimum, all the eigenvalues of the Hessian matrix should be positive, while at a maximum all the eigenvalues should be negative. At a saddle point the Hessian matrix has both positive and negative eigenvalues. The directions given by the eigenvectors of the Hessian corresponding to positive eigenvalues are directions along which the cost function behaves like a local minimum, while the directions given by the eigenvectors corresponding to negative eigenvalues are directions along which it behaves like a local maximum. This makes a saddle point a local maximum along some directions and a local minimum along others.

Illustrated below in Figure 1 is the plot of  \displaystyle f(x,y) = x^2 - y^2 with the saddle point  \displaystyle {[0,0]}^T labeled as O. From the saddle point O, if we travel along the  \displaystyle x axis on either side of  \displaystyle x=0 the cost function increases, while if we go in either direction along the  \displaystyle y axis from  \displaystyle y=0 the cost function decreases. This illustrates that the saddle point is a point of minimum with respect to the  \displaystyle x axis and a point of maximum with respect to the  \displaystyle y axis. To confirm the same, we can compute the Hessian matrix of the cost function at  \displaystyle {[0,0]}^T , which turns out to be  \displaystyle \begin{pmatrix} 2 & 0\\ 0 & -2 \end{pmatrix} . The two eigenvalues are  \displaystyle 2 and  \displaystyle -2 , corresponding to the eigenvectors  \displaystyle {[1,0]}^T , which represents the  \displaystyle x axis, and  \displaystyle {[0,1]}^T , which represents the  \displaystyle y axis.

Figure 1 - Illustration of a saddle point (source: Wikipedia)
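As a quick numerical check of the eigen-analysis above, here is a minimal NumPy sketch (not from the original post) that builds the constant Hessian of  \displaystyle f(x,y) = x^2 - y^2 and inspects its eigenvalues and eigenvectors:

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 -- it is constant for this quadratic function
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues in ascending order, eigenvectors as columns
print("eigenvalues :", eigvals)        # [-2.  2.] -> mixed signs, so [0, 0] is a saddle point
print("eigenvectors:", eigvecs)        # columns are the y axis [0, 1] and the x axis [1, 0]
```

The mixed signs of the eigenvalues confirm that the origin is a saddle point rather than a minimum or a maximum.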

If we observe the region around the saddle point (0,0) in Figure 1, the surface is fairly flat, like a plateau, with very small gradients. As per our discussion thus far, gradient descent should be able to escape the saddle point since it is not a local minimum, while Newton's method would be attracted to the saddle point since it is a stationary point (zero gradient). The flat plateau does pose some issues for gradient descent, in that a lot of time is spent getting out of the saddle zone because of the small gradients there. However, gradient descent would never be attracted towards the saddle point.

Now let's observe the behavior of both Newton's method and gradient descent near the saddle point of the non-convex quadratic cost function  \displaystyle f(x,y) = x^2 - y^2 = f(\theta) and see whether these algorithms are attracted to the saddle point  \displaystyle (0,0) .
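The original post ran a small simulation at this point; since that code is not reproduced here, below is a minimal NumPy sketch of what a single Newton update for this function could look like. The gradient and Hessian are written out analytically, and the starting point  \displaystyle {[0.5,0.5]}^T is an assumed choice purely for illustration (the post does not state the first initial point).

```python
import numpy as np

def grad(theta):
    """Gradient of f(x, y) = x^2 - y^2."""
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

def hessian(theta):
    """Hessian of f(x, y) = x^2 - y^2 (constant for a quadratic)."""
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

def newton_step(theta):
    """One Newton update: theta - H(theta)^{-1} grad(theta)."""
    return theta - np.linalg.solve(hessian(theta), grad(theta))

theta = np.array([0.5, 0.5])   # assumed starting point, purely for illustration
theta = newton_step(theta)
print(theta)                   # -> [0. 0.], the saddle point, reached in a single step
```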

 

As we can see, with Newton's method we reach the saddle point  \displaystyle {[0,0]}^{T} in one iteration, which is undesirable given that through optimization we are interested in finding local/global minima. Now let's see what happens if the initial point is chosen far from the saddle point, say  \displaystyle {[100,100]}^{T} .
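Reusing the newton_step helper from the sketch above, this time with the far-away starting point:

```python
theta = np.array([100.0, 100.0])   # initial point far from the saddle point
theta = newton_step(theta)         # newton_step as defined in the previous sketch
print(theta)                       # -> [0. 0.] again, in a single Newton update
```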

In this case too, the saddle point is reached in one iteration of the Newton update, since the function is purely quadratic and the second-order approximation is therefore exact.

Now we try our luck with gradient descent and see if we are able to escape the saddle point and march towards a minimum instead. We start with the initial point  \displaystyle {[0.5,0.5]}^T .
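Again, the original simulation code is not reproduced here, so the following is an illustrative NumPy sketch of plain gradient descent on the same function, using the starting point, the learning rate of 0.1 and the 1000 iterations discussed in the next paragraph (these settings reproduce the figures quoted there):

```python
import numpy as np

def grad(theta):
    """Gradient of f(x, y) = x^2 - y^2."""
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

theta = np.array([0.5, 0.5])   # initial point
eta = 0.1                      # learning rate
for _ in range(1000):          # number of gradient descent iterations
    theta = theta - eta * grad(theta)

print(theta)   # roughly [6.15e-98, 7.59e+78]: x collapses towards 0 while y keeps growing
```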

As we can see, the point reached after 1000 iterations of gradient descent with a learning rate of 0.1 is far from the saddle point  \displaystyle {[0,0]}^T and is heading towards the minimum at  \displaystyle {[0,\infty]}^T , since 6.15115961e-98 is practically zero while 7.58955045e+78 is large enough to be considered infinity.

Now, let's see whether gradient descent would still be able to escape the saddle point if the initial point is very close to it. Technically it should, as long as the gradient is not zero. One thing to note is that in the region around the saddle point the gradient is very small, and hence gradient descent might take a while to come out of the saddle zone. However, it will never get attracted to a saddle point or even a maximum, unlike Newton's method, which gets attracted to stationary points, be they minima, maxima or saddle points.

We start with the initial point  \displaystyle {[0.0000000001,0.0000000001]}^T , very near to the saddle point  \displaystyle {[0,0]}^T . Also, since the updates are expected to be small in the saddle-point zone, we decrease the update threshold as well so that learning still continues even when the updates are small.
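As before, this is an illustrative sketch rather than the original code; the stopping threshold tol is an assumed value, chosen small enough that the tiny updates near the saddle point do not terminate the loop early, and the cap of 1000 iterations matches the figures quoted below:

```python
import numpy as np

def grad(theta):
    """Gradient of f(x, y) = x^2 - y^2."""
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])

theta = np.array([1e-10, 1e-10])   # initial point very close to the saddle point [0, 0]
eta = 0.1                          # learning rate
tol = 1e-300                       # assumed (very small) update threshold
for _ in range(1000):
    update = eta * grad(theta)
    if np.linalg.norm(update) < tol:   # stop only if the update is essentially zero
        break
    theta = theta - update

print(theta)   # roughly [1.2e-107, 1.5e+69]: away from the saddle point, with y growing without bound
```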

 

We can see from the above simulation that even when the initial point is very near to the saddle point, gradient descent is able to escape the saddle region and heads towards the minimum at  \displaystyle {[0,\infty]}^T . However, if we observe carefully, the point reached,  \displaystyle {[1.23023192e-107, 1.51791009e+69]}^T , is several orders of magnitude behind the point  \displaystyle {[6.15115961e-98, 7.58955045e+78]}^T achieved in the earlier simulation. This is because in this simulation gradient descent spends more time coming out of the saddle-point zone, which is a plateau with very small gradients.

 

To conclude, Newton's method in its pure form can't be used for minimizing a non-convex cost function because of the presence of saddle points and local maxima along with local minima. Also, the computation of the Hessian matrix is not always tractable, let alone its inverse. All of this makes the simple gradient descent technique and its several variants a popular choice for non-convex optimization. There are modified versions of Newton's method that have been developed to optimize non-convex cost functions, which we will pick up another day. One more thing to add is that a saddle point is not always undesirable. As we will learn in Generative Adversarial Networks, the equilibrium (Nash equilibrium) solution between the Generator and the Discriminator is nothing but a saddle point, where the Discriminator and the Generator learn by playing a zero-sum min-max game against each other.

Hope you find this article useful. If so, please do share it with your friends and colleagues.
