Notations

  1. If ${\textbf A}$ is a matrix, then ${\textbf A_{i,:}}$ denotes the i-th row of ${\textbf A}$ and ${\textbf A_{:,i}}$ denotes the i-th column of ${\textbf A}$.

  2. The dot product of two vectors ${a}$ and ${b}$ can be written as ${a.b}$ or ${a^{T}b}$ or ${\lVert a\rVert\lVert b\rVert \cos\theta}$, where ${\theta}$ is the angle between ${a}$ and ${b}$ and ${\lVert . \rVert}$ denotes the Euclidean (L2) norm of a quantity.

Pre-Requisites

You can watch the amazing series Essence of Linear Algebra by 3 Blue 1 Brown instead of reading this section to gain a better understanding of the linear algebra needed for this blog. In fact, most of the material in this section is inspired by that series.

1. Eigenvectors And Eigenvalues:

Multiplying a matrix by a vector generally rotates and shrinks/stretches that vector, e.g. ${\left[ \begin{array}{ccc} 2 & 0\\ 1 & 1\\ \end{array} \right] \left[ \begin{array}{ccc} 2\\ 3\\ \end{array} \right] = \left[ \begin{array}{ccc} 4\\ 5\\ \end{array} \right]}$. Eigenvectors of a matrix ${\textbf A}$ are those vectors that are only stretched/shrunk (possibly flipped) when multiplied by ${\textbf A}$. The amount by which the matrix stretches/shrinks an eigenvector is called the eigenvalue for that particular vector, e.g. ${\left[ \begin{array}{ccc} -1 & 0\\ 0 & -1\\ \end{array} \right] \left[ \begin{array}{ccc} 2\\ 0\\ \end{array} \right] = \left[ \begin{array}{ccc} -2\\ 0\\ \end{array} \right] = -1\left[ \begin{array}{ccc} 2\\ 0\\ \end{array} \right]}$. Thus ${\left[ \begin{array}{ccc} 2\\ 0\\ \end{array} \right]}$ is an eigenvector of the matrix ${\left[ \begin{array}{ccc} -1 & 0\\ 0 & -1\\ \end{array} \right]}$ with eigenvalue -1. Notice that ${\left[ \begin{array}{ccc} 3\\ 0\\ \end{array} \right]}$ is also an eigenvector with the same eigenvalue. More generally, if ${\textbf v}$ is an eigenvector of a matrix then so is ${k\textbf v}$ (where k is a non-zero scalar) with the same eigenvalue. Since multiples of eigenvectors are also eigenvectors of the matrix, we only consider the eigenvector with unit magnitude. A matrix can have multiple eigenvalues. A matrix with only positive/negative eigenvalues is called a positive/negative definite matrix. A matrix with non-negative/non-positive eigenvalues is called a positive/negative semidefinite matrix.
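These properties are easy to check numerically. Below is a minimal sketch assuming NumPy (the library choice is mine; the matrix is just the example above):

```python
import numpy as np

A = np.array([[-1.0, 0.0],
              [0.0, -1.0]])

# np.linalg.eig returns the eigenvalues and unit-norm eigenvectors (as columns)
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)      # [-1. -1.]

v = np.array([2.0, 0.0])
print(A @ v)        # [-2.  0.] == -1 * v, so v is an eigenvector with eigenvalue -1
```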

2. Inverse Of A Matrix:

The inverse of a matrix ${\textbf A}$ is a matrix which, when multiplied by ${\textbf A}$, produces ${\textbf I}$, where ${\textbf I}$ is the identity matrix. The inverse of ${\textbf A}$ is denoted by ${\textbf A^{-1}}$.

3. Transpose Of A Matrix:

The transpose of a matrix ${\textbf A}$ is obtained by replacing ${\textbf A_{i,j}}$ with ${\textbf A_{j,i}}$ and is denoted by ${\textbf A^{T}}$, e.g. ${\left[ \begin{array}{ccc} 1 & 2\\ 3 & 4\\ \end{array} \right]^{T} = \left[ \begin{array}{ccc} 1 & 3\\ 2 & 4\\ \end{array} \right]}$. The transpose of a row vector is a column vector and vice versa, e.g. ${[1\,\,2]^{T} = \left[\begin{array}{ccc}1\\2\end{array}\right]}$.

4. Orthogonal Matrix:

A matrix ${\textbf A}$ is orthogonal if ${\textbf A^{T} = \textbf A^{-1}}$. The rows and columns of an orthogonal matrix are unit vectors, and the rows (and likewise the columns) are orthonormal to each other. Two vectors are orthogonal when their dot product is 0 (which means they are perpendicular to each other); they are orthonormal when, in addition, both have unit length. Multiplying an orthogonal matrix by a vector only rotates that vector by some angle without changing its magnitude. Note - If ${\textbf A}$ is orthogonal then so is ${\textbf A^{T}}$.

If ${q_{i}}$ denotes the i-th column of an orthogonal matrix ${\textbf Q}$, then ${\textbf Q^{T}q_{i}}$ is a column vector with 1 in the i-th position and 0 everywhere else, and ${q_{i}^{T}\textbf Q}$ is a row vector with 1 in the i-th position and 0 everywhere else. This follows directly from the orthonormality of the columns.
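A quick numerical sanity check of these properties (a sketch assuming NumPy; the orthogonal matrix here is built from a QR decomposition purely for illustration):

```python
import numpy as np

# Build an orthogonal matrix via the QR decomposition of a random matrix
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

print(np.allclose(Q.T @ Q, np.eye(3)))            # True: Q^T is the inverse of Q

v = rng.normal(size=3)
print(np.linalg.norm(v), np.linalg.norm(Q @ v))   # same magnitude: Q only rotates v

q1 = Q[:, 1]                                      # second column of Q
print(np.round(Q.T @ q1, 10))                     # ~[0, 1, 0]: 1 in that position, 0 elsewhere
```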

5. Diagonal Matrix:

A matrix is a diagonal matrix if it has 0 at every position except its diagonal, e.g. ${\textbf A = \left[ \begin{array}{ccc} 1 & 0\\ 0 & 2\\ \end{array} \right]}$. A diagonal matrix can simply be denoted by ${diag(\lambda_{1}, \lambda_{2},...,\lambda_{n})}$ where ${(\lambda_{1}, \lambda_{2},...,\lambda_{n})}$ are the diagonal elements of the matrix, e.g. ${\textbf A = diag(1,2)}$.

6. Symmetric Matrix :

A matrix ${\textbf A}$ is called symmetric if ${\textbf A = \textbf A^{T}}$, which means ${\textbf A_{i,j} = \textbf A_{j,i}}$. Symmetric matrices have an intriguing property: they can be expressed as the product of three simple matrices. If ${\textbf A}$ is symmetric then it can be shown that ${\textbf A = \textbf Q\Lambda \textbf Q^{T}}$. ${\textbf Q\Lambda \textbf Q^{T}}$ is called the eigendecomposition of ${\textbf A}$. The columns of ${\textbf Q}$ are the unit eigenvectors of ${\textbf A}$ and ${\Lambda}$ is a diagonal matrix whose diagonal elements are the eigenvalues of ${\textbf A}$, i.e. ${\Lambda = diag(\lambda_{1}, \lambda_{2},...,\lambda_{n})}$ where ${\lambda_{1}, \lambda_{2},...,\lambda_{n}}$ are the eigenvalues of ${\textbf A}$. The eigenvalue ${\lambda_{i}}$ is associated with column ${i}$ of ${\textbf Q}$. Another important thing to note is that ${\textbf Q}$ here is an orthogonal matrix.
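The eigendecomposition can be verified in a few lines (a minimal sketch assuming NumPy; the random symmetric matrix is only an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = (B + B.T) / 2                        # make a symmetric matrix

# np.linalg.eigh is meant for symmetric matrices: real eigenvalues, orthonormal eigenvectors
eigvals, Q = np.linalg.eigh(A)
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))     # True: A = Q Λ Q^T
print(np.allclose(Q.T @ Q, np.eye(3)))   # True: Q is orthogonal
```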

7. Basis and Coordinates :

Suppose you are given a vector, say ${\left[\begin{array}{ccc}2 \\ 4\end{array}\right]}$. This vector is the address of a particular point, say ${p}$, in 2-D space. Suppose you are at the origin and you want to go to ${p}$. This vector says that to reach that point you should travel 2 units in direction ${\left[\begin{array}{ccc}1 \\ 0\end{array}\right]}$ and then 4 units in direction ${\left[\begin{array}{ccc}0 \\ 1\end{array}\right]}$. Therefore, ${\left[\begin{array}{ccc}2 \\ 4\end{array}\right]}$ can be written as ${2\left[\begin{array}{ccc}1 \\ 0\end{array}\right] + 4\left[\begin{array}{ccc}0 \\ 1\end{array}\right]}$. Notice that we are giving the address of the point in space in terms of how much you should travel in the directions ${\left[\begin{array}{ccc}1 \\ 0\end{array}\right]}$ and ${\left[\begin{array}{ccc}0 \\ 1\end{array}\right]}$. But what if we want to give the address in terms of different directions, say ${\left[\begin{array}{ccc}1 \\ 1\end{array}\right]}$ and ${\left[\begin{array}{ccc}1 \\ -1\end{array}\right]}$? In this case we would say travel 3 units in direction ${\left[\begin{array}{ccc}1 \\ 1\end{array}\right]}$ and -1 units in direction ${\left[\begin{array}{ccc}1 \\ -1\end{array}\right]}$.

So the new address of ${p}$ would be ${\left[\begin{array}{ccc}3 \\ -1\end{array}\right]}$ in terms of these new directions. These directions are called basis vectors. If the basis vectors are n-dimensional then every point in n-dimensional space can be written as a linear combination of the basis vectors, e.g. point ${p}$ can be written as ${3\left[\begin{array}{ccc}1 \\ 1\end{array}\right] + -1\left[\begin{array}{ccc}1 \\ -1\end{array}\right]}$. The coordinates of ${p}$ with respect to this basis would be ${\left[\begin{array}{ccc}3 \\ -1\end{array}\right]}$. Not every collection of vectors qualifies as a basis of the space. Notice that the coordinate of p corresponding to ${\left[\begin{array}{ccc}1 \\ 1\end{array}\right]}$ (3) is greater in magnitude than the coordinate corresponding to ${\left[\begin{array}{ccc}1 \\ -1\end{array}\right]}$ (-1). This is because if we draw a vector from the origin towards ${p}$, that vector is more aligned towards ${\left[\begin{array}{ccc}1 \\ 1\end{array}\right]}$ than towards the other basis vector.
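Finding the coordinates with respect to a new basis amounts to solving a small linear system (a sketch assuming NumPy; the basis vectors are the ones from the example above):

```python
import numpy as np

p = np.array([2.0, 4.0])

# Basis vectors [1, 1] and [1, -1] placed as the columns of a matrix
B = np.column_stack(([1.0, 1.0], [1.0, -1.0]))

# The coordinates c satisfy B @ c = p
c = np.linalg.solve(B, p)
print(c)        # [ 3. -1.]
print(B @ c)    # [2. 4.] -- reconstructs p
```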

The n eigenvectors of an ${n\times n}$ symmetric matrix qualify as basis vectors for that n-dimensional space. That means that every point in that n-dimensional space can be written as a linear combination of these n eigenvectors.

1. Directional Derivative

Let's say we have a multivariable function with two variables ${f(x_{1},x_{2}) = x_{1}^{2} + x_{2}^{2}}$ where ${x_{1}}$ and ${x_{2}}$ represent the standard axes in a 2 dimensional space.

We can say that f takes in a two-dimensional vector as input where the entries of the vector are the values of ${x_{1}}$ and ${x_{2}}$ respectively. We want to know how the function changes if we are at point ${(1,1)}$ and move in the direction of ${x_{1}}$ by a little amount. We can answer this question by calculating the partial derivative of ${f}$ with respect to ${x_{1}}$, which is ${2x_{1}}$. Putting ${x_{1} = 1}$, we conclude that if we move a minuscule amount in the ${x_{1}}$ direction the function ${f}$ will change by twice that amount. The same goes for the case in which we want to take a tiny step in the direction of ${x_{2}}$. But what if we want to move in the direction ${(1,2)}$ ? For calculating the effect of this movement on ${f}$ we need to calculate the derivative of ${f}$ in the direction ${[\frac {1}{\sqrt{5}},\frac{2}{\sqrt{5}}]^{T}}$ (which is a unit vector in direction ${(1,2)}$).

More generally, to find out how nudging the input of a multivariable function in a certain direction affects the value of that function, we need to calculate the derivative of that function in that direction. This derivative is called the directional derivative of that function for that particular direction. Fortunately, it's trivially easy to calculate. The directional derivative of a function g in direction ${u}$ is just ${u^{T}\nabla_{x}g(x)}$ (matrix multiplication) or ${u . \nabla_{x}g(x)}$ (dot product), where ${\nabla_{x}g(x)}$ is called the gradient of the function g. The gradient of g is a vector containing all the partial derivatives of ${g}$ with respect to the vector ${\textbf{x}}$: element ${i}$ of ${\nabla_{x}g(x)}$ is the partial derivative of ${g}$ w.r.t ${x_{i}}$. Thus, the directional derivative of our function f in the direction ${\left[\begin{array}{ccc}\frac {1}{\sqrt{5}}\\ \frac{2}{\sqrt{5}}\end{array}\right]}$ is just ${\left[\begin{array}{ccc}\frac {1}{\sqrt{5}}\\ \frac{2}{\sqrt{5}}\end{array}\right].\left[\begin{array}{ccc}2x_{1}\\ 2x_{2}\end{array}\right] = \frac{2x_{1}}{\sqrt{5}}+\frac{4x_{2}}{\sqrt{5}}}$.
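The formula is easy to check against a finite-difference approximation (a minimal sketch assuming NumPy; the function and the direction are the ones from the example above):

```python
import numpy as np

def f(x):
    return x[0]**2 + x[1]**2

def grad_f(x):
    return np.array([2*x[0], 2*x[1]])

x = np.array([1.0, 1.0])
u = np.array([1.0, 2.0]) / np.sqrt(5)    # unit vector in direction (1, 2)

analytic = u @ grad_f(x)                 # u . grad f(x)

h = 1e-6                                 # finite-difference check with a tiny step
numeric = (f(x + h*u) - f(x)) / h

print(analytic, numeric)                 # both ~ 6/sqrt(5) ≈ 2.683
```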

More formally, the directional derivative of a function ${f}$ in direction ${u}$ is the derivative of ${f(x + \alpha u)}$ with respect to ${\alpha}$, calculated at ${\alpha = 0}$. Let ${(x + \alpha u) = p}$. Then ${\large \frac{\partial p}{\partial \alpha} \small= u}$, and therefore ${\large\frac{\partial f(x+\alpha u)}{\partial \alpha} = \frac{\partial f(p)}{\partial \alpha}}$.

Then, using the chain rule of calculus ${\large\frac{\partial f(p)}{\partial \alpha} = \frac{\partial f(p)}{\partial p} \frac{\partial p}{\partial \alpha} = \small\nabla_{p} f(p).u = u^{T}\nabla_{x+\alpha u}f(x + \alpha u)}$ - ${\textbf Eq(1)}$. Putting ${\alpha = 0}$, we get the directional derivative in direction ${u}$ = ${u^{T}\nabla_{x}f(x)}$.

Why do we take a step in the opposite direction of the gradient ?

Earlier we saw that the directional derivative of a function ${f}$ in the direction ${u}$ is ${u^{T}\nabla_xf(x)}$ or ${u.\nabla_xf(x)}$. To emphasize again, the directional derivative in a certain direction tells us how much the function changes relative to a change in input in that direction. This can give us the answer to where we should move while minimising a function. During learning, our aim is to minimise an objective function. Suppose we are at a certain point in the input space of the function ${f}$ and we want to decide where we should move so that the function decreases the fastest. In other words we want to move in the direction ${u}$ such that ${u.\nabla_xf(x)}$ is minimum. By writing ${u.\nabla_xf(x)}$ as ${\lVert u\rVert\lVert \nabla_{x}f(x)\rVert \cos\theta}$, we can see that this quantity is minimum when ${\cos\theta = -1}$ or ${\theta = 180^{\circ}}$. This means ${u}$ should be in the opposite direction of the gradient. This is the reason we take a step in the direction of the negative gradient during learning/training. Therefore, the gradient descent algorithm proposes updating the weights according to the rule ${\theta\leftarrow \theta- \epsilon \nabla_{\theta}f(\theta)}$ where ${\epsilon}$ is called the learning rate.
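In code, the update rule is just a loop (a minimal sketch assuming NumPy; the objective ${f(\theta) = \theta_{1}^{2} + \theta_{2}^{2}}$ and the learning rate are illustrative choices of mine):

```python
import numpy as np

def grad_f(theta):
    # gradient of f(theta) = theta_1^2 + theta_2^2
    return 2 * theta

theta = np.array([1.0, 1.0])
epsilon = 0.1                                # learning rate

for _ in range(50):
    theta = theta - epsilon * grad_f(theta)  # step in the direction of the negative gradient

print(theta)    # close to [0, 0], the minimiser of f
```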

2. Second derivative of a function and how it affects the gradient step

The second derivative of a function with respect to its input tells us about the curvature of that function. Let's take three functions ${p(x) = 2x, g(x) = x^{3}, h(x) = -x^{2}}$.

Let's say we are at the point ${x = 1}$. The second derivatives of the functions at this point are ${p^{''}(1) = 0, g^{''}(1) = 6, h^{''}(1) = -2}$. The value of ${p^{''}(x)}$ tells us that its gradient doesn't change at all. We can verify this by seeing that the gradient of p(x) is 2 everywhere. The second derivative of g(x) is positive for positive x, which tells us that the gradient of g increases as we move rightwards from such a point, i.e. the function curves upwards when x is positive. The negative value of the second derivative of h(x) tells us that the gradient of h(x) decreases as x increases. Therefore, the function curves downwards. We are at the point x = 1 and we want to minimise the functions p, g and h. Gradient descent tells us to move in the direction of the negative gradient. The gradients of the functions at ${x = 1}$ are ${p^{'}(1) = 2, g^{'}(1) = 3, h^{'}(1) = -2}$. The value of 2 for ${p^{'}(1)}$ tells us that if we change x by a small amount, the value of p(x) should change by twice that amount. The same goes for g(x) and h(x). Let's see if that's the case.

Let learning rate ${\epsilon = 0.001}$.

In case of p(x), we move to new point ${x = x - 2\epsilon = 1 - 0.001\times2 = 0.998}$. New value of p(x) is ${2\times0.998 = 1.996}$. Thus, we have successfully decreased the value of p(x) and we can see that it decreased by 0.004 which is exactly twice the amount of change in x. This is what we expected from the gradient.

For g(x), ${x = x- 3\epsilon = 1 - 3\times0.001 = 0.997}$. The new value of g(x) is 0.991026973. x changes by 0.003, so according to the gradient information g(x) should change by thrice that amount, i.e. 0.009. But we can see that g(x) changes by 0.008973027, which is a little less than the value 0.009 predicted by the gradient.

For h(x), ${x = x- (-2)\epsilon = 1 + 2\times0.001 = 1.002}$. The new value of h(x) is -1.004004. x changes by 0.002, so according to the gradient information h(x) should change by 0.004. We can see that h(x) changes by 0.004004, which is a little more than the value 0.004 predicted by the gradient.
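These three calculations can be reproduced in a few lines (a sketch in plain Python; the gradients at x = 1 are hard-coded from the text above):

```python
epsilon = 0.001
x = 1.0

def p(x): return 2 * x
def g(x): return x ** 3
def h(x): return -x ** 2

# gradients at x = 1: p'(1) = 2, g'(1) = 3, h'(1) = -2
for f, grad in ((p, 2.0), (g, 3.0), (h, -2.0)):
    x_new = x - epsilon * grad
    predicted = -epsilon * grad * grad      # first-order prediction of the change in f
    actual = f(x_new) - f(x)
    print(f.__name__, predicted, actual)
# p: prediction exact; g: actual change slightly smaller; h: actual change slightly larger
```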

Generally, we can see the effect of the second derivative by looking at the Taylor series approximation of a function at a point ${x^{\omicron}}$. The function f near a point ${x^{\omicron}}$ can be approximated up to three terms as follows :

${f(x)\approx f(x^{\omicron}) + \large\frac{(x-x^{\omicron})}{1!}\small f^{'}(x^{\omicron}) + \large\frac{(x-x^{\omicron})^{2}}{2!}\small f^{''}(x^\omicron)}$ - ${\textbf Eq(2)}$.

Gradient descent tells us that in order to reduce ${f(x)}$ we should go from ${x^\omicron}$ to ${x^\omicron - \epsilon f^{'}(x^\omicron)}$. Putting ${x = x^\omicron - \epsilon f^{'}(x^\omicron)}$ in Eq(2):

${f(x^\omicron - \epsilon f^{'}(x^\omicron)) \approx f(x^\omicron) - \epsilon (f^{'}(x^\omicron))^{2} + 0.5 (\epsilon f^{'}(x^\omicron))^{2}f^{''}(x^\omicron)}$.

From the above equation we can see that the second term (${\epsilon (f^{'}(x^\omicron))^{2}}$) is the reduction in the value of ${f(x^\omicron)}$ after taking a gradient descent step and the third term (${0.5 (\epsilon f^{'}(x^\omicron))^{2}f^{''}(x^\omicron)}$) is the correction we apply to account for the curvature of the function. If ${f^{''}(x^\omicron) = 0}$ then the third term vanishes, as happened in the case of the function ${p(x)}$ above. In this case the gradient information correctly predicts the change in the value of the function.

But if the second derivative of a function is non-zero, there are second order effects too which have to be accounted for. Gradient descent is unaware of these second order effects. If the second derivative of a function is very large at some point, a step taken in the opposite direction of the gradient can even increase the value of the function if the learning rate isn't small enough. If the second derivative of a function is very large at some point we need to keep the learning rate very small to avoid these second order effects, but a small learning rate leads to longer training time.

3. Hessians

In the last section we looked at the second derivative of functions which only had a scalar parameter/input. In deep learning, we have to deal with objective functions which take in a vector or matrix as input. These functions don't have a scalar value as a second derivative and thus it is difficult to judge the second order effects in these scenarios. To analyze these functions, let's take a function ${f: R^{m}\rightarrow R}$ which takes in an m-dimensional vector as input. Let's denote this m-dimensional vector as ${\textbf x = (x_{1}, x_{2},...,x_{m})}$. The gradient of this function is denoted by the m-dimensional vector ${\textbf g}$. The i-th entry of ${\textbf g}$ is equal to ${\large\frac{\partial f}{\partial x_{i}}}$. This function will have multiple second derivatives which can be collected together in a matrix called the Hessian matrix, denoted by ${\textbf H(f)}$.

${\textbf H(f)_{i,j} = \large \frac{\partial^{2}}{\partial x_{i}\partial x_{j}}\small f}$. Therefore, ${\textbf H(f)}$ will be an ${m\times m}$ matrix. Notice that ${\textbf H(f)_{i,j} = \textbf H(f)_{j,i}}$ wherever the second partial derivatives are continuous. This means ${\textbf H(f)}$ is a symmetric matrix. Therefore, ${\textbf H(f)}$ can be written as ${\textbf Q\Lambda \textbf Q^{T}}$ where the columns of ${\textbf Q}$ are eigenvectors of ${\textbf H(f)}$ and ${\Lambda}$ is a diagonal matrix with the eigenvalues of ${\textbf H(f)}$ as its diagonal elements. If ${(\lambda_{1}, \lambda_{2},...,\lambda_{m})}$ are the eigenvalues of ${\textbf H}$, then ${\Lambda = diag(\lambda_{1}, \lambda_{2},...,\lambda_{m})}$.

We want to know the second derivative in a direction ${u}$ so that we can infer how the gradient and our function will behave if we move in that direction. For that we have to calculate the second directional derivative in that particular direction. The second directional derivative in direction ${u}$ can be written as ${\large\frac{\partial^{2}}{\partial \alpha^{2}}\small f(x + \alpha u)}$ at ${\alpha = 0}$. Let ${x + \alpha u = p,\, \therefore \large\frac{\partial p}{\partial \alpha} \small= u}$.

${\large\frac{\partial^{2}}{\partial \alpha^{2}}\small f(x + \alpha u) =\large\frac{\partial}{\partial\alpha}(\frac{\partial}{\partial\alpha} \small f(p)\large) }$

Using the chain rule, ${\large\frac{\partial}{\partial\alpha}(\frac{\partial}{\partial\alpha} \small f(p)\large)= \frac{\partial}{\partial \alpha}(\frac{\partial f(p)}{\partial p}\frac {\partial p }{\partial \alpha}) = \frac{\partial}{\partial\alpha}\small(u. f^{'}(p)) = u.\large\frac{\partial}{\partial \alpha}\small f^{'}(p) = u.\large\frac{\partial f^{'}(p)}{\partial p}\frac {\partial p}{\partial\alpha} = \small u^{2}.f^{''}(x + \alpha u)}$.

Putting ${\alpha = 0}$ we can conclude that the second directional derivative of function ${f}$ in direction ${\textbf u}$ is ${\textbf u^{2}f^{''}(\textbf x)}$. Notice that ${\textbf x}$ and ${\textbf u}$ here are vectors. Therefore ${f^{''}(\textbf x)}$ is the Hessian matrix ${\textbf H}$. So, the second directional derivative in direction ${\textbf u}$ is written as ${\textbf u^{T}H\textbf u}$.

Now that we have derived the formula of the second directional derivative in direction ${u}$, let's calculate it in a certain direction, say ${q_{i}}$, where ${q_{i} = \textbf Q_{:,i}}$. In other words ${q_{i}}$ is the eigenvector of ${\textbf H}$ associated with eigenvalue ${\lambda_{i}}$.

Second directional derivative in direction ${q_{i}}$ = ${q_{i}^{T}Hq_{i} = q_{i}^{T}\textbf Q\Lambda \textbf Q^{T}q_{i}}$.

${q_{i}^{T}\textbf Q}$ gives an m-dimensional row vector which has 1 in the i-th position and 0 everywhere else. ${\textbf Q^{T}q_{i}}$ gives an m-dimensional column vector which has 1 in the i-th position and 0 everywhere else.

${\therefore q_{i}^{T}Hq_{i} = [0\,0\, ... 1\,... \,0]\Lambda[0\,0\, ... 1\,... \,0]^{T}}$. This gives back the i-th eigenvalue of ${\textbf H}$, i.e. ${\lambda_{i}}$. Thus, the value of the second directional derivative in the direction of an eigenvector is just the eigenvalue associated with that eigenvector.
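This is straightforward to verify numerically (a sketch assuming NumPy; the random symmetric matrix below simply plays the role of a Hessian):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
H = (B + B.T) / 2                      # a symmetric "Hessian-like" matrix

eigvals, Q = np.linalg.eigh(H)         # columns of Q are unit eigenvectors

for i in range(4):
    q_i = Q[:, i]
    # second directional derivative in the direction of the i-th eigenvector
    print(q_i @ H @ q_i, eigvals[i])   # the two values match
```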

What if we want to know the value of the second directional derivative in a direction ${v}$ which is not one of the eigenvectors of ${\textbf H}$? We know that the m eigenvectors of ${\textbf H}$ form a basis for the m-dimensional space. Therefore, ${v}$ can be expressed as a linear combination of these m eigenvectors as follows:

${v = a_{1}q_{1} + a_{2}q_{2} + ... + a_{m}q_{m}}$. This means that ${[a_{1}\, a_{2}\,...\,a_{m}]^{T}}$ are the coordinates of the point to which ${v}$ points, with respect to the basis consisting of the eigenvectors. Since ${v}$ is a direction, ${\lVert v\rVert = 1}$. This means ${\sqrt{a_{1}^{2}+a_{2}^{2}+...+a_{m}^{2}} = 1}$. From this we can infer that the values of ${a_{1},...,a_{m}}$ lie between -1 and 1. Another point to keep in mind is that ${\mid a_{i}\mid > \mid a_{j}\mid}$ if ${v}$ is more inclined towards ${q_{i}}$ than towards ${q_{j}}$.

Second directional derivative in direction ${v}$ = ${v^{T}Hv = (a_{1}q_{1} +... + a_{m}q_{m})^{T}Q\Lambda Q^{T}(a_{1}q_{1} +... + a_{m}q_{m})}$

Since vector-matrix multiplication is distributive,

${v^{T}Hv = (a_{1}q_{1}^{T}Q + ...+ a_{m}q_{m}^{T}Q)\,\Lambda\,(a_{1}Q^{T}q_{1} +...+a_{m}Q^{T}q_{m})}$

${= \large(\small a_{1}[1\, 0\,.. \,0] + ... + a_{m}[0\, 0\, ...\, 1]\large)\small\, \Lambda\, \large(\small a_{1}[1 \,0\,.. \,0]^{T} + ... + a_{m}[\,0\, 0\, ...\, 1]^{T}\large)\small = [a_{1}\,a_{2} \,...\, a_{m}]\Lambda [a_{1}\,a_{2} \,...\, a_{m}]^{T}}$

Solving further, ${[a_{1}\,a_{2} \,...\, a_{m}]\Lambda [a_{1}\,a_{2} \,...\, a_{m}]^{T} = [a_{1}\,a_{2} \,...\, a_{m}][a_{1}\lambda_{1}\,\,a_{2}\lambda_{2} \,\,...\,\, a_{m}\lambda_{m}]^{T} = a_{1}^{2}\lambda_{1} + a_{2}^{2}\lambda_{2} + .. + a_{m}^{2}\lambda_{m}}$

${\therefore v^{T}Hv = a_{1}^{2}\lambda_{1} + a_{2}^{2}\lambda_{2} + .. + a_{m}^{2}\lambda_{m}}$.

This shows that the second directional derivative in any direction is a weighted average of all the eigenvalues of the Hessian, where the weights lie between 0 and 1 and sum to 1. ${\lambda_{i}}$ receives more weight than ${\lambda_{j}}$ if ${v}$ is more inclined towards ${q_{i}}$ than towards ${q_{j}}$.

Let ${\lambda_{max}}$ and ${\lambda_{min}}$ be the maximum and minimum eigenvalues of the Hessian respectively.

Then, ${a_{1}^{2}\lambda_{1} + ... + a_{m}^{2}\lambda_{m} \leq a_{1}^{2}\lambda_{max} + \,...\, + a_{m}^{2}\lambda_{max} = (a_{1}^{2}+ \,...\, + a_{m}^{2})\lambda_{max} = \lambda_{max}}$

Also, ${a_{1}^{2}\lambda_{1} + ... + a_{m}^{2}\lambda_{m} \geq a_{1}^{2}\lambda_{min} + \,...\, + a_{m}^{2}\lambda_{min} = (a_{1}^{2}+ \,...\, + a_{m}^{2})\lambda_{min} = \lambda_{min}}$

Therefore, the second directional derivative in any direction cannot be greater than ${\lambda_{max}}$ and cannot be less than ${\lambda_{min}}$.
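We can check these bounds empirically by evaluating ${v^{T}Hv}$ for many random unit directions (a sketch assuming NumPy; the matrix and the number of samples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
H = (B + B.T) / 2
eigvals = np.linalg.eigvalsh(H)

# second directional derivatives v^T H v for 1000 random unit directions v
vs = rng.normal(size=(1000, 5))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)
second_derivs = np.einsum('ij,jk,ik->i', vs, H, vs)

print(eigvals.min(), second_derivs.min(), second_derivs.max(), eigvals.max())
# every v^T H v lies between the smallest and the largest eigenvalue
```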

What's The Worst That Could Happen ?

Earlier we approximated ${f: R\rightarrow R}$ at a point ${x^\omicron}$. We do that again here but for a function ${f: R^{m}\rightarrow R}$. Using the Taylor series expansion, ${f(\textbf x)}$ can be approximated at a point ${\textbf x^\omicron}$ up to three terms as :

${f(\textbf x)\approx f(\textbf x^\omicron) + \large\frac{(\textbf x - \textbf x^\omicron)^{T}}{1!}\small\textbf g + \large\frac{1}{2!}\small(\textbf x-\textbf x^\omicron)^{T}H(\textbf x-\textbf x^\omicron)\,\,\,\,\textbf Eq(3)}$ , where ${\textbf g}$ and ${\textbf H}$ are the gradient vector and Hessian matrix of ${f}$ at ${\textbf x^\omicron}$ respectively.

Gradient descent advises us to go to ${\textbf x = \textbf x^\omicron - \epsilon \textbf g}$. Putting ${\textbf x^\omicron - \epsilon \textbf g}$ in ${\textbf Eq(3)}$

${f(\textbf x^\omicron - \epsilon\textbf g) \approx f(\textbf x^\omicron) - \epsilon \textbf g^{T}\textbf g + \frac{1}{2}\epsilon^{2}\textbf g^{T}\textbf H\textbf g\,\,\,\,Eq(4)}$.

Again, the second term here is the reduction in the value of the function by going to ${\textbf x^\omicron - \epsilon \textbf g}$ and the third term is the correction applied to account for curvature. An eagle-eyed reader will notice that in the third term the quantity ${\textbf g^{T}\textbf H\textbf g}$ is ${\lVert\textbf g\rVert^{2}}$ times the second directional derivative in the direction ${\textbf g}$, whose maximum value is the maximum eigenvalue of ${\textbf H}$. The worst case scenario is when ${\textbf g}$ aligns with the eigenvector of ${\textbf H}$ corresponding to the maximum eigenvalue. In this case, ${\textbf g^{T}\textbf H\textbf g = \lambda_{max}\lVert\textbf g\rVert^{2}}$ and if ${\epsilon}$ isn't small enough then the third term can increase the value of ${f}$.

In fact, we can solve for the optimal ${\epsilon}$. Let the optimal value of the learning rate be ${\epsilon^{\ast}}$. Then ${\large\frac{\partial}{\partial \epsilon}\small f(\textbf x^{\omicron} - \epsilon \textbf g)}$ must be 0 at ${\epsilon = \epsilon^{\ast}}$.

${\therefore -\textbf g^{T}\textbf g + \epsilon^{\ast} \textbf g^{T}H\textbf g = 0 \implies \epsilon^{\ast} = \large\frac{\textbf g^{T}\textbf g}{\textbf g^{T}H\textbf g}}$. From here, we can see that in the worst case scenario the optimal learning rate is ${\large\frac{1}{\lambda_{max}}}$. If the maximum eigenvalue of the Hessian is large then this learning rate is too small to make significant progress.
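The optimal step size and the worst case can be checked numerically (a sketch assuming NumPy; the positive definite matrix and the gradient vector are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
H = B @ B.T + np.eye(4)             # a positive definite "Hessian"
g = rng.normal(size=4)              # some gradient vector

eps_star = (g @ g) / (g @ H @ g)    # optimal step size along -g for the quadratic model
print(eps_star)

# worst case: g aligned with the eigenvector of the largest eigenvalue
eigvals, Q = np.linalg.eigh(H)      # eigenvalues sorted in ascending order
g_worst = Q[:, -1]
print((g_worst @ g_worst) / (g_worst @ H @ g_worst), 1 / eigvals[-1])   # equal
```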

Condition Number

(This part of the blog is inspired by the brilliant Distill post Why Momentum Really Works.)

The condition number of a (symmetric, positive definite) matrix is the ratio of its maximum and minimum eigenvalues. A matrix is said to be poorly conditioned if its condition number is very large. A Hessian matrix with a poor condition number means that in directions inclined towards the eigenvectors corresponding to the large eigenvalues the gradient changes very fast, while in directions inclined towards the eigenvectors corresponding to the small eigenvalues the gradient changes slowly. This is problematic since gradient descent doesn't know about this effect. We would prefer to go in directions where the gradient remains negative for longer (because a negative gradient means the value of the function decreases by going in those directions).

We can see the ill effects of a poor condition number by looking at a concrete example. Let's minimise the function ${f(w) = \frac{1}{2}w^{T}Aw - b^{T}w}$ where ${w\in R^{m}}$ and A is symmetric and invertible. Let the optimal value of ${w}$ for this function be ${w^\ast}$. We can calculate the value of ${w^\ast}$ by using the fact that ${\large\frac{\partial}{\partial w}\small f}$ is 0 at ${w^\ast}$. ${\therefore Aw^\ast - b = 0 \implies w^\ast = A^{-1}b}$. We want to arrive at ${w^\ast}$ using gradient descent. The gradient ${g}$ of ${f}$ is ${Aw-b}$ and the Hessian matrix is ${A}$.

We start from the point ${w^\omicron}$. Let's say at the t-th iteration we are at ${w^t}$. The gradient advises us to go in the opposite direction of ${Aw^t - b}$.

${\therefore\, w^{t+1} = w^t - \epsilon(Aw^t - b)\,\,\,\, Eq(5)}$. We can calculate how far we are from the optimal value at time t by calculating ${w^t - w^\ast}$. Subtracting ${w^\ast}$ from both sides of ${Eq(5)}$:

${w^{t+1}- w^{\ast} = (w^t - w^\ast) - \epsilon(Aw^t - b)}$

Subtracting and adding ${w^\ast}$ inside ${Aw^t}$:

${w^{t+1}- w^{\ast}= (w^t - w^\ast) - \epsilon\large(\small A(w^t-w^\ast + w^\ast) - b\large) \small = (I -\epsilon A)(w^t - w^\ast) - \epsilon Aw^\ast + \epsilon b}$ where ${I}$ is the identity matrix. Since ${Aw^\ast = b}$, the last two terms cancel out. This leaves us with:

${w^{t+1}- w^{\ast} = (I -\epsilon A)(w^t - w^\ast)\,\,\,\, Eq(6)}$.

Since ${A}$ is a symmetric matrix, we can write the eigendecomposition of ${A}$ as ${\textbf Q\Lambda \textbf Q^{T}}$. Since ${\textbf Q}$ is an orthogonal matrix, ${\textbf Q\textbf Q^{T} = I}$. We know that multiplying a vector by an orthogonal matrix simply rotates the vector without changing its magnitude. In the case of the vector ${w^{t+1}- w^{\ast}}$, we are only concerned with its magnitude because that tells us how far we are from the optimal point. So we can define a new variable ${x^{t} = Q^{T}(w^{t}- w^{\ast})}$. ${x^t}$ has the same magnitude as ${(w^{t}- w^{\ast})}$. We do this because it will yield an easy interpretation in the end.

Multiplying both sides of ${Eq(6)}$ by ${Q^{T}}$ and replacing ${A}$ with its eigendecomposition we get:

${Q^{T}(w^{t+1}- w^{\ast}) = Q^{T}(QQ^{T} - \epsilon Q\Lambda Q^{T})(w^t - w^\ast) = Q^{T}Q(I - \epsilon \Lambda)Q^{T}(w^t - w^\ast) = (I - \epsilon \Lambda)x^t}$

${x^{t+1} = (I - \epsilon \Lambda)x^t}$. We can write ${x^{t+1}}$ in terms of ${x^\omicron}$ as:

${x^{t+1} = (I-\epsilon \Lambda)^{t+1}x^\omicron = diag([1-\epsilon\lambda_{1},...,1-\epsilon\lambda_{m}])^{t+1}x^\omicron}$.

If a matrix ${B}$ is a diagonal matrix denoted by ${diag([b_{1}, b_{2},...,b_{n}])}$, then ${B^k}$ is equal to ${diag([b_{1}^k, b_{2}^k,...,b_{n}^k])}$.

${\therefore x^{t+1} = diag([(1-\epsilon\lambda_{1})^{t+1},...,(1-\epsilon\lambda_{m})^{t+1}])x^\omicron}$.

Let ${x^\omicron_{i}}$ be the i-th element of the vector ${x^\omicron}$. Then:

${x^{t+1} = \left[\begin{array}{ccc}x^{t+1}_{1}\\...\\x^{t+1}_{m}\end{array}\right] = \left[\begin{array}{ccc}(1-\epsilon\lambda_{1})^{t+1} x^\omicron_{1}\\...\\(1-\epsilon\lambda_{m})^{t+1} x^\omicron_{m}\end{array}\right]}$

This result yields a fascinating interpretation. At any iteration ${t}$, we want our error ${x^{t}}$ to be low. The i-th component of the error, ${x^{t}_{i}}$, is equal to ${(1-\epsilon\lambda_{i})^{t}x^{\omicron}_{i}}$. If ${(1-\epsilon\lambda_{i})^{t}}$ is small then the error in that component is small. More specifically, if ${\mid1-\epsilon\lambda_{i}\mid}$ is well below 1, the i-th component converges quickly; if ${\mid1-\epsilon\lambda_{i}\mid}$ is close to 1, convergence is slow. From here we can see that (for a stable learning rate) components corresponding to larger eigenvalues converge faster and those corresponding to smaller eigenvalues struggle till later iterations.
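A small simulation makes this concrete (a sketch assuming NumPy; the diagonal ${A}$ with eigenvalues 1 and 100 is chosen purely to make the poor conditioning obvious):

```python
import numpy as np

# f(w) = 1/2 w^T A w - b^T w with an ill-conditioned A (eigenvalues 1 and 100)
A = np.diag([1.0, 100.0])
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(A, b)

eps = 1.0 / 100.0            # roughly 1 / lambda_max
w = np.zeros(2)
for _ in range(100):
    w = w - eps * (A @ w - b)

print(np.abs(w - w_star))
# the component along the eigenvalue-100 direction has essentially converged,
# while the component along the eigenvalue-1 direction still has a large error
```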

Let ${\lambda_{min}, \lambda_{max}}$ be the smallest and largest eigenvalues of ${A}$ respectively. The overall convergence rate is determined by the error component which converges the slowest (i.e. the one whose rate is closest to 1). The slowest error component will be either ${\mid1-\epsilon\lambda_{min}\mid}$ or ${\mid1-\epsilon\lambda_{max}\mid}$ (because if ${\lambda_{min}\approx 0}$, then ${\mid1-\epsilon\lambda_{min}\mid \approx 1}$, or if ${\epsilon =0.01}$ and ${\lambda_{max} = 200}$ then ${\mid1-\epsilon\lambda_{max}\mid\approx 1}$). We can adjust the learning rate so that the error components corresponding to the smallest and largest eigenvalues converge at the same rate. Let the optimal learning rate be ${\epsilon^\ast}$.

Then, ${\mid1-\epsilon^\ast\lambda_{min}\mid} = {\mid1-\epsilon^\ast\lambda_{max}\mid}$

${1-\epsilon^\ast\lambda_{min} =\epsilon^\ast\lambda_{max}-1\implies \epsilon^\ast = \large\frac{2}{\lambda_{min} + \lambda_{max}}}$

Then, the optimal convergence rate is ${1-\large\frac{2\lambda_{min}}{\lambda_{min} + \lambda_{max}}} = {\large\frac{\lambda_{max}-\lambda_{min}}{\lambda_{max} + \lambda_{min}}}$.

Dividing the numerator and the denominator by ${\lambda_{min}}$, optimal rate = ${\frac{\large\frac{\lambda_{max}}{\lambda_{min}} - 1}{\large\frac{\lambda_{max}}{\lambda_{min}} + 1}}$. Let the condition number be denoted by ${\kappa}$.

${\therefore}$ Optimal rate = ${\large\frac{\kappa-1}{\kappa+1}}$. Since this rate is the factor by which the error shrinks each iteration, a value closer to 1 means slower convergence. This goes to show that the larger the condition number, the slower gradient descent will be.

This analysis again showed how the eigenvalues of the Hessian matrix affect the convergence rate and the optimal learning rate of our algorithms.

Saddle Points

Suppose we are at a certain point where the gradient vector ${g}$ is ${0}$ (a vector is 0 when all of its components are 0) and all the eigenvalues of the Hessian are negative. What does that mean ? It means that in whichever direction we go, the gradient decreases. Since the gradient is already 0, its components will become negative if we move in any direction. Negative components further mean that moving in any direction will lead to a decrease in the value of the function. This is an interesting case. Since the value of the function decreases regardless of which direction we move in, we must be at a point where the value of the function is greater than the value of the function at its neighbouring points. This point is called a local maximum or a global maximum.

The opposite happens when the gradient vector ${g}$ is ${0}$ and all the eigenvalues of the Hessian are positive. This means that we are at a point where the value of the function is less than the value of the function at its neighbouring points. This point is called a local minimum or a global minimum.

The procedure where we check the signs of all the eigenvalues of the Hessian at points where the gradient vector is 0 is called the second derivative test. The second derivative test helps us determine whether the point is a local maximum or a local minimum.

But what happens if the eigenvalues of the Hessian don't all have the same sign? In this case the second-derivative test is inconclusive. If some eigenvalues are positive, it means that the derivative, and consequently the value of the function, increases along the corresponding directions, while the value of the function decreases in the directions corresponding to eigenvectors with negative eigenvalues. Such a point is called a saddle point of the function.

Let's take an example for better clarification. Let ${f: R^{2}\rightarrow R}$ be a function defined as ${f(x_{1}, x_{2}) = x_{1}^{2} - x_2^2}$. The gradient vector of this function is ${g = \left[\begin{array}{ccc}\large\frac{\partial f}{\partial x_1}\\\large\frac{\partial f}{\partial x_2}\end{array}\right] = \left[\begin{array}{ccc}2x_{1}\\-2x_2\end{array}\right]}$. The Hessian for this function is ${H = \left[\begin{array}{ccc} \large\frac{\partial^{2} f}{\partial x_1\partial x_1} & \large\frac{\partial^{2} f}{\partial x_1\partial x_2}\\ \large\frac{\partial^{2} f}{\partial x_2\partial x_1} & \large\frac{\partial^{2} f}{\partial x_2\partial x_2}\end{array}\right] = \left[ \begin{array}{ccc} 2 & 0\\ 0 & -2\\ \end{array} \right]}$.

The gradient ${g}$ is 0 at the origin. We can see that the vectors ${\left[\begin{array}{ccc} 1\\0\end{array}\right]}$ and ${\left[\begin{array}{ccc} 0\\-1\end{array}\right]}$ are eigenvectors of ${H}$ with eigenvalues 2 and -2 respectively (because when multiplied by the Hessian these vectors are scaled by 2 and -2 respectively). Therefore the origin is a saddle point, because the value of the function increases if we go in the direction of ${x_1}$ and decreases in the direction of ${x_2}$.
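We can confirm the saddle numerically (a sketch assuming NumPy; the function is the one defined above):

```python
import numpy as np

# f(x1, x2) = x1^2 - x2^2
f = lambda x: x[0]**2 - x[1]**2
grad = lambda x: np.array([2*x[0], -2*x[1]])
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

print(grad(np.zeros(2)))        # the zero vector -- the origin is a critical point
print(np.linalg.eigvalsh(H))    # [-2.  2.] -- mixed signs, so the origin is a saddle point

# moving along x1 increases f, moving along x2 decreases it
print(f(np.array([0.1, 0.0])), f(np.array([0.0, 0.1])))   # ~0.01 and ~-0.01
```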

This can be problematic since a saddle point is a maximum for some cross sections (the ${x_{2}-f}$ cross section in our case) and a minimum for others (the ${x_{1}-f}$ cross section in our case). The gradient is 0 at the saddle point and therefore we can't move anywhere using gradient descent. If the value of the function at the saddle point is large, then gradient descent can't help us escape from that point.