Numerical Computation in Machine Learning (1)

Machine learning algorithms usually require a great deal of numerical computation—that is, solving for approximate values iteratively rather than obtaining analytical solutions. These algorithms typically involve optimization and solving systems of linear equations. Representing various floating-point numbers with a finite number of bits on a computer carries inherent error, so we need certain methods to guarantee the precision of our computations.

Overflow and Underflow

Overflow is destructive: it occurs when numbers of large magnitude are approximated as -\infty or \infty, which causes the numbers to become non-numbers (NaN) in subsequent computations. Underflow is equally devastating: it occurs when a floating-point number that is very small gets rounded to 0. Sometimes 0 and non-zero values have completely different properties, which can lead to errors such as division by zero.

Overflow and underflow can be avoided using the softmax function:

softmax(x)i=exp(xi)j=1nexp(xj)softmax(\vec x)_{i}=\dfrac{exp(x_{i})}{\sum^{n}_{j=1}exp(x_{j})}

For the sake of discussion, suppose every element xi=cx_{i}=c in x\vec x. When cc is very large, the numerator may overflow; likewise, when cc is very small, the denominator may underflow. Both problems can be solved by computing softmax(z)softmax(\vec z), where z=xmax(xi)\vec z=\vec x-max(x_{i}). This way, the largest element in z\vec z is 0, so the numerator cannot overflow; as for the denominator, at least one of its terms equals 1 while all other elements are non-negative, so the denominator cannot underflow either. This solves the overflow problem.

Poor Conditioning

The condition number measures how quickly a function changes with respect to small changes in its input. Slight perturbations of the input that cause the function to change rapidly are problematic for numerical computation. For the function f(x)=A1xf(x)=A^{-1}x, when ARnnA\in R^{n*n} has an eigenvalue decomposition, its condition number is:

maxi,jλiλjmax_{i,j}|\frac{\lambda_{i}}{\lambda_{j}}|

That is, the ratio of the largest eigenvalue to the smallest eigenvalue. When this number is large, f(x)f(x) is very sensitive to error in the input xx.

Gradient-Based Optimization

Deep learning algorithms typically use gradient-based optimization algorithms. Since plenty of material already exists on traditional gradient descent, I won’t repeat it here; instead, I will focus on other knowledge related to gradient methods, mainly the Jacobian matrix and the Hessian matrix.

Sometimes we need to compute all the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is called the Jacobian matrix. That is, if we have a function f:RmRnf:R^{m}\to R^{n}, the Jacobian matrix of ff is Ji,j=xjf(x)iJ_{i,j}=\frac{\partial}{\partial x_{j}}f(\vec x)_{i}.

For traditional gradient descent we only use first-order derivatives. Sometimes, when using other optimization algorithms (such as Newton’s method), we need second-order derivatives. The matrix composed of all second-order derivatives is called the Hessian matrix:

H(f)(x)i,j=2xixjf(x)H(f)(\vec x)_{i,j}=\frac{\partial^{2}}{\partial x_{i}\partial x_{j}}f(\vec x)

In the context of deep learning, the Hessian matrix is symmetric almost everywhere. The second-order derivative in a particular direction d\vec d can be written as dTHd\vec d^{T}H\vec d. When d\vec d is an eigenvector of HH, the second-order derivative in that direction is the corresponding eigenvalue.

One important use of the Hessian matrix is the second-derivative test, which determines whether the current point is a maximum, a minimum, or a saddle point. The one-dimensional case is simple. For the multi-dimensional case, we determine the nature of a critical point by examining the eigenvalues of the Hessian matrix: when the Hessian matrix is positive definite, the critical point is a local minimum, and likewise when it is negative definite, the point is a local maximum. If at least one of the Hessian’s eigenvalues is positive and at least one is negative, then xx is a local maximum within one plane and a local minimum within another, making it a saddle point.

This is the first article in the series on numerical computation in machine learning. To be continued…