Numerical Computation in Machine Learning (1)

Machine learning algorithms usually require a great deal of numerical computation—that is, solving for approximate values iteratively rather than obtaining analytical solutions. These algorithms typically involve optimization and solving systems of linear equations. Representing various floating-point numbers with a finite number of bits on a computer carries inherent error, so we need certain methods to guarantee the precision of our computations.

Overflow and Underflow

Overflow is destructive: it occurs when numbers of large magnitude are approximated as $-\infty$ or $\infty$ , which causes the numbers to become non-numbers (NaN) in subsequent computations. Underflow is equally devastating: it occurs when a floating-point number that is very small gets rounded to 0. Sometimes 0 and non-zero values have completely different properties, which can lead to errors such as division by zero.

Overflow and underflow can be avoided using the softmax function:

softmax(\vec x)_{i}=\dfrac{exp(x_{i})}{\sum^{n}_{j=1}exp(x_{j})}

For the sake of discussion, suppose every element $x_{i}=c$ in $\vec x$ . When $c$ is very large, the numerator may overflow; likewise, when $c$ is very small, the denominator may underflow. Both problems can be solved by computing $softmax(\vec z)$ , where $\vec z=\vec x-max(x_{i})$ . This way, the largest element in $\vec z$ is 0, so the numerator cannot overflow; as for the denominator, at least one of its terms equals 1 while all other elements are non-negative, so the denominator cannot underflow either. This solves the overflow problem.

Poor Conditioning

The condition number measures how quickly a function changes with respect to small changes in its input. Slight perturbations of the input that cause the function to change rapidly are problematic for numerical computation. For the function $f(x)=A^{-1}x$ , when $A\in R^{n*n}$ has an eigenvalue decomposition, its condition number is:

max_{i,j}|\frac{\lambda_{i}}{\lambda_{j}}|

That is, the ratio of the largest eigenvalue to the smallest eigenvalue. When this number is large, $f(x)$ is very sensitive to error in the input $x$ .

Gradient-Based Optimization

Deep learning algorithms typically use gradient-based optimization algorithms. Since plenty of material already exists on traditional gradient descent, I won’t repeat it here; instead, I will focus on other knowledge related to gradient methods, mainly the Jacobian matrix and the Hessian matrix.

Sometimes we need to compute all the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is called the Jacobian matrix. That is, if we have a function $f:R^{m}\to R^{n}$ , the Jacobian matrix of $f$ is $J_{i,j}=\frac{\partial}{\partial x_{j}}f(\vec x)_{i}$ .

For traditional gradient descent we only use first-order derivatives. Sometimes, when using other optimization algorithms (such as Newton’s method), we need second-order derivatives. The matrix composed of all second-order derivatives is called the Hessian matrix:

H(f)(\vec x)_{i,j}=\frac{\partial^{2}}{\partial x_{i}\partial x_{j}}f(\vec x)

In the context of deep learning, the Hessian matrix is symmetric almost everywhere. The second-order derivative in a particular direction $\vec d$ can be written as $\vec d^{T}H\vec d$ . When $\vec d$ is an eigenvector of $H$ , the second-order derivative in that direction is the corresponding eigenvalue.

One important use of the Hessian matrix is the second-derivative test, which determines whether the current point is a maximum, a minimum, or a saddle point. The one-dimensional case is simple. For the multi-dimensional case, we determine the nature of a critical point by examining the eigenvalues of the Hessian matrix: when the Hessian matrix is positive definite, the critical point is a local minimum, and likewise when it is negative definite, the point is a local maximum. If at least one of the Hessian’s eigenvalues is positive and at least one is negative, then $x$ is a local maximum within one plane and a local minimum within another, making it a saddle point.

This is the first article in the series on numerical computation in machine learning. To be continued…

Technology

2018 · 07 · 14