Numerical Computation in Machine Learning (1)
Machine learning algorithms usually require a great deal of numerical computation—that is, solving for approximate values iteratively rather than obtaining analytical solutions. These algorithms typically involve optimization and solving systems of linear equations. Representing various floating-point numbers with a finite number of bits on a computer carries inherent error, so we need certain methods to guarantee the precision of our computations.
Overflow and Underflow
Overflow is destructive: it occurs when numbers of large magnitude are approximated as or , which causes the numbers to become non-numbers (NaN) in subsequent computations. Underflow is equally devastating: it occurs when a floating-point number that is very small gets rounded to 0. Sometimes 0 and non-zero values have completely different properties, which can lead to errors such as division by zero.
Overflow and underflow can be avoided using the softmax function:
For the sake of discussion, suppose every element in . When is very large, the numerator may overflow; likewise, when is very small, the denominator may underflow. Both problems can be solved by computing , where . This way, the largest element in is 0, so the numerator cannot overflow; as for the denominator, at least one of its terms equals 1 while all other elements are non-negative, so the denominator cannot underflow either. This solves the overflow problem.
Poor Conditioning
The condition number measures how quickly a function changes with respect to small changes in its input. Slight perturbations of the input that cause the function to change rapidly are problematic for numerical computation. For the function , when has an eigenvalue decomposition, its condition number is:
That is, the ratio of the largest eigenvalue to the smallest eigenvalue. When this number is large, is very sensitive to error in the input .
Gradient-Based Optimization
Deep learning algorithms typically use gradient-based optimization algorithms. Since plenty of material already exists on traditional gradient descent, I won’t repeat it here; instead, I will focus on other knowledge related to gradient methods, mainly the Jacobian matrix and the Hessian matrix.
Sometimes we need to compute all the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is called the Jacobian matrix. That is, if we have a function , the Jacobian matrix of is .
For traditional gradient descent we only use first-order derivatives. Sometimes, when using other optimization algorithms (such as Newton’s method), we need second-order derivatives. The matrix composed of all second-order derivatives is called the Hessian matrix:
In the context of deep learning, the Hessian matrix is symmetric almost everywhere. The second-order derivative in a particular direction can be written as . When is an eigenvector of , the second-order derivative in that direction is the corresponding eigenvalue.
One important use of the Hessian matrix is the second-derivative test, which determines whether the current point is a maximum, a minimum, or a saddle point. The one-dimensional case is simple. For the multi-dimensional case, we determine the nature of a critical point by examining the eigenvalues of the Hessian matrix: when the Hessian matrix is positive definite, the critical point is a local minimum, and likewise when it is negative definite, the point is a local maximum. If at least one of the Hessian’s eigenvalues is positive and at least one is negative, then is a local maximum within one plane and a local minimum within another, making it a saddle point.
This is the first article in the series on numerical computation in machine learning. To be continued…