Numerical Computation in Machine Learning (1)

Machine learning algorithms often require extensive numerical computation: they typically estimate solutions iteratively rather than deriving them analytically, and the computations usually involve optimization or solving systems of linear equations. Because a computer represents real numbers with a finite number of bits, rounding error is unavoidable, so we need techniques that keep these computations numerically accurate.

Overflow and Underflow

Overflow is destructive: it occurs when numbers of large magnitude are approximated as $-\infty$ or $\infty$, which then propagate as non-numeric values through subsequent calculations. Underflow is similarly destructive: it occurs when floating-point numbers very close to zero are rounded down to zero. Zero often behaves entirely differently from a small non-zero value, potentially causing errors such as division by zero.
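
To make this concrete, here is a minimal NumPy sketch (not from the original article) showing both failure modes with 32-bit floats:

```python
import numpy as np

# Overflow: exp(1000) exceeds the float32 range and becomes inf
# (NumPy emits a runtime warning).
x = np.float32(1000.0)
print(np.exp(x))    # inf

# Underflow: exp(-1000) rounds to 0.0.
print(np.exp(-x))   # 0.0

# A downstream 0/0 then produces nan, poisoning later computations.
print(np.exp(-x) / np.exp(-x))   # nan
```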

The softmax function is a classic example of a function that must be stabilized against overflow and underflow:

$$\mathrm{softmax}(\vec x)_{i}=\dfrac{\exp(x_{i})}{\sum^{n}_{j=1}\exp(x_{j})}$$

For discussion, assume every element of $\vec x$ equals a constant $c$, i.e. $x_{i}=c$. When $c$ is a large positive number, the numerator $\exp(c)$ may overflow; when $c$ is a large negative number, the denominator may underflow to zero, making the division undefined. Both issues are resolved by computing $\mathrm{softmax}(\vec z)$ instead, where $\vec z=\vec x-\max_{i}(x_{i})$. The maximum element of $\vec z$ is 0, so the numerator cannot overflow. As for the denominator, at least one term equals $\exp(0)=1$ and all other terms are non-negative, so it cannot underflow to zero. This resolves both the overflow and the underflow problem.
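
As an illustration, here is a minimal NumPy sketch of the stabilized computation; the test vector is made up, and a production implementation would live in a library rather than be hand-rolled like this:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shift the inputs by their maximum first."""
    z = x - np.max(x)        # max(z) == 0, so exp(z) cannot overflow
    e = np.exp(z)
    return e / np.sum(e)     # the sum contains exp(0) == 1, so it cannot underflow to 0

x = np.array([1000.0, 1001.0, 1002.0])
print(np.exp(x) / np.sum(np.exp(x)))   # naive version: [nan, nan, nan]
print(softmax(x))                      # stable version: [0.090..., 0.244..., 0.665...]
```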

Ill-conditioned Problems

The condition number indicates how sensitive a function is to small changes in its input. Functions that change rapidly under slight input perturbations are problematic for numerical computation, because rounding errors in the input can be amplified into large changes in the output. For the function $f(\vec x)=A^{-1}\vec x$, where $A\in\mathbb{R}^{n\times n}$ has an eigenvalue decomposition, the condition number is:

$$\max_{i,j}\left|\frac{\lambda_{i}}{\lambda_{j}}\right|$$

This is the ratio of the largest-magnitude eigenvalue to the smallest-magnitude one. When this number is large, $f(\vec x)$ is highly sensitive to errors in the input $\vec x$.
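
To illustrate, here is a small NumPy sketch; the matrix `A` and the perturbation are contrived purely for demonstration:

```python
import numpy as np

# A symmetric matrix with eigenvalues 1 and 1e-8: its condition number is 1e8.
A = np.array([[1.0, 0.0],
              [0.0, 1e-8]])

eigvals = np.linalg.eigvals(A)
print(np.max(np.abs(eigvals)) / np.min(np.abs(eigvals)))   # 1e8
print(np.linalg.cond(A))                                   # matches for symmetric A

# Solving A x = b: a tiny perturbation of b produces a large change in the solution.
b  = np.array([1.0, 1e-8])
db = np.array([0.0, 1e-9])          # ||db|| = 1e-9
x1 = np.linalg.solve(A, b)
x2 = np.linalg.solve(A, b + db)
print(np.linalg.norm(x2 - x1))      # 0.1, eight orders of magnitude larger than ||db||
```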

Gradient-based Optimization

Deep learning algorithms often rely on gradient-based optimization. Traditional gradient descent is well documented elsewhere, so it is not elaborated here; instead, we focus on other aspects of gradient-based methods, mainly the Jacobian and Hessian matrices.

Sometimes we need all partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is called the Jacobian matrix. For a function $f:\mathbb{R}^{m}\to\mathbb{R}^{n}$, the Jacobian matrix $J\in\mathbb{R}^{n\times m}$ of $f$ is defined by $J_{i,j}=\frac{\partial}{\partial x_{j}}f(\vec x)_{i}$.
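
As a rough illustration, the sketch below approximates the Jacobian with forward differences; the helper name `numerical_jacobian` and the example function are introduced here purely for the demo:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Forward-difference approximation of the Jacobian of f: R^m -> R^n at x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))          # shape n x m
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (f(x_step) - fx) / eps     # column j holds d f_i / d x_j
    return J

# f(x, y) = (x*y, x + y) has the exact Jacobian [[y, x], [1, 1]].
f = lambda v: np.array([v[0] * v[1], v[0] + v[1]])
print(numerical_jacobian(f, np.array([2.0, 3.0])))   # approximately [[3, 2], [1, 1]]
```

In practice, deep learning frameworks compute such derivatives with automatic differentiation rather than finite differences, which are shown here only because they make the definition explicit.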

Traditional gradient descent algorithms only use first-order derivatives. However, other optimization algorithms (like Newton’s method) may require second-order derivatives. The matrix of all second-order derivatives is called the Hessian matrix:

$$H(f)(\vec x)_{i,j}=\frac{\partial^{2}}{\partial x_{i}\,\partial x_{j}}f(\vec x)$$

In the context of deep learning, the Hessian matrix is almost always real and symmetric. For a unit vector $\vec d$, the second derivative of $f$ in the direction $\vec d$ is given by $\vec d^{T}H\vec d$. When $\vec d$ is an eigenvector of $H$, the second derivative in that direction equals the corresponding eigenvalue.
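
A quick NumPy check of this claim, using the hand-picked quadratic $f(x,y)=x^{2}+3xy+y^{2}$, whose Hessian is constant and symmetric:

```python
import numpy as np

# Hessian of f(x, y) = x**2 + 3*x*y + y**2; its eigenvalues are -1 and 5.
H = np.array([[2.0, 3.0],
              [3.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(H)   # eigh returns unit-norm eigenvectors

d = eigvecs[:, 0]                      # unit vector along the first eigenvector
print(d @ H @ d, eigvals[0])           # both are -1: d^T H d equals the eigenvalue

d = np.array([1.0, 0.0])               # an arbitrary unit direction
print(d @ H @ d)                       # 2.0, which lies between -1 and 5
```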

An important use of the Hessian matrix is the second derivative test, which determines whether a critical point is a local maximum, a local minimum, or a saddle point. In one dimension this is straightforward: at a critical point, $f''(x)>0$ implies a local minimum and $f''(x)<0$ a local maximum. In multiple dimensions, we examine the eigenvalues of the Hessian at the critical point. When the Hessian is positive definite (all eigenvalues positive), the critical point is a local minimum; when it is negative definite (all eigenvalues negative), it is a local maximum. If the Hessian has at least one positive and one negative eigenvalue, then $\vec x$ is a saddle point: a local maximum along some cross-sections and a local minimum along others.
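
Written as code, the test reduces to checking the signs of the Hessian's eigenvalues; the helper `classify_critical_point` below is an illustrative sketch, not a standard library routine:

```python
import numpy as np

def classify_critical_point(H):
    """Second derivative test from the eigenvalues of the symmetric Hessian H."""
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > 0):
        return "local minimum"        # positive definite
    if np.all(eigvals < 0):
        return "local maximum"        # negative definite
    if np.any(eigvals > 0) and np.any(eigvals < 0):
        return "saddle point"         # mixed signs
    return "inconclusive"             # some eigenvalues are zero

# f(x, y) = x**2 - y**2 has a critical point at the origin with Hessian diag(2, -2).
print(classify_critical_point(np.diag([2.0, -2.0])))   # saddle point
print(classify_critical_point(np.diag([2.0, 2.0])))    # local minimum
```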

This is the first article in a series on numerical computation in machine learning. To be continued…