Deriving the SVM (Part 1)
The SVM is a classic method in machine learning. Beyond the hard-margin SVM, there are also variants such as the soft-margin SVM and the kernel trick. This article focuses on deriving the hard-margin SVM.
Functional Distance and Geometric Distance
Suppose the two classes of sample points can be separated exactly; then we can use the hard-margin SVM for classification. Assume the separating hyperplane has the equation . The distance from each sample point to this hyperplane is . If we define points whose distance from the hyperplane is positive as the positive class, i.e. , and conversely points with negative distance as the negative class, i.e. , then the distance from a sample point to the separating hyperplane can be expressed as . This is called the functional distance from the sample point to the separating hyperplane.
Let be the minimum functional distance. Note that when and are both scaled up by some factor, the functional distance increases but the hyperplane itself does not change. We therefore need to impose a constraint on the hyperplane’s , for example by requiring . We can redefine the distance as , which is called the geometric distance. Letting , we obtain .
The Optimization Form of the Maximum Margin
The optimization problem of maximizing the separating distance can then be expressed as follows:
that is,
Note that in the expression above, the value of the functional margin does not affect the solution of the optimization problem, so we may simply assume . Noting also that maximizing is equivalent to minimizing , the problem above can be rewritten equivalently as:
Lagrange Multipliers and the KKT Conditions
The problem above is an inequality-constrained optimization problem, which can be solved using the method of Lagrange multipliers together with the KKT conditions. Let . We first construct the Lagrangian function:
Suppose are the optimal solution of the optimization problem. Then, according to the KKT conditions, must satisfy the following equations:
The KKT conditions consist of several parts: 1. the gradient of the Lagrangian with respect to the original optimization variables is zero, as in (1) and (2); 2. the product of each Lagrange multiplier and the left-hand side of its inequality constraint (in standard form) is zero, as in (3); 3. the constraints of the original problem, as in (4); 4. the Lagrange multipliers are nonnegative, as in (5).
From (1) and (2) we obtain:
The Dual Problem
By Lagrangian duality, the original problem can be rewritten as:
that is,
For the inner problem , the KKT conditions give the optimal solution , along with . Substituting these into the expression above reduces the original problem to:
After solving for , we can compute via . Since the separating hyperplane is determined by the parameters , and the hyperplane is determined by the with , we then pick any such that and obtain . This yields the separating hyperplane .
For the optimization problem of solving for above, we will introduce the SMO algorithm in the next article to solve it. To be continued…