Deriving the SVM (Part 1)

The SVM is a classic method in machine learning. Beyond the hard-margin SVM, there are also variants such as the soft-margin SVM and the kernel trick. This article focuses on deriving the hard-margin SVM.

Functional Distance and Geometric Distance

Suppose the two classes of sample points can be separated exactly; then we can use the hard-margin SVM for classification. Assume the separating hyperplane has the equation wx+b=0w\cdot x+b=0. The distance from each sample point xix_{i} to this hyperplane is wxi+b|w\cdot x_{i}+b|. If we define points whose distance from the hyperplane is positive as the positive class, i.e. yi=+1y_{i}=+1, and conversely points with negative distance as the negative class, i.e. yi=1y_{i}=-1, then the distance from a sample point to the separating hyperplane can be expressed as γ^i=yi(wxi+b)\hat{\gamma}_{i}=y_{i}(w\cdot x_{i}+b). This is called the functional distance from the sample point to the separating hyperplane.

Let γ^=min(γ^i)\hat{\gamma}=min(\hat{\gamma}_{i}) be the minimum functional distance. Note that when ww and bb are both scaled up by some factor, the functional distance γ^i=yi(wxi+b)\hat{\gamma}_{i}=y_{i}(w\cdot x_{i}+b) increases but the hyperplane itself does not change. We therefore need to impose a constraint on the hyperplane’s ww, for example by requiring w=1||w||=1. We can redefine the distance as γi=yi(wwxi+bw)\gamma_{i}=y_{i}(\frac{w}{||w||}\cdot x_{i}+\frac{b}{||w||}), which is called the geometric distance. Letting γ=min(γi)\gamma=min(\gamma_{i}), we obtain γ=γ^w\gamma=\frac{\hat\gamma}{||w||}.

The Optimization Form of the Maximum Margin

The optimization problem of maximizing the separating distance can then be expressed as follows:

maxw,bγs.t.yi(wwxi+bw)γi=1,2...n\begin{align*} &max_{w,b}\quad \gamma \\ &s.t.\quad y_{i}(\frac{w}{||w||}\cdot x_{i}+\frac{b}{||w||})\ge\gamma,i=1,2...n \end{align*}

that is,

maxw,bγ^ws.t.yi(wxi+b)γ^i=1,2...n\begin{align*} &max_{w,b}\quad \frac{\hat\gamma}{||w||} \\ &s.t.\quad y_{i}(w\cdot x_{i}+b)\ge\hat \gamma,i=1,2...n \end{align*}

Note that in the expression above, the value of the functional margin γ^\hat\gamma does not affect the solution of the optimization problem, so we may simply assume γ^=1\hat\gamma=1. Noting also that maximizing 1w\frac{1}{||w||} is equivalent to minimizing 12w2\frac{1}{2}||w||^{2}, the problem above can be rewritten equivalently as:

minw,b12w2s.t.yi(wxi+b)10\begin{align*} &min_{w,b}\quad \frac{1}{2}||w||^{2}\\ &s.t.\quad y_{i}(w\cdot x_{i}+b)-1\ge0 \end{align*}

Lagrange Multipliers and the KKT Conditions

The problem above is an inequality-constrained optimization problem, which can be solved using the method of Lagrange multipliers together with the KKT conditions. Let α=[α1,α2,α3,...,αn]T\alpha=[\alpha_{1},\alpha_{2},\alpha_{3},...,\alpha_{n}]^{T}. We first construct the Lagrangian function: L(w,b,α)=12w2iαiyi(wxi+b)+iαiL(w,b,\alpha)=\frac{1}{2}||w||^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i}

Suppose W,b,αW^{*},b^{*},\alpha^{*} are the optimal solution of the optimization problem. Then, according to the KKT conditions, W,b,αW^{*},b^{*},\alpha^{*} must satisfy the following equations:

wL(w,b,α)=wi=1Nαiyixi=0bL(w,b,α)=i=1Nαiyi=0αi(yi(wxi+b)1)=0yi(wxi+b)10αi0\begin{align} &\nabla_{w}L(w^{*},b^{*},\alpha^{*})=w^{*}-\sum_{i=1}^{N}\alpha^{*}_{i}y_{i}x_{i}=0\tag{1}\\ &\nabla_{b}L(w^{*},b^{*},\alpha^{*})=-\sum_{i=1}^{N}\alpha_{i}^{*}y_{i}=0\tag{2}\\ &\alpha_{i}^{*}(y_{i}(w^{*}\cdot x_{i}+b^{*})-1)=0\tag{3}\\ &y_{i}(w^{*}\cdot x_{i}+b^{*})-1\ge 0\tag{4}\\ &\alpha_{i}^{*}\ge0\tag{5} \end{align}

The KKT conditions consist of several parts: 1. the gradient of the Lagrangian with respect to the original optimization variables is zero, as in (1) and (2); 2. the product of each Lagrange multiplier and the left-hand side of its inequality constraint (in standard form) is zero, as in (3); 3. the constraints of the original problem, as in (4); 4. the Lagrange multipliers are nonnegative, as in (5).

From (1) and (2) we obtain:

w=iNαiyixiiNαiyi=0w^{*}=\sum_{i}^{N}\alpha_{i}^{*}y_{i}x_{i}\\ \sum_{i}^{N}\alpha_{i}^{*}y_{i}=0

The Dual Problem

By Lagrangian duality, the original problem can be rewritten as:

minw,b maxα 12w2iαiyi(wxi+b)+iαi,αi0min_{w,b}\ max_{\alpha}\ \frac{1}{2}||w||^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i},\alpha_{i}\ge 0

that is,

maxα minw,b 12w2iαiyi(wxi+b)+iαi,αi0max_{\alpha}\ min_{w,b}\ \frac{1}{2}||w||^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i},\alpha_{i}\ge 0

For the inner problem minw,b 12w2iαiyi(wxi+b)+iαimin_{w,b}\ \frac{1}{2}||w||^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i}, the KKT conditions give the optimal solution w=iNαiyixiw^{*}=\sum_{i}^{N}\alpha_{i}^{*}y_{i}x_{i}, along with iNαiyi=0\sum_{i}^{N}\alpha_{i}^{*}y_{i}=0. Substituting these into the expression above reduces the original problem to:

maxα12ijαiαjyiyj(xixj)+iαis.t.αi0\begin{align*} max_{\alpha}\quad&-\frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}\cdot x_{j})+\sum_{i}\alpha_{i}\\ s.t.\quad&\alpha_{i}\ge 0 \end{align*}

After solving for α\alpha^{*}, we can compute ww^{*} via w=iNαiyixiw^{*}=\sum_{i}^{N}\alpha_{i}^{*}y_{i}x_{i}. Since the separating hyperplane is determined by the parameters w,bw^{*},b^{*}, and the hyperplane is determined by the αi,xi,yi\alpha_{i},x_{i},y_{i} with αi0\alpha_{i}\neq0, we then pick any jj such that αj0\alpha_{j}\neq0 and obtain b=yjwxjb^{*}=y_{j}-w^{*}\cdot x_{j}. This yields the separating hyperplane wx+b=0w^{*}\cdot x+b^{*}=0.

For the optimization problem of solving for α\alpha^{*} above, we will introduce the SMO algorithm in the next article to solve it. To be continued…