Deriving the SVM (3)

The previous posts covered the derivation of the hard-margin SVM. This post continues with the mathematical derivation of the soft-margin SVM, which allows some samples to be misclassified when the data is not linearly separable. The soft-margin SVM permits some samples to violate the constraint yi(wxi)0y_{i}(w\cdot x_{i})\ge 0.

We can write the optimization objective as:

minw,b12w2+Ci=1mloss(yi(wxi+b)1)min_{w,b}\quad \frac{1}{2}||w||^{2}+C\sum_{i=1}^{m}loss(y_{i}(w\cdot x_{i}+b)-1)

Here CC is a constant that measures how much constraint violation we tolerate, and the loss()loss() function can be the hinge()hinge() function, i.e. losshinge(z)=max(0,1z)loss_{hinge}(z)=max(0,1-z).

So the optimization objective can be written as:

minw,b12w2+Ci=1mmax(0,1yi(wxi+b))min_{w,b}\quad \frac{1}{2}||w||^{2}+C\sum_{i=1}^{m}max(0,1-y_{i}(w\cdot x_{i}+b))

Introducing “slack variables” ξi0\xi_{i}\ge 0, we can rewrite the expression above as

minw,b,ξi12w2+Ci=1mξis.t.yi(wxi+b)1ξiξi0,i=1,2,...,m\begin{align*} min_{w,b,\xi_{i}}\quad& \frac{1}{2}||w||^{2}+C\sum_{i=1}^{m}\xi_{i}\\ s.t.\quad& y_{i}(w\cdot x_{i}+b)\ge1-\xi_{i}\\ &\xi_{i}\ge0,i=1,2,...,m \end{align*}

As with the hard-margin SVM, the problem above is a quadratic programming problem. We can first use Lagrangian duality to convert it into the corresponding dual problem, then solve it with the SMO algorithm. The Lagrangian for the problem above is:

L(w,b,α,ξ,μ)=12w2+Ci=1mξi+i=1mαi(1ξiyi(wxi+b))i=1mμiξiL(w,b,\alpha,\xi,\mu)=\frac{1}{2}||w||^{2}+C\sum_{i=1}^{m}\xi_{i}\\+\sum_{i=1}^{m}\alpha_{i}(1-\xi_{i}-y_{i}(w\cdot x_{i}+b))-\sum_{i=1}^{m}\mu_{i}\xi_{i}

Setting the partial derivatives of LL with respect to w,b,αw,b,\alpha to 00 yields

w=i=1mαiyixi0=i=1mαiyiC=αi+μiw=\sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}\\ 0=\sum_{i=1}^{m}\alpha_{i}y_{i}\\ C=\alpha_{i}+\mu_{i}

Substituting back into L(w,b,α,ξ,μ)L(w,b,\alpha,\xi,\mu) turns the original problem into the dual problem:

minα12i=1mj=1mαiαjyiyjxixji=1mαis.t.i=1mαiyi=0Cαi0i=1,2,...,m\begin{align*} min_ \alpha\quad &\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}\cdot x_{j}-\sum_{i=1}^{m}\alpha_{i}\\ s.t.\quad &\sum_{i=1}^{m}\alpha_{i}y_{i}=0\\ &C\ge \alpha_{i}\ge 0\\ &i=1,2,...,m \end{align*}

As we can see, the only difference from the hard-margin SVM is that αi0\alpha_{i}\ge 0 becomes Cαi0C\ge \alpha_{i}\ge 0. It can likewise be solved conveniently with the SMO algorithm mentioned earlier; the only difference is that the clipping step needs to consider two directions.

In later posts I’ll also introduce the kernel trick for SVMs and commonly used kernels. To be continued…