Deriving the SVM (3)

The previous posts covered the derivation of the hard-margin SVM. This post continues with the mathematical derivation of the soft-margin SVM, which allows some samples to be misclassified when the data is not linearly separable. The soft-margin SVM permits some samples to violate the constraint $y_{i}(w\cdot x_{i})\ge 0$ .

We can write the optimization objective as:

min_{w,b}\quad \frac{1}{2}||w||^{2}+C\sum_{i=1}^{m}loss(y_{i}(w\cdot x_{i}+b)-1)

Here $C$ is a constant that measures how much constraint violation we tolerate, and the $loss()$ function can be the $hinge()$ function, i.e. $loss_{hinge}(z)=max(0,1-z)$ .

So the optimization objective can be written as:

min_{w,b}\quad \frac{1}{2}||w||^{2}+C\sum_{i=1}^{m}max(0,1-y_{i}(w\cdot x_{i}+b))

Introducing “slack variables” $\xi_{i}\ge 0$ , we can rewrite the expression above as

\begin{align*} min_{w,b,\xi_{i}}\quad& \frac{1}{2}||w||^{2}+C\sum_{i=1}^{m}\xi_{i}\\ s.t.\quad& y_{i}(w\cdot x_{i}+b)\ge1-\xi_{i}\\ &\xi_{i}\ge0,i=1,2,...,m \end{align*}

As with the hard-margin SVM, the problem above is a quadratic programming problem. We can first use Lagrangian duality to convert it into the corresponding dual problem, then solve it with the SMO algorithm. The Lagrangian for the problem above is:

L(w,b,\alpha,\xi,\mu)=\frac{1}{2}||w||^{2}+C\sum_{i=1}^{m}\xi_{i}\\+\sum_{i=1}^{m}\alpha_{i}(1-\xi_{i}-y_{i}(w\cdot x_{i}+b))-\sum_{i=1}^{m}\mu_{i}\xi_{i}

Setting the partial derivatives of $L$ with respect to $w,b,\alpha$ to $0$ yields

w=\sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}\\ 0=\sum_{i=1}^{m}\alpha_{i}y_{i}\\ C=\alpha_{i}+\mu_{i}

Substituting back into $L(w,b,\alpha,\xi,\mu)$ turns the original problem into the dual problem:

\begin{align*} min_ \alpha\quad &\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}\cdot x_{j}-\sum_{i=1}^{m}\alpha_{i}\\ s.t.\quad &\sum_{i=1}^{m}\alpha_{i}y_{i}=0\\ &C\ge \alpha_{i}\ge 0\\ &i=1,2,...,m \end{align*}

As we can see, the only difference from the hard-margin SVM is that $\alpha_{i}\ge 0$ becomes $C\ge \alpha_{i}\ge 0$ . It can likewise be solved conveniently with the SMO algorithm mentioned earlier; the only difference is that the clipping step needs to consider two directions.

In later posts I’ll also introduce the kernel trick for SVMs and commonly used kernels. To be continued…

Technology

2018 · 08 · 24