Derivation of SVM (3)
In the previous article, we introduced the derivation of hard-margin SVM. This article continues with the mathematical derivation of soft-margin SVM, which allows some samples to be misclassified when the data are not linearly separable. Soft-margin SVM permits some samples to violate the constraint $y_i(w \cdot x_i + b) \ge 1$.
The optimization objective can be written as:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\mathrm{loss}\bigl(y_i(w \cdot x_i + b)\bigr)$$
Here, $C > 0$ is a constant that controls how heavily constraint violations are penalized. A common choice for $\mathrm{loss}(\cdot)$ is the hinge loss, $\ell_{\mathrm{hinge}}(z) = \max(0,\ 1 - z)$.
Thus, the optimization objective can be rewritten as:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\max\bigl(0,\ 1 - y_i(w \cdot x_i + b)\bigr)$$
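As a quick sanity check on this unconstrained form, here is a minimal NumPy sketch that evaluates the objective for a candidate $(w, b)$; the function name `soft_margin_objective` and its signature are my own illustration, not part of the original derivation.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Hinge-loss form of the soft-margin objective (illustrative sketch).

    X: (m, d) sample matrix, y: (m,) labels in {-1, +1},
    w: (d,) weight vector, b: scalar bias, C: penalty constant (assumed names).
    """
    margins = y * (X @ w + b)                  # y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)     # max(0, 1 - y_i (w . x_i + b))
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```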
By introducing slack variables $\xi_i \ge 0$, the above expression can be rewritten as:
$$
\begin{aligned}
\min_{w,b,\xi}\quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & y_i(w \cdot x_i + b) \ge 1 - \xi_i \\
& \xi_i \ge 0,\quad i = 1,2,\dots,m
\end{aligned}
$$
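To see why this constrained form is equivalent to the hinge-loss objective above: for any fixed $(w, b)$, the constraints force $\xi_i \ge 1 - y_i(w \cdot x_i + b)$ and $\xi_i \ge 0$, and minimizing $C\sum_i \xi_i$ drives each slack down to the smallest feasible value,

$$\xi_i = \max\bigl(0,\ 1 - y_i(w \cdot x_i + b)\bigr),$$

which is exactly the hinge loss of sample $i$.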
Similar to hard-margin SVM, the above problem is also a quadratic programming problem. It can be converted into a corresponding dual problem using Lagrangian duality and solved with the SMO algorithm. The Lagrangian function corresponding to the above problem is:
$$L(w,b,\alpha,\xi,\mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\bigl(1 - \xi_i - y_i(w \cdot x_i + b)\bigr) - \sum_{i=1}^{m}\mu_i\xi_i$$

where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers.
Setting the partial derivatives of $L$ with respect to $w$, $b$, and $\xi_i$ to 0, we get:
$$
\begin{aligned}
w &= \sum_{i=1}^{m}\alpha_i y_i x_i \\
0 &= \sum_{i=1}^{m}\alpha_i y_i \\
C &= \alpha_i + \mu_i
\end{aligned}
$$
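Carrying out the substitution explicitly (a step worth spelling out): the terms in $\xi_i$ cancel because $C = \alpha_i + \mu_i$, the term in $b$ vanishes because $\sum_i \alpha_i y_i = 0$, and what remains depends on $\alpha$ alone:

$$L = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j\, x_i \cdot x_j$$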
Maximizing this expression over $\alpha$ (equivalently, minimizing its negative) transforms the original problem into the dual problem:
$$
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j\, x_i \cdot x_j - \sum_{i=1}^{m}\alpha_i \\
\text{s.t.}\quad & \sum_{i=1}^{m}\alpha_i y_i = 0 \\
& 0 \le \alpha_i \le C,\quad i = 1,2,\dots,m
\end{aligned}
$$
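Before turning to SMO, it can be helpful to hand this dual to a generic constrained optimizer just to see the pieces fit together. The sketch below uses SciPy's SLSQP solver (my choice for illustration; the article itself relies on SMO) and then recovers $w$ and $b$ from the KKT conditions. The helper name `solve_soft_margin_dual` and the tolerance `1e-6` are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, y, C):
    """Solve the soft-margin dual with a generic solver (illustrative sketch).

    X: (m, d) samples, y: (m,) labels in {-1, +1}, C: penalty constant.
    """
    m = X.shape[0]
    Yx = y[:, None] * X                     # rows are y_i * x_i
    K = Yx @ Yx.T                           # K_ij = y_i y_j (x_i . x_j)

    def objective(alpha):                   # 1/2 a^T K a - sum(a)
        return 0.5 * alpha @ K @ alpha - alpha.sum()

    def gradient(alpha):
        return K @ alpha - np.ones(m)

    res = minimize(objective, np.zeros(m), jac=gradient, method="SLSQP",
                   bounds=[(0.0, C)] * m,                               # 0 <= a_i <= C
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum a_i y_i = 0
    alpha = res.x

    # Stationarity condition: w = sum_i alpha_i y_i x_i.
    w = (alpha * y) @ X
    # b from "free" support vectors (0 < alpha_i < C), whose margin is exactly 1.
    free = (alpha > 1e-6) & (alpha < C - 1e-6)
    b = float(np.mean(y[free] - X[free] @ w)) if free.any() else 0.0
    return w, b, alpha
```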
The only difference from hard-margin SVM is that the constraint $\alpha_i \ge 0$ becomes $0 \le \alpha_i \le C$. The SMO algorithm introduced earlier can still be used to solve this dual; the only change is that, when updating a pair of multipliers, the clipping step must respect both the lower bound 0 and the upper bound $C$.
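As a concrete picture of this two-sided clipping: SMO updates one pair $(\alpha_i, \alpha_j)$ at a time while keeping $\alpha_i y_i + \alpha_j y_j$ fixed, so the new $\alpha_j$ must be clipped to a segment $[L, H]$ inside the box $[0, C]$. The sketch below illustrates just that clipping step; the function name and argument layout are mine, not the article's.

```python
def clip_alpha(alpha_i, alpha_j, y_i, y_j, alpha_j_new, C):
    """Box-clipping step of SMO for the soft-margin dual (illustrative sketch).

    The pair must stay on the line alpha_i*y_i + alpha_j*y_j = const while
    both coordinates stay in [0, C], so the unconstrained update alpha_j_new
    is clipped to the feasible segment [L, H].
    """
    if y_i != y_j:
        L = max(0.0, alpha_j - alpha_i)
        H = min(C, C + alpha_j - alpha_i)
    else:
        L = max(0.0, alpha_i + alpha_j - C)
        H = min(C, alpha_i + alpha_j)
    return min(max(alpha_j_new, L), H)
```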
Later, we will introduce the kernel trick of SVM and commonly used kernels. To be continued…