Derivation of SVM (2)

In the previous article (1), we discussed the derivation of the hard-margin SVM and its dual problem, which simplifies to the following form:

$$
\begin{align*}
\min_{\alpha}\quad &\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i}\alpha_{j}y_{i}y_{j}\,x_{i}\cdot x_{j}-\sum_{i=1}^{N}\alpha_{i}\\
\text{s.t.}\quad &\sum_{i=1}^{N}\alpha_{i}y_{i}=0\\
&\alpha_{i}\ge 0,\quad i=1,2,\ldots,N
\end{align*}
$$

This problem can be viewed as a quadratic program with $\alpha$ as the optimization variable. There are many mature solvers for quadratic programs, and the SMO (Sequential Minimal Optimization) algorithm is one of the most efficient choices for SVM training.
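To make the structure of this quadratic program concrete, here is a minimal NumPy sketch (not from the original article; `dual_objective` and `is_feasible` are hypothetical helper names) that evaluates the dual objective and checks the constraints:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Dual objective of the hard-margin SVM (the quantity to minimize).

    alpha : (N,) Lagrange multipliers
    X     : (N, d) training inputs
    y     : (N,) labels in {-1, +1}
    """
    K = X @ X.T                          # Gram matrix, K[i, j] = x_i . x_j
    Q = (y[:, None] * y[None, :]) * K    # Q[i, j] = y_i y_j K[i, j]
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def is_feasible(alpha, y, tol=1e-8):
    """Check the dual constraints: sum_i alpha_i y_i = 0 and alpha_i >= 0."""
    return abs(alpha @ y) < tol and np.all(alpha >= -tol)
```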

The SMO (Sequential Minimal Optimization) Algorithm

The SMO algorithm first initializes all components of $\alpha$, for example $\alpha_{1}=\alpha_{2}=\dots=\alpha_{N}=0$. It then treats two components of $\alpha$ as variables, say $\alpha_{1},\alpha_{2}$ (when selecting a pair $\alpha_{i},\alpha_{j}$, the component that most severely violates the KKT conditions mentioned earlier is typically chosen as $\alpha_{i}$, and the $\alpha_{j}$ whose point $x_{j}$ lies farthest from the margin of $x_{i}$ is chosen as the second variable; a code sketch of this selection step is given after the subproblem below). The remaining $\alpha_{3},\alpha_{4},\dots,\alpha_{N}$ are held fixed. From the constraint $\sum_{i=1}^{N}\alpha_{i}y_{i}=0$ we obtain $\alpha_{1}=-y_{1}\sum_{i=2}^{N}\alpha_{i}y_{i}$. The problem can then be transformed into a quadratic program in two variables (writing $K_{ij}=x_{i}\cdot x_{j}$):

$$
\begin{align*}
\min_{\alpha_{1},\alpha_{2}}\quad W(\alpha_{1},\alpha_{2})=\;&\frac{1}{2}K_{11}\alpha_{1}^{2}+\frac{1}{2}K_{22}\alpha_{2}^{2}+y_{1}y_{2}K_{12}\alpha_{1}\alpha_{2}\\
&-(\alpha_{1}+\alpha_{2})+y_{1}\alpha_{1}\sum_{i=3}^{N}y_{i}\alpha_{i}K_{i1}+y_{2}\alpha_{2}\sum_{i=3}^{N}y_{i}\alpha_{i}K_{i2}\\
\text{s.t.}\quad &\alpha_{1}y_{1}+\alpha_{2}y_{2}=-\sum_{i=3}^{N}y_{i}\alpha_{i}=\zeta\\
&\alpha_{1},\alpha_{2}\ge 0
\end{align*}
$$
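The pair-selection heuristic mentioned above can be sketched as follows. This is an illustrative sketch rather than the article's prescribed procedure: it assumes the hard-margin KKT conditions from article (1), namely $\alpha_{i}=0 \Rightarrow y_{i}g(x_{i})\ge 1$ and $\alpha_{i}>0 \Rightarrow y_{i}g(x_{i})=1$ with $g(x)=\sum_{j}\alpha_{j}y_{j}K(x,x_{j})+b$, and, as a stand-in for the farthest-from-the-margin rule, it picks the second index to maximize the error gap $|E_{i}-E_{j}|$, where $E_{p}=g(x_{p})-y_{p}$ (a common heuristic; $E$ is defined again later in the text):

```python
import numpy as np

def select_working_pair(alpha, y, K, b, tol=1e-5):
    """Pick (i, j): i = worst KKT violator, j = index maximizing |E_i - E_j|."""
    g = K @ (alpha * y) + b        # g(x_p) for every training point
    E = g - y                      # prediction errors E_p = g(x_p) - y_p
    margin = y * g                 # y_p g(x_p): >= 1 required, == 1 if alpha_p > 0

    # Measure how badly each point violates its KKT condition.
    violation = np.where(alpha > tol,
                         np.abs(margin - 1.0),           # alpha_p > 0: need y_p g(x_p) = 1
                         np.maximum(0.0, 1.0 - margin))  # alpha_p = 0: need y_p g(x_p) >= 1
    i = int(np.argmax(violation))
    # Second variable: make the step |E_i - E_j| as large as possible
    # (a real implementation would also make sure j != i).
    j = int(np.argmax(np.abs(E - E[i])))
    return i, j
```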

In this quadratic program, the equality constraint $\alpha_{1}y_{1}+\alpha_{2}y_{2}=\zeta$ gives $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}$ (using $y_{1}^{2}=1$). Substituting this into $W(\alpha_{1},\alpha_{2})$ turns the problem into a quadratic program in the single variable $\alpha_{2}$. If we temporarily ignore the inequality constraints, we can obtain an analytical solution directly, without numerical optimization, which speeds up the computation considerably.

Let $v_{i}=\sum_{j=3}^{N}\alpha_{j}y_{j}K(x_{i},x_{j})$. Substituting $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}$ into $W(\alpha_{1},\alpha_{2})$ gives:

$$
W(\alpha_{2})=\frac{1}{2}K_{11}(\zeta-\alpha_{2}y_{2})^{2}+\frac{1}{2}K_{22}\alpha_{2}^{2}+y_{2}K_{12}(\zeta-\alpha_{2}y_{2})\alpha_{2}-(\zeta-\alpha_{2}y_{2})y_{1}-\alpha_{2}+v_{1}(\zeta-\alpha_{2}y_{2})+y_{2}v_{2}\alpha_{2}
$$
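As an intermediate step that the original text skips over (worked out here for completeness, using $y_{2}^{2}=1$), the derivative is

$$
\frac{\partial W}{\partial\alpha_{2}}=(K_{11}+K_{22}-2K_{12})\,\alpha_{2}-y_{2}\bigl[\zeta(K_{11}-K_{12})+v_{1}-v_{2}+y_{2}-y_{1}\bigr]
$$

Rewriting $\zeta$, $v_{1}$, and $v_{2}$ in terms of the current (pre-update) $\alpha$ values and the errors $E_{1},E_{2}$ turns the stationarity condition into the closed-form update stated next.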

By directly setting $\frac{\partial W}{\partial\alpha_{2}}=0$, we obtain the analytical solution $\hat\alpha_{2}=\alpha_{2}+\frac{y_{2}(E_{1}-E_{2})}{\eta}$, where $E_{i}=\sum_{j=1}^{N}\alpha_{j}y_{j}K_{ij}+b-y_{i}$ and $\eta=K_{11}+K_{22}-2K_{12}$. This $\hat\alpha_{2}$ does not yet account for the inequality constraints $\alpha_{1},\alpha_{2}\ge 0$. From $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}\ge 0$ and $\alpha_{2}\ge 0$, solving the inequalities gives an upper bound $H$ and a lower bound $L$ for $\alpha_{2}$. After clipping, the analytical solution for $\alpha_{2}$ is:

$$
\alpha_{2}^{*}=
\begin{cases}
H, & \hat\alpha_{2}>H\\
\hat\alpha_{2}, & L\le\hat\alpha_{2}\le H\\
L, & \hat\alpha_{2}<L
\end{cases}
$$
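For the hard-margin case considered here (a worked detail not spelled out in the original text; $\alpha_{1},\alpha_{2}$ on the right-hand sides denote the values before the update), solving $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}\ge 0$ together with $\alpha_{2}\ge 0$ gives

$$
\begin{cases}
L=\max(0,\ \alpha_{2}-\alpha_{1}),\quad H=+\infty, & \text{if } y_{1}\ne y_{2}\\
L=0,\quad H=\alpha_{1}+\alpha_{2}, & \text{if } y_{1}=y_{2}
\end{cases}
$$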

In addition, from $\alpha_{1}^{*}=(\zeta-y_{2}\alpha_{2}^{*})y_{1}$ we obtain $\alpha_{1}^{*}$. This completes the update of one pair of variables in the SMO algorithm. The process of variable selection, analytical solving, and clipping is repeated until all components of $\alpha$ satisfy the KKT conditions discussed in article (1). Using the formulas for $w$ and $b$ from article (1), we then obtain the trained separating hyperplane, which completes the mathematical derivation of the hard-margin SVM. Future articles will continue with the derivation of the soft-margin SVM and the application of kernel methods. To be continued…
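For reference, the pieces derived above can be assembled into a single SMO update step roughly as follows (a minimal NumPy sketch under the hard-margin assumptions of this article; `smo_step` is a hypothetical name, the pair `(i, j)` is assumed to come from a selection step like the one sketched earlier, and the last lines recompute $b$ from a support vector rather than deriving a dedicated bias update):

```python
import numpy as np

def smo_step(alpha, y, K, b, i, j, tol=1e-12):
    """One hard-margin SMO update of the pair (alpha_i, alpha_j).

    alpha, y are length-N arrays; K is the N x N Gram matrix
    with K[p, q] = x_p . x_q. Returns the updated (alpha, b).
    """
    g = K @ (alpha * y) + b                  # g(x_p) for all training points
    E = g - y                                # E_p = g(x_p) - y_p
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta < tol:                            # degenerate pair, skip the update
        return alpha, b

    # Unclipped analytical solution for alpha_j.
    aj_new = alpha[j] + y[j] * (E[i] - E[j]) / eta

    # Clip to [L, H], derived from alpha_i >= 0 and alpha_j >= 0.
    if y[i] == y[j]:
        L, H = 0.0, alpha[i] + alpha[j]
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), np.inf
    aj_new = float(np.clip(aj_new, L, H))

    # Recover alpha_i from the equality constraint
    # alpha_i y_i + alpha_j y_j = zeta (kept constant by the update).
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)

    alpha = alpha.copy()
    alpha[i], alpha[j] = ai_new, aj_new

    # Refresh b from any support vector: b = y_k - sum_p alpha_p y_p K_pk
    # (one common choice, consistent with the b formula from article (1)).
    sv = np.flatnonzero(alpha > tol)
    if sv.size > 0:
        k = sv[0]
        b = y[k] - K[k] @ (alpha * y)
    return alpha, b
```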