Derivation of SVM (2)

In the previous article (1), we discussed the derivation of the hard-margin SVM and its dual problem, which simplifies to the following form:

$$
\begin{align*}
\min_{\alpha}\quad &\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i}\alpha_{j}y_{i}y_{j}\,x_{i}\cdot x_{j}-\sum_{i=1}^{N}\alpha_{i}\\
\text{s.t.}\quad &\sum_{i=1}^{N}\alpha_{i}y_{i}=0\\
&\alpha_{i}\ge 0,\quad i=1,2,\ldots,N
\end{align*}
$$

This problem can be viewed as a quadratic program with $\alpha$ as the optimization variable. There are many mature solvers for quadratic programs, and the SMO (Sequential Minimal Optimization) algorithm is one of the most efficient choices for SVM training.
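To make the structure of this quadratic program concrete, here is a minimal NumPy sketch (not from the original article; `dual_objective` and `is_feasible` are hypothetical helper names) that evaluates the dual objective and checks the constraints:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Dual objective of the hard-margin SVM (the quantity to minimize).

    alpha : (N,) Lagrange multipliers
    X     : (N, d) training inputs
    y     : (N,) labels in {-1, +1}
    """
    K = X @ X.T                          # Gram matrix, K[i, j] = x_i . x_j
    Q = (y[:, None] * y[None, :]) * K    # Q[i, j] = y_i y_j K[i, j]
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def is_feasible(alpha, y, tol=1e-8):
    """Check the dual constraints: sum_i alpha_i y_i = 0 and alpha_i >= 0."""
    return abs(alpha @ y) < tol and np.all(alpha >= -tol)
```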

The SMO (Sequential Minimal Optimization) Algorithm

The SMO algorithm first initializes all components of $\alpha$, for example $\alpha_{1}=\alpha_{2}=\dots=\alpha_{N}=0$. It then treats two components of $\alpha$ as variables, say $\alpha_{1},\alpha_{2}$ (when selecting a pair $\alpha_{i},\alpha_{j}$, the component that most severely violates the KKT conditions mentioned earlier is typically chosen as $\alpha_{i}$, and the $\alpha_{j}$ whose point $x_{j}$ lies farthest from the margin of $x_{i}$ is chosen as the second variable; a code sketch of this selection step is given after the subproblem below). The remaining $\alpha_{3},\alpha_{4},\dots,\alpha_{N}$ are held fixed. From the constraint $\sum_{i=1}^{N}\alpha_{i}y_{i}=0$ we obtain $\alpha_{1}=-y_{1}\sum_{i=2}^{N}\alpha_{i}y_{i}$. The problem can then be transformed into a quadratic program in two variables (writing $K_{ij}=x_{i}\cdot x_{j}$):

$$
\begin{align*}
\min_{\alpha_{1},\alpha_{2}}\quad W(\alpha_{1},\alpha_{2})=\;&\frac{1}{2}K_{11}\alpha_{1}^{2}+\frac{1}{2}K_{22}\alpha_{2}^{2}+y_{1}y_{2}K_{12}\alpha_{1}\alpha_{2}\\
&-(\alpha_{1}+\alpha_{2})+y_{1}\alpha_{1}\sum_{i=3}^{N}y_{i}\alpha_{i}K_{i1}+y_{2}\alpha_{2}\sum_{i=3}^{N}y_{i}\alpha_{i}K_{i2}\\
\text{s.t.}\quad &\alpha_{1}y_{1}+\alpha_{2}y_{2}=-\sum_{i=3}^{N}y_{i}\alpha_{i}=\zeta\\
&\alpha_{1},\alpha_{2}\ge 0
\end{align*}
$$
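The pair-selection heuristic mentioned above can be sketched as follows. This is an illustrative sketch rather than the article's prescribed procedure: it assumes the hard-margin KKT conditions from article (1), namely $\alpha_{i}=0 \Rightarrow y_{i}g(x_{i})\ge 1$ and $\alpha_{i}>0 \Rightarrow y_{i}g(x_{i})=1$ with $g(x)=\sum_{j}\alpha_{j}y_{j}K(x,x_{j})+b$, and, as a stand-in for the farthest-from-the-margin rule, it picks the second index to maximize the error gap $|E_{i}-E_{j}|$, where $E_{p}=g(x_{p})-y_{p}$ (a common heuristic; $E$ is defined again later in the text):

```python
import numpy as np

def select_working_pair(alpha, y, K, b, tol=1e-5):
    """Pick (i, j): i = worst KKT violator, j = index maximizing |E_i - E_j|."""
    g = K @ (alpha * y) + b        # g(x_p) for every training point
    E = g - y                      # prediction errors E_p = g(x_p) - y_p
    margin = y * g                 # y_p g(x_p): >= 1 required, == 1 if alpha_p > 0

    # Measure how badly each point violates its KKT condition.
    violation = np.where(alpha > tol,
                         np.abs(margin - 1.0),           # alpha_p > 0: need y_p g(x_p) = 1
                         np.maximum(0.0, 1.0 - margin))  # alpha_p = 0: need y_p g(x_p) >= 1
    i = int(np.argmax(violation))
    # Second variable: make the step |E_i - E_j| as large as possible
    # (a real implementation would also make sure j != i).
    j = int(np.argmax(np.abs(E - E[i])))
    return i, j
```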

In this quadratic program, the equality constraint $\alpha_{1}y_{1}+\alpha_{2}y_{2}=\zeta$ gives $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}$ (using $y_{1}^{2}=1$). Substituting this into $W(\alpha_{1},\alpha_{2})$ turns the problem into a quadratic program in the single variable $\alpha_{2}$. If we temporarily ignore the inequality constraints, we can obtain an analytical solution directly, without numerical optimization, which speeds up the computation considerably.

Let $v_{i}=\sum_{j=3}^{N}\alpha_{j}y_{j}K(x_{i},x_{j})$. Substituting $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}$ into $W(\alpha_{1},\alpha_{2})$ gives:

$$
W(\alpha_{2})=\frac{1}{2}K_{11}(\zeta-\alpha_{2}y_{2})^{2}+\frac{1}{2}K_{22}\alpha_{2}^{2}+y_{2}K_{12}(\zeta-\alpha_{2}y_{2})\alpha_{2}-(\zeta-\alpha_{2}y_{2})y_{1}-\alpha_{2}+v_{1}(\zeta-\alpha_{2}y_{2})+y_{2}v_{2}\alpha_{2}
$$
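As an intermediate step that the original text skips over (worked out here for completeness, using $y_{2}^{2}=1$), the derivative is

$$
\frac{\partial W}{\partial\alpha_{2}}=(K_{11}+K_{22}-2K_{12})\,\alpha_{2}-y_{2}\bigl[\zeta(K_{11}-K_{12})+v_{1}-v_{2}+y_{2}-y_{1}\bigr]
$$

Rewriting $\zeta$, $v_{1}$, and $v_{2}$ in terms of the current (pre-update) $\alpha$ values and the errors $E_{1},E_{2}$ turns the stationarity condition into the closed-form update stated next.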

By directly setting $\frac{\partial W}{\partial\alpha_{2}}=0$, we obtain the analytical solution $\hat\alpha_{2}=\alpha_{2}+\frac{y_{2}(E_{1}-E_{2})}{\eta}$, where $E_{i}=\sum_{j=1}^{N}\alpha_{j}y_{j}K_{ij}+b-y_{i}$ and $\eta=K_{11}+K_{22}-2K_{12}$. This $\hat\alpha_{2}$ does not yet account for the inequality constraints $\alpha_{1},\alpha_{2}\ge 0$. From $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}\ge 0$ and $\alpha_{2}\ge 0$, solving the inequalities gives an upper bound $H$ and a lower bound $L$ for $\alpha_{2}$. After clipping, the analytical solution for $\alpha_{2}$ is:

$$
\alpha_{2}^{*}=
\begin{cases}
H, & \hat\alpha_{2}>H\\
\hat\alpha_{2}, & L\le\hat\alpha_{2}\le H\\
L, & \hat\alpha_{2}<L
\end{cases}
$$
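For the hard-margin case considered here (a worked detail not spelled out in the original text; $\alpha_{1},\alpha_{2}$ on the right-hand sides denote the values before the update), solving $\alpha_{1}=(\zeta-y_{2}\alpha_{2})y_{1}\ge 0$ together with $\alpha_{2}\ge 0$ gives

$$
\begin{cases}
L=\max(0,\ \alpha_{2}-\alpha_{1}),\quad H=+\infty, & \text{if } y_{1}\ne y_{2}\\
L=0,\quad H=\alpha_{1}+\alpha_{2}, & \text{if } y_{1}=y_{2}
\end{cases}
$$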

In addition, from $\alpha_{1}^{*}=(\zeta-y_{2}\alpha_{2}^{*})y_{1}$ we obtain $\alpha_{1}^{*}$. This completes the update of one pair of variables in the SMO algorithm. The process of variable selection, analytical solving, and clipping is repeated until all components of $\alpha$ satisfy the KKT conditions discussed in article (1). Using the formulas for $w$ and $b$ from article (1), we then obtain the trained separating hyperplane, which completes the mathematical derivation of the hard-margin SVM. Future articles will continue with the derivation of the soft-margin SVM and the application of kernel methods. To be continued…
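For reference, the pieces derived above can be assembled into a single SMO update step roughly as follows (a minimal NumPy sketch under the hard-margin assumptions of this article; `smo_step` is a hypothetical name, the pair `(i, j)` is assumed to come from a selection step like the one sketched earlier, and the last lines recompute $b$ from a support vector rather than deriving a dedicated bias update):

```python
import numpy as np

def smo_step(alpha, y, K, b, i, j, tol=1e-12):
    """One hard-margin SMO update of the pair (alpha_i, alpha_j).

    alpha, y are length-N arrays; K is the N x N Gram matrix
    with K[p, q] = x_p . x_q. Returns the updated (alpha, b).
    """
    g = K @ (alpha * y) + b                  # g(x_p) for all training points
    E = g - y                                # E_p = g(x_p) - y_p
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta < tol:                            # degenerate pair, skip the update
        return alpha, b

    # Unclipped analytical solution for alpha_j.
    aj_new = alpha[j] + y[j] * (E[i] - E[j]) / eta

    # Clip to [L, H], derived from alpha_i >= 0 and alpha_j >= 0.
    if y[i] == y[j]:
        L, H = 0.0, alpha[i] + alpha[j]
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), np.inf
    aj_new = float(np.clip(aj_new, L, H))

    # Recover alpha_i from the equality constraint
    # alpha_i y_i + alpha_j y_j = zeta (kept constant by the update).
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)

    alpha = alpha.copy()
    alpha[i], alpha[j] = ai_new, aj_new

    # Refresh b from any support vector: b = y_k - sum_p alpha_p y_p K_pk
    # (one common choice, consistent with the b formula from article (1)).
    sv = np.flatnonzero(alpha > tol)
    if sv.size > 0:
        k = sv[0]
        b = y[k] - K[k] @ (alpha * y)
    return alpha, b
```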