Deriving the SVM (Part 1)

The SVM is a classic method in machine learning. Beyond the hard-margin SVM, there are also variants such as the soft-margin SVM and the kernel trick. This article focuses on deriving the hard-margin SVM.

Functional Distance and Geometric Distance

Suppose the two classes of sample points can be separated exactly; then we can use the hard-margin SVM for classification. Assume the separating hyperplane has the equation $w\cdot x+b=0$ . The distance from each sample point $x_{i}$ to this hyperplane is $|w\cdot x_{i}+b|$ . If we define points whose distance from the hyperplane is positive as the positive class, i.e. $y_{i}=+1$ , and conversely points with negative distance as the negative class, i.e. $y_{i}=-1$ , then the distance from a sample point to the separating hyperplane can be expressed as $\hat{\gamma}_{i}=y_{i}(w\cdot x_{i}+b)$ . This is called the functional distance from the sample point to the separating hyperplane.

Let $\hat{\gamma}=min(\hat{\gamma}_{i})$ be the minimum functional distance. Note that when $w$ and $b$ are both scaled up by some factor, the functional distance $\hat{\gamma}_{i}=y_{i}(w\cdot x_{i}+b)$ increases but the hyperplane itself does not change. We therefore need to impose a constraint on the hyperplane’s $w$ , for example by requiring $||w||=1$ . We can redefine the distance as $\gamma_{i}=y_{i}(\frac{w}{||w||}\cdot x_{i}+\frac{b}{||w||})$ , which is called the geometric distance. Letting $\gamma=min(\gamma_{i})$ , we obtain $\gamma=\frac{\hat\gamma}{||w||}$ .

The Optimization Form of the Maximum Margin

The optimization problem of maximizing the separating distance can then be expressed as follows:

\begin{align*} &max_{w,b}\quad \gamma \\ &s.t.\quad y_{i}(\frac{w}{||w||}\cdot x_{i}+\frac{b}{||w||})\ge\gamma，i=1,2...n \end{align*}

that is,

\begin{align*} &max_{w,b}\quad \frac{\hat\gamma}{||w||} \\ &s.t.\quad y_{i}(w\cdot x_{i}+b)\ge\hat \gamma，i=1,2...n \end{align*}

Note that in the expression above, the value of the functional margin $\hat\gamma$ does not affect the solution of the optimization problem, so we may simply assume $\hat\gamma=1$ . Noting also that maximizing $\frac{1}{||w||}$ is equivalent to minimizing $\frac{1}{2}||w||^{2}$ , the problem above can be rewritten equivalently as:

\begin{align*} &min_{w,b}\quad \frac{1}{2}||w||^{2}\\ &s.t.\quad y_{i}(w\cdot x_{i}+b)-1\ge0 \end{align*}

Lagrange Multipliers and the KKT Conditions

The problem above is an inequality-constrained optimization problem, which can be solved using the method of Lagrange multipliers together with the KKT conditions. Let $\alpha=[\alpha_{1},\alpha_{2},\alpha_{3},...,\alpha_{n}]^{T}$ . We first construct the Lagrangian function: $L(w,b,\alpha)=\frac{1}{2}||w||^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i}$

Suppose $W^{*},b^{*},\alpha^{*}$ are the optimal solution of the optimization problem. Then, according to the KKT conditions, $W^{*},b^{*},\alpha^{*}$ must satisfy the following equations:

\begin{align} &\nabla_{w}L(w^{*},b^{*},\alpha^{*})=w^{*}-\sum_{i=1}^{N}\alpha^{*}_{i}y_{i}x_{i}=0\tag{1}\\ &\nabla_{b}L(w^{*},b^{*},\alpha^{*})=-\sum_{i=1}^{N}\alpha_{i}^{*}y_{i}=0\tag{2}\\ &\alpha_{i}^{*}(y_{i}(w^{*}\cdot x_{i}+b^{*})-1)=0\tag{3}\\ &y_{i}(w^{*}\cdot x_{i}+b^{*})-1\ge 0\tag{4}\\ &\alpha_{i}^{*}\ge0\tag{5} \end{align}

The KKT conditions consist of several parts: 1. the gradient of the Lagrangian with respect to the original optimization variables is zero, as in (1) and (2); 2. the product of each Lagrange multiplier and the left-hand side of its inequality constraint (in standard form) is zero, as in (3); 3. the constraints of the original problem, as in (4); 4. the Lagrange multipliers are nonnegative, as in (5).

From (1) and (2) we obtain:

w^{*}=\sum_{i}^{N}\alpha_{i}^{*}y_{i}x_{i}\\ \sum_{i}^{N}\alpha_{i}^{*}y_{i}=0

The Dual Problem

By Lagrangian duality, the original problem can be rewritten as:

min_{w,b}\ max_{\alpha}\ \frac{1}{2}||w||^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i},\alpha_{i}\ge 0

that is,

max_{\alpha}\ min_{w,b}\ \frac{1}{2}||w||^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i},\alpha_{i}\ge 0

For the inner problem $min_{w,b}\ \frac{1}{2}||w||^{2}-\sum_{i}\alpha_{i}y_{i}(w\cdot x_{i}+b)+\sum_{i}\alpha_{i}$ , the KKT conditions give the optimal solution $w^{*}=\sum_{i}^{N}\alpha_{i}^{*}y_{i}x_{i}$ , along with $\sum_{i}^{N}\alpha_{i}^{*}y_{i}=0$ . Substituting these into the expression above reduces the original problem to:

\begin{align*} max_{\alpha}\quad&-\frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}\cdot x_{j})+\sum_{i}\alpha_{i}\\ s.t.\quad&\alpha_{i}\ge 0 \end{align*}

After solving for $\alpha^{*}$ , we can compute $w^{*}$ via $w^{*}=\sum_{i}^{N}\alpha_{i}^{*}y_{i}x_{i}$ . Since the separating hyperplane is determined by the parameters $w^{*},b^{*}$ , and the hyperplane is determined by the $\alpha_{i},x_{i},y_{i}$ with $\alpha_{i}\neq0$ , we then pick any $j$ such that $\alpha_{j}\neq0$ and obtain $b^{*}=y_{j}-w^{*}\cdot x_{j}$ . This yields the separating hyperplane $w^{*}\cdot x+b^{*}=0$ .

For the optimization problem of solving for $\alpha^{*}$ above, we will introduce the SMO algorithm in the next article to solve it. To be continued…

Technology

2018 · 08 · 10