Derivation of SVM (3)
In the previous article, we introduced the derivation of hard-margin SVM. This article continues with the mathematical derivation of soft-margin SVM, which allows some samples to be misclassified when the data are not linearly separable. Soft-margin SVM permits some samples to violate the constraint $y_i(w \cdot x_i + b) \ge 1$.
The optimization objective can be written as:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\mathrm{loss}\bigl(y_i(w \cdot x_i + b)\bigr)$$
Here, $C > 0$ is a constant that controls how heavily constraint violations are penalized. A common choice for $\mathrm{loss}(\cdot)$ is the hinge loss, $\ell_{\mathrm{hinge}}(z) = \max(0,\ 1 - z)$.
Thus, the optimization objective can be rewritten as:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\max\bigl(0,\ 1 - y_i(w \cdot x_i + b)\bigr)$$
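As a quick sanity check on this unconstrained form, here is a minimal NumPy sketch that evaluates the objective for a candidate $(w, b)$; the function name `soft_margin_objective` and its signature are my own illustration, not part of the original derivation.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Hinge-loss form of the soft-margin objective (illustrative sketch).

    X: (m, d) sample matrix, y: (m,) labels in {-1, +1},
    w: (d,) weight vector, b: scalar bias, C: penalty constant (assumed names).
    """
    margins = y * (X @ w + b)                  # y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)     # max(0, 1 - y_i (w . x_i + b))
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```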
By introducing slack variables $\xi_i \ge 0$, the above expression can be rewritten as:
$$
\begin{aligned}
\min_{w,b,\xi}\quad & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & y_i(w \cdot x_i + b) \ge 1 - \xi_i \\
& \xi_i \ge 0,\quad i = 1,2,\dots,m
\end{aligned}
$$
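To see why this constrained form is equivalent to the hinge-loss objective above: for any fixed $(w, b)$, the constraints force $\xi_i \ge 1 - y_i(w \cdot x_i + b)$ and $\xi_i \ge 0$, and minimizing $C\sum_i \xi_i$ drives each slack down to the smallest feasible value,

$$\xi_i = \max\bigl(0,\ 1 - y_i(w \cdot x_i + b)\bigr),$$

which is exactly the hinge loss of sample $i$.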
Similar to hard-margin SVM, the above problem is also a quadratic programming problem. It can be converted into a corresponding dual problem using Lagrangian duality and solved with the SMO algorithm. The Lagrangian function corresponding to the above problem is:
$$L(w,b,\alpha,\xi,\mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\bigl(1 - \xi_i - y_i(w \cdot x_i + b)\bigr) - \sum_{i=1}^{m}\mu_i\xi_i$$

where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers.
Setting the partial derivatives of $L$ with respect to $w$, $b$, and $\xi_i$ to 0, we get:
$$
\begin{aligned}
w &= \sum_{i=1}^{m}\alpha_i y_i x_i \\
0 &= \sum_{i=1}^{m}\alpha_i y_i \\
C &= \alpha_i + \mu_i
\end{aligned}
$$
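Carrying out the substitution explicitly (a step worth spelling out): the terms in $\xi_i$ cancel because $C = \alpha_i + \mu_i$, the term in $b$ vanishes because $\sum_i \alpha_i y_i = 0$, and what remains depends on $\alpha$ alone:

$$L = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j\, x_i \cdot x_j$$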
Maximizing this expression over $\alpha$ (equivalently, minimizing its negative) transforms the original problem into the dual problem:
$$
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j\, x_i \cdot x_j - \sum_{i=1}^{m}\alpha_i \\
\text{s.t.}\quad & \sum_{i=1}^{m}\alpha_i y_i = 0 \\
& 0 \le \alpha_i \le C,\quad i = 1,2,\dots,m
\end{aligned}
$$
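Before turning to SMO, it can be helpful to hand this dual to a generic constrained optimizer just to see the pieces fit together. The sketch below uses SciPy's SLSQP solver (my choice for illustration; the article itself relies on SMO) and then recovers $w$ and $b$ from the KKT conditions. The helper name `solve_soft_margin_dual` and the tolerance `1e-6` are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, y, C):
    """Solve the soft-margin dual with a generic solver (illustrative sketch).

    X: (m, d) samples, y: (m,) labels in {-1, +1}, C: penalty constant.
    """
    m = X.shape[0]
    Yx = y[:, None] * X                     # rows are y_i * x_i
    K = Yx @ Yx.T                           # K_ij = y_i y_j (x_i . x_j)

    def objective(alpha):                   # 1/2 a^T K a - sum(a)
        return 0.5 * alpha @ K @ alpha - alpha.sum()

    def gradient(alpha):
        return K @ alpha - np.ones(m)

    res = minimize(objective, np.zeros(m), jac=gradient, method="SLSQP",
                   bounds=[(0.0, C)] * m,                               # 0 <= a_i <= C
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum a_i y_i = 0
    alpha = res.x

    # Stationarity condition: w = sum_i alpha_i y_i x_i.
    w = (alpha * y) @ X
    # b from "free" support vectors (0 < alpha_i < C), whose margin is exactly 1.
    free = (alpha > 1e-6) & (alpha < C - 1e-6)
    b = float(np.mean(y[free] - X[free] @ w)) if free.any() else 0.0
    return w, b, alpha
```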
The only difference from hard-margin SVM is that the constraint $\alpha_i \ge 0$ becomes $0 \le \alpha_i \le C$. The SMO algorithm introduced earlier can still be used to solve this dual; the only change is that, when updating a pair of multipliers, the clipping step must respect both the lower bound 0 and the upper bound $C$.
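As a concrete picture of this two-sided clipping: SMO updates one pair $(\alpha_i, \alpha_j)$ at a time while keeping $\alpha_i y_i + \alpha_j y_j$ fixed, so the new $\alpha_j$ must be clipped to a segment $[L, H]$ inside the box $[0, C]$. The sketch below illustrates just that clipping step; the function name and argument layout are mine, not the article's.

```python
def clip_alpha(alpha_i, alpha_j, y_i, y_j, alpha_j_new, C):
    """Box-clipping step of SMO for the soft-margin dual (illustrative sketch).

    The pair must stay on the line alpha_i*y_i + alpha_j*y_j = const while
    both coordinates stay in [0, C], so the unconstrained update alpha_j_new
    is clipped to the feasible segment [L, H].
    """
    if y_i != y_j:
        L = max(0.0, alpha_j - alpha_i)
        H = min(C, C + alpha_j - alpha_i)
    else:
        L = max(0.0, alpha_i + alpha_j - C)
        H = min(C, alpha_i + alpha_j)
    return min(max(alpha_j_new, L), H)
```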
Later, we will introduce the kernel trick of SVM and commonly used kernels. To be continued…