6. Support Vector Machines#
Learning outcome
What are Lagrange multipliers in the context of constrained optimization?
Explain how the decision boundary of the maximum margin classifier depends on each data point. Illustrate your explanation with a drawing.
How does the kernel method allow us to separate otherwise not linearly separable data points?
When is a kernel valid?
Support Vector Machines are one of the most popular classic supervised learning algorithms. This lecture will first discuss the mathematical formalism of solving constrained optimization problems using Lagrange multipliers. We will then look at linear binary classification using the maximum margin and soft margin classifiers. Ultimately, we will present the kernel trick and how it extends the previous two approaches to non-linear boundaries within the more general Support Vector Machine.
6.1. The Constrained Optimization Problem#
We define a constrained optimization problem as

$$
\min_w \; f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \; i = 1, \dots, k, \qquad h_j(w) = 0, \; j = 1, \dots, l.
$$

Here, $f$ is the objective function, the $g_i$ define inequality constraints, and the $h_j$ define equality constraints, for $i = 1, \dots, k$ and $j = 1, \dots, l$. To solve it, we define the generalized Lagrangian

$$
\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{l} \beta_j h_j(w),
$$

with the additional Lagrange multipliers $\alpha_i \ge 0$ and $\beta_j$.

Now, we can verify that this optimization problem satisfies

$$
\theta_{\mathcal{P}}(w) := \max_{\alpha, \beta : \, \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta) =
\begin{cases}
f(w) & \text{if } w \text{ satisfies all constraints}, \\
\infty & \text{otherwise}.
\end{cases}
$$

Where does this case-by-case breakdown come from?
In the first case, the constraints are inactive and contribute nothing to the sums (the maximization drives $\alpha_i g_i(w)$ and $\beta_j h_j(w)$ to zero).
In the second case, at least one constraint is violated, and the corresponding sum increases linearly with $\alpha_i$ (or $|\beta_j|$), beyond all bounds.

That is, we recover the original primal problem with

$$
p^* = \min_w \theta_{\mathcal{P}}(w) = \min_w \max_{\alpha, \beta : \, \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta).
$$

Analogously, we define the dual optimization problem by exchanging the order of minimization and maximization,

$$
d^* = \max_{\alpha, \beta : \, \alpha_i \ge 0} \theta_{\mathcal{D}}(\alpha, \beta) = \max_{\alpha, \beta : \, \alpha_i \ge 0} \min_w \mathcal{L}(w, \alpha, \beta),
$$

with the dual function $\theta_{\mathcal{D}}(\alpha, \beta) := \min_w \mathcal{L}(w, \alpha, \beta)$.
From the last line, we can immediately infer the relation between the primal and the dual problems,

$$
d^* = \max_{\alpha, \beta : \, \alpha_i \ge 0} \min_w \mathcal{L}(w, \alpha, \beta) \; \le \; \min_w \max_{\alpha, \beta : \, \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta) = p^*,
$$

where the inequality can, under certain conditions, turn into an equality. The conditions for this to happen are the following (going back to the Lagrangian form from above):
- $f$ and the $g_i$ are convex, i.e., their Hessians are positive semi-definite.
- The $h_j$ are affine, i.e., they can be expressed as linear functions of their arguments.
Under the above conditions, the following holds:
- The optimal solution $w^*$ to the primal optimization problem exists.
- The optimal solution $(\alpha^*, \beta^*)$ to the dual optimization problem exists.
- $p^* = d^* = \mathcal{L}(w^*, \alpha^*, \beta^*)$.
- $w^*$, $\alpha^*$, and $\beta^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions.
The KKT conditions are:

$$
\begin{aligned}
\nabla_{w} \mathcal{L}(w^*, \alpha^*, \beta^*) &= 0, \\
\nabla_{\beta} \mathcal{L}(w^*, \alpha^*, \beta^*) &= 0, \\
\alpha_i^* \, g_i(w^*) &= 0, \quad i = 1, \dots, k, \\
g_i(w^*) &\le 0, \quad i = 1, \dots, k, \\
\alpha_i^* &\ge 0, \quad i = 1, \dots, k.
\end{aligned}
$$

Moreover, if some set $(w^*, \alpha^*, \beta^*)$ satisfies the KKT conditions, it is also a solution to the primal and dual problems. The third condition is the KKT dual complementarity condition, i.e., if $\alpha_i^* > 0$, then the corresponding constraint is active, $g_i(w^*) = 0$.
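To make the Lagrangian machinery concrete, here is a minimal numerical sketch (the toy objective and constraint are chosen purely for illustration): we minimize $f(w) = w_1^2 + w_2^2$ subject to $h(w) = w_1 + w_2 - 1 = 0$. Setting $\nabla_w \mathcal{L} = 0$ for $\mathcal{L}(w, \beta) = f(w) + \beta \, h(w)$ gives $w_1 = w_2 = 1/2$ (with $\beta = -1$), and a generic constrained solver recovers the same point.

```python
import numpy as np
from scipy.optimize import minimize

# Toy constrained problem: min f(w) = w1^2 + w2^2  s.t.  h(w) = w1 + w2 - 1 = 0
f = lambda w: w[0] ** 2 + w[1] ** 2
constraints = [{"type": "eq", "fun": lambda w: w[0] + w[1] - 1.0}]

res = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print(res.x)  # approx. [0.5, 0.5], matching the analytic Lagrange-multiplier solution
```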
6.2. Maximum Margin Classifier (MMC)#
Alternative names for Maximum Margin Classifier (MMC) are Hard Margin Classifier and Large Margin Classifier. This classifier assumes that the classes are linearly separable.
Now, we can (re-)introduce the linear discriminator. As we have seen, logistic regression models the probability of the positive class as $p(y = 1 \mid x; \theta) = \sigma(\theta^{\top} x)$, with the sigmoid function $\sigma(z) = 1 / (1 + e^{-z})$.

Fig. 6.1 Sigmoid function.#
If $\theta^{\top} x$ is large and positive, the sigmoid output is close to one and we are very confident that $y = 1$; if $\theta^{\top} x$ is large and negative, we are very confident that $y = 0$. The further a point lies from the decision boundary $\theta^{\top} x = 0$, the more confident the classification.

Fig. 6.2 Confidence of logistic regression classifier.#
The intuition here is that we seek model parameters $\theta$ that maximize the distance of all data points from the decision boundary.
For consistency with the standard SVM notation, we slightly reformulate this problem setting: the labels become $y \in \{-1, +1\}$, and the classifier is written as

$$
h_{w,b}(x) = g\!\left( w^{\top} x + b \right), \qquad g(z) = \begin{cases} +1 & z \ge 0, \\ -1 & z < 0, \end{cases}
$$

where the bias $b$ takes the role of the former intercept $\theta_0$, and $w$ collects the remaining parameters.
6.2.1. Functional Margin#
The functional margin of $(w, b)$ with respect to a training sample $(x^{(i)}, y^{(i)})$ is defined as

$$
\hat{\gamma}^{(i)} = y^{(i)} \left( w^{\top} x^{(i)} + b \right).
$$
For confident predictions, a good classifier should therefore exhibit a maximum gap between the two classes.

Fig. 6.3 Good vs bad linear decision boundary.#
For correctly classified samples, we always have $\hat{\gamma}^{(i)} > 0$, as $y^{(i)}$ and $w^{\top} x^{(i)} + b$ share the same sign. However, we could make this margin arbitrarily large without changing the actual decision boundary, e.g., by rescaling $(w, b) \to (2w, 2b)$. In other words, the functional margin is not scale invariant, while the classifier $g(w^{\top} x + b)$ depends only on the sign and does not reflect such a rescaling.
For the entire set of training samples, we define the functional margin as the smallest individual margin,

$$
\hat{\gamma} = \min_{i = 1, \dots, m} \hat{\gamma}^{(i)}.
$$
6.2.2. Geometric Margin#
Now we can define the geometric margin with respect to a single sample $(x^{(i)}, y^{(i)})$ as its (signed) Euclidean distance from the decision boundary.

Fig. 6.4 The geometric margin.#
The distance $\gamma^{(i)}$ of a point $x^{(i)}$ on the positive side from the decision boundary can be computed by noting that $B = x^{(i)} - \gamma^{(i)} \, w / \|w\|$ lies on the boundary, i.e., $w^{\top} B + b = 0$. Solving for $\gamma^{(i)}$ gives

$$
\gamma^{(i)} = \left( \frac{w}{\|w\|} \right)^{\top} x^{(i)} + \frac{b}{\|w\|},
$$

where $w / \|w\|$ is the unit normal vector of the decision boundary. As this was for an example on the positive side, we define the geometric margin in general as

$$
\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^{\top} x^{(i)} + \frac{b}{\|w\|} \right), \qquad \gamma = \min_{i = 1, \dots, m} \gamma^{(i)}.
$$

Please note that the geometric margin indeed is scale invariant. Also, note that for $\|w\| = 1$, the geometric margin equals the functional margin.
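A minimal sketch (with made-up toy data, for illustration only) that computes the functional and geometric margins of a training set and verifies the scale behavior discussed above:

```python
import numpy as np

# Toy data: two points per class in 2D, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

def margins(w, b):
    functional = y * (X @ w + b)                # functional margins of all samples
    geometric = functional / np.linalg.norm(w)  # geometric margins
    return functional.min(), geometric.min()    # margins of the training set

w, b = np.array([1.0, 1.0]), 0.0
print(margins(w, b))          # functional and geometric margin for (w, b)
print(margins(2 * w, 2 * b))  # functional margin doubles, geometric margin is unchanged
```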
6.2.3. Maximum Margin Classifier#
With these mathematical tools, we are now ready to derive the Support Vector Machine (SVM) for linearly separable sets (a.k.a. Maximum Margin Classifier) by maximizing the previously derived geometric margin:

$$
\max_{\gamma, w, b} \; \gamma \quad \text{s.t.} \quad y^{(i)} \left( w^{\top} x^{(i)} + b \right) \ge \gamma, \; i = 1, \dots, m, \qquad \|w\| = 1.
$$

However, in this formulation, the second constraint $\|w\| = 1$ is non-convex. We therefore rewrite the problem in terms of the functional margin,

$$
\max_{\hat{\gamma}, w, b} \; \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y^{(i)} \left( w^{\top} x^{(i)} + b \right) \ge \hat{\gamma}, \; i = 1, \dots, m,
$$

where we applied the definition of the geometric margin $\gamma = \hat{\gamma} / \|w\|$. Since the functional margin can be rescaled arbitrarily without changing the decision boundary, we may fix $\hat{\gamma} = 1$; maximizing $1 / \|w\|$ is then equivalent to minimizing $\|w\|^2$, s.t. the constraints are satisfied. This is now a convex objective with linear constraints.

The Support Vector Machine is generated by the primal optimization problem

$$
\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y^{(i)} \left( w^{\top} x^{(i)} + b \right) \ge 1, \; i = 1, \dots, m,
$$

or alternatively, with the constraints written in the standard form of the preceding section,

$$
g_i(w, b) = 1 - y^{(i)} \left( w^{\top} x^{(i)} + b \right) \le 0, \; i = 1, \dots, m.
$$

Upon checking with the KKT dual complementarity condition, we see that $\alpha_i > 0$ is only possible for samples whose constraint is active, i.e., whose functional margin is exactly one.

Fig. 6.5 Maximum Margin Classifier.#
The three samples “-”, “-”, and “+” in the sketch are the only ones for which the KKT constraint is active.
These are called the support vectors.
From the sketch, we can already ascertain that the number of support vectors may be significantly smaller than the number of samples, i.e., the number of active constraints that we have to take into account. Next, we construct the Lagrangian of the optimization problem:

$$
\mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} \left( w^{\top} x^{(i)} + b \right) - 1 \right].
$$
Note that in this case, our Lagrangian only has inequality constraints!
We can then formulate the dual problem by first minimizing $\mathcal{L}(w, b, \alpha)$ over $w$ and $b$ for fixed $\alpha$. Setting the derivatives to zero gives

$$
\nabla_w \mathcal{L} = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \;\; \Rightarrow \;\; w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}, \qquad \frac{\partial \mathcal{L}}{\partial b} = - \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.
$$

If we plug that into the Lagrangian (Eq. (6.28)), we get the dual

$$
W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \left\langle x^{(i)}, x^{(j)} \right\rangle,
$$

which we can then optimize as an optimization problem

$$
\max_{\alpha} \; W(\alpha) \quad \text{s.t.} \quad \alpha_i \ge 0, \; i = 1, \dots, m, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.
$$
The first constraint in this optimization problem singles out the support vectors (those with $\alpha_i > 0$), whereas the second constraint follows from our derivation of the dual of the Lagrangian (see above). The KKT conditions are then also satisfied:
- The first KKT condition ($\nabla_w \mathcal{L} = 0$) is satisfied by our first step in the conversion to the Lagrangian dual.
- The second KKT condition is not relevant, as there are no equality constraints.
- The third KKT condition ($\alpha_i \, g_i = 0$) is satisfied with $g_i(w, b) = 0$ for $\alpha_i > 0$, i.e., the support vectors, and $\alpha_i = 0$ for the others.
- The fourth KKT condition ($g_i(w, b) \le 0$) is satisfied by our construction of the Lagrangian.
- The fifth KKT condition ($\alpha_i \ge 0$) is satisfied by our dual optimization problem formulation.
The two then play together in the following fashion:
- The dual problem gives the optimal Lagrange multipliers $\alpha^*$.
- The primal problem gives $w^*$ and $b^*$, with $w^*$ given by

  $$
  w^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}.
  $$

Fig. 6.6 Maximum Margin Classifier solution.#
To derive $b^*$, we use the fact that the closest positive and negative samples (the support vectors) have functional margin exactly one, which yields

$$
b^* = - \frac{\max_{i: y^{(i)} = -1} w^{*\top} x^{(i)} + \min_{i: y^{(i)} = +1} w^{*\top} x^{(i)}}{2}.
$$

As $w^*$ is a linear combination of the training samples, the prediction for a new input $x$ can be written as

$$
w^{*\top} x + b^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} \left\langle x^{(i)}, x \right\rangle + b^*,
$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product and only the support vectors ($\alpha_i^* > 0$) contribute to the sum.
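As a small practical sketch (the toy data and the very large value of the regularization constant `C`, used here to emulate the hard margin, are assumptions for illustration), an off-the-shelf solver exposes exactly the quantities derived above: the support vectors, $w^*$, $b^*$, and the dual coefficients $\alpha_i^* y^{(i)}$.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.5], [3.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin (maximum margin) classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)       # the few samples with alpha_i > 0
print(clf.coef_, clf.intercept_)  # w* and b*

# Reconstruct w* from the dual solution: w* = sum_i alpha_i* y_i x_i
# (dual_coef_ stores alpha_i* y_i for the support vectors)
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(w_from_dual)                # matches clf.coef_
```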
6.3. Advanced Topics: Soft Margin Classifier (SMC)#
What if the data is not linearly separable?

Fig. 6.7 Non-linearly separable sets.#
Data may not be exactly linearly separable, or some data outliers may undesirably deform the exact decision boundary.
6.3.1. Outlier problem#

Fig. 6.8 Sensitivity of MMC to outliers.#
6.3.2. Original SVM optimization problem#
To make the algorithm work for non-linearly separable data, we introduce slack variables $\xi_i \ge 0$ and solve the $\ell_1$-regularized (penalized) optimization problem

$$
\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y^{(i)} \left( w^{\top} x^{(i)} + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m.
$$

Here,

- margin: the term $\frac{1}{2} \|w\|^2$ still maximizes the geometric margin,
- penalization: the term $C \sum_i \xi_i$ penalizes samples that violate the margin ($\xi_i > 0$),
- parameter: $C > 0$ controls the weight of the penalization relative to the margin maximization.
Then, the Lagrangian of the penalized optimization problem becomes

$$
\mathcal{L}(w, b, \xi, \alpha, r) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} \left( w^{\top} x^{(i)} + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i.
$$
In the above equation, the second term ($C \sum_i \xi_i$) is the slack penalization carried over from the primal objective, while $\alpha_i \ge 0$ and $r_i \ge 0$ are the Lagrange multipliers of the two sets of inequality constraints.
Setting the derivatives of the Lagrangian with respect to $w$, $b$, and $\xi_i$ to zero yields

$$
w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0, \qquad C - \alpha_i - r_i = 0.
$$

The last equation arises from the additional condition due to the slack variables. Upon inserting these conditions into the Lagrangian, we obtain the dual problem

$$
\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \left\langle x^{(i)}, x^{(j)} \right\rangle \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.
$$

The “box constraints” ($0 \le \alpha_i \le C$) are the only change compared to the dual of the Maximum Margin Classifier.
The resulting dual complementarity conditions for determining the support vectors become:
- Support vectors on slack margin ($\alpha_i = 0 \;\Rightarrow\; y^{(i)} (w^{\top} x^{(i)} + b) \ge 1$, i.e., the data point is ignored),
- Support vectors inside or outside margin ($\alpha_i = C \;\Rightarrow\; y^{(i)} (w^{\top} x^{(i)} + b) \le 1$, i.e., the data point violates the MMC),
- Support vectors on margin ($\alpha_i > 0$ & $\alpha_i < C \;\Rightarrow\; y^{(i)} (w^{\top} x^{(i)} + b) = 1$, i.e., the data point lies on the margin).
The optimal $w^*$ is again given by $w^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}$. For $b^*$, we can use any support vector lying exactly on the margin, i.e., any $i$ with $0 < \alpha_i^* < C$, for which $y^{(i)} \left( w^{*\top} x^{(i)} + b^* \right) = 1$. A numerically stable option is to average over all such support vectors $\mathcal{S}$:

$$
b^* = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} \left( y^{(i)} - w^{*\top} x^{(i)} \right).
$$

Recall: only data with $\alpha_i^* > 0$, i.e., support vectors, will contribute to the SVM prediction (last eq. of Maximum Margin Classifier).
In conclusion, we illustrate the functionality of the slack variables.

Fig. 6.9 Soft Margin Classifier.#
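A brief sketch of the effect of the penalization parameter (the synthetic overlapping data are an assumption for illustration): small $C$ tolerates many margin violations and yields many support vectors, while large $C$ penalizes violations strongly.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping Gaussian blobs: not perfectly linearly separable
X = np.vstack([rng.normal(loc=[-1.0, -1.0], scale=0.8, size=(50, 2)),
               rng.normal(loc=[+1.0, +1.0], scale=0.8, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Small C: violations are cheap -> wide margin, many support vectors.
# Large C: violations are expensive -> narrow margin, fewer support vectors.
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {int(clf.n_support_.sum()):3d} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```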
6.4. Advanced Topics: Sequential Minimal Optimization (SMO)#
SMO is an efficient algorithm for solving the SVM dual problem. It is based on the coordinate ascent algorithm.
6.4.1. Coordinate Ascent#
Task: find $\max_{\alpha} W(\alpha_1, \alpha_2, \dots, \alpha_m)$ for some function $W$.

Perform a component-wise search on the entries of $\alpha$, i.e., optimize one $\alpha_i$ at a time while keeping all other entries fixed.

Algorithm

do until converged
    for $i = 1, \dots, m$: $\alpha_i \leftarrow \arg\max_{\hat{\alpha}_i} W(\alpha_1, \dots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \dots, \alpha_m)$
    end for
end do
Sketch of algorithm

Fig. 6.10 Sequential Minimal Optimization.#
Coordinate ascent converges for smooth concave objectives, but it may not converge to the dual optimum when constraints couple the coordinates!
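A minimal coordinate ascent sketch on a strictly concave quadratic (the matrix $Q$ and vector $c$ are hypothetical, chosen only for illustration); each inner step maximizes $W$ over a single coordinate in closed form:

```python
import numpy as np

# W(a) = -0.5 a^T Q a + c^T a with positive definite Q (strictly concave)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, 1.0])
W = lambda a: -0.5 * a @ Q @ a + c @ a

a = np.zeros(2)
for _ in range(50):                 # "do until converged" (fixed iteration count here)
    for i in range(len(a)):         # component-wise maximization
        # Closed-form argmax over a_i with all other components fixed:
        # dW/da_i = c_i - sum_j Q_ij a_j = 0
        a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

print(a, W(a))                      # coordinate ascent result
print(np.linalg.solve(Q, c))        # exact maximizer for comparison
```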
6.4.2. Outline of SMO for SVM#
Task: solve the SVM dual optimization problem

$$
\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \left\langle x^{(i)}, x^{(j)} \right\rangle \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.
$$
Consider now an iterative update for finding the optimum:

- Each iteration step delivers a constraint-satisfying set of $\{\alpha_1, \dots, \alpha_m\}$.
- Can we just change some $\alpha_i \in \{\alpha_1, \dots, \alpha_m\}$ according to coordinate ascent for finding the next iteration update? No, because $\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$ constrains the sum of all $\alpha_i$, and varying only a single $\alpha_i$ may lead to a constraint violation.
- Fix it by changing a pair of multipliers $\alpha_i$, $\alpha_j$ simultaneously.
Algorithm
do until convergence criterion satisfied
    Given: a constraint-satisfying set $\{\alpha_1, \dots, \alpha_m\}$
    Select: a pair $\alpha_i$, $\alpha_j$ for the next update, following some heuristic that estimates which pair gives the fastest ascent
    Find: $\arg\max_{\alpha_i, \alpha_j} W(\alpha_1, \dots, \alpha_m)$ // except for $\alpha_i$ and $\alpha_j$, all other multipliers are kept fixed
end do
Convergence criterion for SMO: check whether KKT conditions are satisfied up to a chosen tolerance.
Discussion of SMO
- Assume a given constraint-satisfying set $\{\alpha_1, \dots, \alpha_m\}$.
- Pick $\alpha_1$ and $\alpha_2$ (without loss of generality) for optimization, keeping $\alpha_3, \dots, \alpha_m$ fixed.
- The equality constraint then requires

  $$
  \alpha_1 y^{(1)} + \alpha_2 y^{(2)} = - \sum_{i=3}^{m} \alpha_i y^{(i)} =: \zeta.
  $$

Note that the r.h.s. $\zeta$ is constant during the current iteration step.

Fig. 6.11 Truncation during Sequential Minimal Optimization.#
The box constraints $0 \le \alpha_1, \alpha_2 \le C$, together with the line defined by the equality constraint, imply lower and upper bounds on $\alpha_2$,

$$
L \le \alpha_2 \le H,
$$

as sketched in the figure above. How do we perform the pairwise update? Answer:

- Express $\alpha_1$ through $\alpha_2$ via $\alpha_1 = \left( \zeta - \alpha_2 y^{(2)} \right) y^{(1)}$, with which $W(\alpha_1, \alpha_2, \dots, \alpha_m)$ becomes a quadratic function of $\alpha_2$ alone.
- This quadratic can be solved analytically; the maximizer is then clipped to the box constraints, $\alpha_2 \leftarrow \min \left( \max \left( \alpha_2^{\text{new}}, L \right), H \right)$.
- Set $\alpha_1$ according to the linear relation above.
- Proceed with the next iteration update.
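For completeness, a sketch of the simplified SMO variant (random choice of the second multiplier instead of Platt's pair-selection heuristics; linear kernel; toy tolerances) that implements the pairwise update with clipping described above. This is an illustration under these assumptions, not a production solver.

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-4, max_passes=20, seed=0):
    """Simplified SMO: random second index, linear kernel, for illustration only."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    K = X @ X.T                                    # Gram matrix of the linear kernel
    alpha, b, passes = np.zeros(m), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            Ei = (alpha * y) @ K[:, i] + b - y[i]  # prediction error on sample i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(m) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # Box constraints L <= alpha_j <= H along the constraint line
                if y[i] != y[j]:
                    L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2.0 * K[i, j] - K[i, i] - K[j, j]   # curvature of W along alpha_j
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # Update the bias from the KKT conditions
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else 0.5 * (b1 + b2))
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
    return w, b, alpha
```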
6.5. Kernel Methods#
Consider binary classification in the non-linearly-separable case, assuming that there is a feature map $\phi$ that transforms the inputs into a (typically higher-dimensional) feature space in which the two classes become linearly separable.

Fig. 6.12 Feature map transformation.#
The feature map is essentially a change of basis,

$$
x \mapsto \phi(x).
$$

In general, the feature space may have a much higher dimension than the input space, possibly even an infinite one.
Example: XNOR
The following classification problem is non-linear, as there is no linear decision boundary.

Fig. 6.13 XNOR example.#
Upon defining a feature map that includes the product of the two inputs, e.g., $\phi(x) = \left( x_1, x_2, x_1 x_2 \right)^{\top}$, the two classes become linearly separable in feature space, as the sign of $x_1 x_2$ alone distinguishes them.

Fig. 6.14 XNOR after feature mapping.#
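A minimal sketch of the XNOR example (the concrete feature map $\phi(x) = (x_1, x_2, x_1 x_2)$ is an assumption for illustration): a linear classifier fails on the raw inputs but separates the lifted data perfectly.

```python
import numpy as np
from sklearn.svm import SVC

# XNOR-like data: label +1 when x1 and x2 share the same sign
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Feature map phi(x) = (x1, x2, x1*x2): adds the product as a third coordinate
phi = lambda X: np.column_stack([X, X[:, 0] * X[:, 1]])

print(SVC(kernel="linear", C=1e3).fit(X, y).score(X, y))            # < 1.0: no linear boundary
print(SVC(kernel="linear", C=1e3).fit(phi(X), y).score(phi(X), y))  # 1.0: separable in 3D
```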
Example: Circular Region
Given is a set of points $x \in \mathbb{R}^2$, labeled by whether they lie inside or outside a circular region.

Fig. 6.15 Binary classification of a circular region (Source: Wikipedia).#
Here, it is again obvious that if we embed the inputs in a 3D space by adding their squares, i.e., $\phi(x) = \left( x_1, x_2, x_1^2 + x_2^2 \right)^{\top}$, the two classes become separable by a plane.

But of course, these examples are constructed, as here we could immediately guess a suitable feature map. In general, a good feature map is hard to find, and computing very high-dimensional features explicitly quickly becomes expensive.

Recall: the dual problem of the SVM involves only scalar products $\left\langle x^{(i)}, x^{(j)} \right\rangle$ of feature vectors. This motivates the general notion of a dual problem with feature maps, in which the inputs appear solely through scalar products $\left\langle \phi(x^{(i)}), \phi(x^{(j)}) \right\rangle$.
6.5.1. Dual representations#
Motivated by Least Mean Squares (LMS) regression, we consider the following regularized cost function:

$$
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta^{\top} \phi(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \theta^{\top} \theta,
$$

with penalty parameter $\lambda > 0$.
Regularization helps to suppress the overfitting problem (see Tricks of Optimization). The squared L2 regularization above is also called Tikhonov regularization. In machine learning, linear regression in combination with Tikhonov regularization is often dubbed ridge regression.
Setting the gradient of that loss to zero, we can express the optimal parameters as a linear combination of the feature vectors,

$$
\theta = -\frac{1}{\lambda} \sum_{i=1}^{m} \left( \theta^{\top} \phi(x^{(i)}) - y^{(i)} \right) \phi(x^{(i)}) = \Phi^{\top} a,
$$

with design matrix $\Phi \in \mathbb{R}^{m \times p}$, whose $i$-th row is $\phi(x^{(i)})^{\top}$, and the dual variables

$$
a_i = -\frac{1}{\lambda} \left( \theta^{\top} \phi(x^{(i)}) - y^{(i)} \right).
$$

Substituting the necessary condition $\theta = \Phi^{\top} a$ back into $J$ and expressing everything in terms of $a$ gives

$$
J(a) = \frac{1}{2} a^{\top} K K a - a^{\top} K y + \frac{1}{2} y^{\top} y + \frac{\lambda}{2} a^{\top} K a.
$$

Here, $K = \Phi \Phi^{\top} \in \mathbb{R}^{m \times m}$ is the Gram matrix with entries

$$
K_{ij} = \phi(x^{(i)})^{\top} \phi(x^{(j)}) = k(x^{(i)}, x^{(j)}),
$$

where $k(x, x') = \phi(x)^{\top} \phi(x')$ is the kernel function.

Now we find that we can express the LMS prediction in terms of the kernel function. In order to find $a$, we set the gradient of $J(a)$ to zero, which yields

$$
a = \left( K + \lambda I_m \right)^{-1} y.
$$

Upon inserting this result into a linear regression model with feature mapping, the prediction for a new input $x$ becomes

$$
y(x) = \theta^{\top} \phi(x) = a^{\top} \Phi \, \phi(x) = k(x)^{\top} \left( K + \lambda I_m \right)^{-1} y,
$$

where $k(x) \in \mathbb{R}^m$ is the vector with entries $k_i(x) = k(x^{(i)}, x)$.
The kernel trick refers to this formulation of the learning problem, which relies on computing the kernel similarities $k(x^{(i)}, x^{(j)})$ between pairs of input points $x^{(i)}$ and $x^{(j)}$, instead of computing the feature map $\phi(x)$ explicitly.
The term Support Vector Machine typically refers to a margin classifier using the kernel formulation.
Now, let's do some bookkeeping using the number of data points $m$ and the dimension $p$ of the feature space. Recall from the lecture on linear models that the closed-form solution of ridge regression, $\theta = \left( \Phi^{\top} \Phi + \lambda I_p \right)^{-1} \Phi^{\top} y$, requires inverting a $p \times p$ matrix, whereas the dual solution $a = (K + \lambda I_m)^{-1} y$ requires inverting an $m \times m$ matrix. As typically $m \gg p$, the dual formulation seems less efficient at first sight. The benefit of the dual problem with the kernel is that it is expressed entirely in terms of the kernel function $k(x, x')$: we never have to compute $\phi(x)$ explicitly and can therefore work with very high-dimensional, or even infinite-dimensional, feature spaces.
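As a minimal sketch of the dual (kernel) formulation derived above (1D toy data, a Gaussian kernel, and the value of $\lambda$ are assumptions for illustration): the fit reduces to solving an $m \times m$ linear system, and predictions use only kernel evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows in A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

lam = 1e-2
K = gaussian_kernel(X, X)                              # Gram matrix K_ij = k(x_i, x_j)
a = np.linalg.solve(K + lam * np.eye(len(X)), y)       # dual variables a = (K + lambda I)^{-1} y

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = gaussian_kernel(X_test, X) @ a                # y(x) = k(x)^T a
print(np.column_stack([np.sin(X_test[:, 0]), y_pred])) # close to the noise-free target
```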
6.5.2. Construction of suitable kernels#
- Construction from the feature map:

  $$
  k(x, x') = \phi(x)^{\top} \phi(x'). \qquad (6.66)
  $$

- Direct construction, with the constraint that a valid kernel is obtained, i.e., it actually needs to correspond to a possible feature map scalar product.
A necessary and sufficient condition for a valid kernel is that the corresponding Gram matrix $K$, with entries $K_{ij} = k(x^{(i)}, x^{(j)})$, is symmetric and positive semi-definite for any finite set of points $\{x^{(1)}, \dots, x^{(m)}\}$.
Example: Kernel 1
Given is the kernel

$$
k(x, z) = \left( x^{\top} z \right)^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( x_i x_j \right) \left( z_i z_j \right).
$$

The corresponding feature map for $n = 3$ is

$$
\phi(x) = \left( x_1 x_1, \, x_1 x_2, \, x_1 x_3, \, x_2 x_1, \, x_2 x_2, \, x_2 x_3, \, x_3 x_1, \, x_3 x_2, \, x_3 x_3 \right)^{\top},
$$

i.e., the kernel computes the scalar product in a 9-dimensional feature space while requiring only $\mathcal{O}(n)$ operations.
Example: Kernel 2
Alternative kernel with parameter $c \ge 0$:

$$
k(x, z) = \left( x^{\top} z + c \right)^2 = \sum_{i,j=1}^{n} \left( x_i x_j \right) \left( z_i z_j \right) + \sum_{i=1}^{n} \left( \sqrt{2c} \, x_i \right) \left( \sqrt{2c} \, z_i \right) + c^2.
$$

It belongs to the feature map (again for $n = 3$)

$$
\phi(x) = \left( x_1 x_1, \, x_1 x_2, \, \dots, \, x_3 x_3, \, \sqrt{2c} \, x_1, \, \sqrt{2c} \, x_2, \, \sqrt{2c} \, x_3, \, c \right)^{\top},
$$

where the parameter $c$ controls the relative weighting of the first- and second-order terms. Considering that the evaluation of the kernel still requires only $\mathcal{O}(n)$ operations while the feature space grows rapidly with the polynomial degree, the direct kernel evaluation is far cheaper than computing the feature map explicitly.
Example: Gaussian kernel

A widely used kernel is the Gaussian (RBF) kernel

$$
k(x, z) = \exp \left( - \frac{\|x - z\|^2}{2 \sigma^2} \right),
$$

which corresponds to an infinite-dimensional feature map.
Now, we illustrate that a valid kernel is positive semidefinite, which is the above-mentioned necessary condition.
6.5.2.1. Proof:#
For any set of points $\{x^{(1)}, \dots, x^{(m)}\}$ and any vector $z \in \mathbb{R}^m$, the Gram matrix $K_{ij} = k(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^{\top} \phi(x^{(j)})$ of a valid kernel satisfies

$$
z^{\top} K z = \sum_{i=1}^{m} \sum_{j=1}^{m} z_i \, \phi(x^{(i)})^{\top} \phi(x^{(j)}) \, z_j = \left\| \sum_{i=1}^{m} z_i \, \phi(x^{(i)}) \right\|^2 \ge 0,
$$

i.e., $K$ is positive semi-definite. The necessary and sufficient condition is due to Mercer's theorem:

A given kernel $k(x, z)$ is valid, i.e., it corresponds to a feature map scalar product, if and only if the Gram matrix $K$ is symmetric positive semi-definite for every finite set of points $\{x^{(1)}, \dots, x^{(m)}\}$.

We replace the scalar products $\left\langle x^{(i)}, x^{(j)} \right\rangle$ in the SVM by kernel evaluations $k(x^{(i)}, x^{(j)})$. This pulls through into the dual problem for determining $\alpha$ and into the prediction, which then depends only on kernel evaluations between the support vectors and the new input.
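As a small empirical sanity check (random points and a kernel width chosen arbitrarily; this complements, but does not replace, the proof above), the Gram matrix of the Gaussian kernel is symmetric and has no negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
sigma = 1.5

# Gram matrix of the Gaussian kernel, K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2.0 * sigma**2))

eigvals = np.linalg.eigvalsh(K)                        # eigvalsh: K is symmetric
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)    # True True -> positive semi-definite
```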
6.6. Recapitulation of SVMs#
- Problem of non-linearly separable sets:
  - introduction of controlled violations of the strict margin constraints,
  - introduction of feature maps.
- Controlled violation: slack variables $\xi_i$
  - $\ell_1$-regularized optimization problem $\rightarrow$ modified dual problem $\rightarrow$ solution by SMO.
- Feature map: kernel function (scalar product of feature maps)
  - reformulated dual optimization problem $\rightarrow$ prediction depends only on the kernel function.
- Kernel construction: a lot of freedom, but the result must be a valid kernel.
- Important kernel: the Gaussian kernel.
Solution of non-separable problems:

- Pick a suitable kernel function.
- Run the SVM with the kernel function.
6.7. Further References#
[Ng, 2022], Chapters 5 and 6
[Murphy, 2022], Section 17.3
[Bishop, 2006], Appendix E: Lagrange Multipliers