4. Support Vector Machines#

At the end of this exercise you will know:

  • How to train a SVM using Sequential Minimal Optimization (SMO)

  • How to train a SVM using Gradient Descent (GD)

  • How different SVM Kernels perform

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from matplotlib import pyplot as plt

from sklearn.datasets import make_blobs

Summary of the mathematical formalism

Soft Margin SVM Lagrangian:

\[ \mathcal{L}(\omega, b, \xi, \alpha, \mu)=\frac{1}{2} \omega^{T} \omega+C \sum_{i=1}^{m} \xi_{i} -\sum_{i=1}^{m} \alpha_{i}\left[y^{(i)}\left(\omega^{\top} x^{(i)}+b\right)-1+\xi_{i}\right]-\sum_{i=1}^{m} \mu_{i} \xi_{i}. \]

Primal problem:

\[ \min _{\omega, b} \left( \frac{1}{2}\|\omega\|^{2}+C \sum_{i=1}^{m} \xi_{i}\right) \]
\[\begin{split} \text{s.t.}\left\{\begin{array}{l}y^{(i)}\left(\omega^{\top} x^{(i)}+b\right) \geq 1-\xi_{i}, \quad i=1, \ldots, m \\ \xi_{i} \ge 0, \quad i=1, \ldots, \mathrm{m}\end{array}\right. \end{split}\]

Dual problem:

\[ \max_{\alpha} \left( \sum_{i=1}^{m} \alpha_{i}-\frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_{i} \alpha_{j} K(x^{(i)}, x^{(j)}) \right) \]
\[\begin{split} \text { s.t. }\left\{\begin{array}{l} 0 \leq \alpha_{i} \leq C, \quad i=1, \ldots, m \\ \sum_{i=1}^{m} \alpha_{i} y^{(i)}=0 \end{array}\right. \end{split}\]

In the dual problem, we have used \(w=\sum_{i=1}^{m} \alpha_{i} y^{(i)} x^{(i)}\) and \(\sum_{i=1}^{m} \alpha_{i} y^{(i)}=0\) to eliminate \(w\) and \(b\). To recover \(b\), we use the heuristic \(b^* = \frac{1}{m_{\Sigma}} \sum_{j=1}^{m_{\Sigma}}\left(y^{(j)}-\sum_{i=1}^{m_{\Sigma}} \alpha_{i}^{*} y^{(i)}K(x^{(i)}, x^{(j)})\right)\), averaged over the \(m_{\Sigma}\) support vectors. Note that in the linear case the kernel reduces to the plain inner product \(K(x^{(i)}, x^{(j)}) = \left\langle x^{(i)}, x^{(j)}\right\rangle\). Also note that the kernel appears only in the dual formulation, which is where the kernel trick can be applied.
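
The bias heuristic above translates directly into a few lines of NumPy. Below is a minimal sketch, assuming a trained multiplier vector `alphas`, labels `Y`, inputs `X`, and a `kernel(A, B)` function returning the Gram matrix (these names are placeholders; later in this notebook they live inside a model object):

# Minimal sketch of the bias heuristic b*, averaged over the support vectors
sv = alphas > 1e-8                  # support vectors have non-zero multipliers
K_sv = kernel(X[sv], X[sv])         # Gram matrix restricted to the support vectors
b_star = np.mean(Y[sv] - (alphas[sv] * Y[sv]) @ K_sv)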

Note: for \(C \to \infty\) this problem ends up being the Hard Margin SVM.

4.1. Artificial Dataset#

We use scikit-learn and make_blobs to generate a binary dataset with input features \(x\in \mathbb{R}^2\) and labels \(y\in \{-1, +1\}\).

# X as features and Y as labels
X, Y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.6)
# by default the labels are {0, 1}, so we change them to {-1,1}
Y = np.where(Y==0, -1, 1)

# we also center the input data (per dimension) and scale it to unit variance to make training more efficient
X = (X - X.mean(axis=0))/X.std(axis=0)

plt.figure(figsize=(6, 6))
plt.scatter(x=X[:, 0], y=X[:, 1], c=Y)
(Figure: scatter plot of the two standardized blob classes.)

4.2. Sequential Minimal Optimization (SMO)#

This algorithm was originally developed by John Platt in 1998 specifically for training SVMs. It solves the dual problem in a gradient-free manner: it selects two multipliers \(\alpha_i\) and \(\alpha_j\), optimizes them analytically while keeping all other \(\alpha\)’s fixed, and then iteratively repeats this procedure over all \(\alpha\)’s. Its efficiency comes from the heuristic used for selecting the two \(\alpha\) values, which is based on information from previous iterations. In the end we obtain a vector of \(m\) values of \(\alpha\), one per training data point; most of them are \(0\), and only the non-zero ones contribute to the predictions made by the model.
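
For a chosen pair of multipliers \((\alpha_1, \alpha_2)\) with labels \((y_1, y_2)\), errors \(E_i = f(x^{(i)}) - y^{(i)}\), and \(\eta = 2K_{12} - K_{11} - K_{22}\) (written here to match the implementation below), the analytic pairwise update is

\[ \alpha_2^{new} = \alpha_2 - \frac{y_2 (E_1 - E_2)}{\eta}, \qquad \alpha_2^{new,clipped} = \min\left(H, \max\left(L, \alpha_2^{new}\right)\right), \qquad \alpha_1^{new} = \alpha_1 + y_1 y_2 \left(\alpha_2 - \alpha_2^{new,clipped}\right), \]

where \(L\) and \(H\) are chosen such that both multipliers remain in \([0, C]\) while the constraint \(\alpha_1 y_1 + \alpha_2 y_2 = \mathrm{const}\) is preserved.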

We adapt the implementation of the SMO algorithm from this reference code by Jon Charest.

Visualization Utils

def plot_decision_boundary(model, ax, resolution=100, colors=('b', 'k', 'r'), levels=(-1, 0, 1)):
    """Plots the model's decision boundary on the input axes object.
        Range of decision boundary grid is determined by the training data.
        Returns decision boundary grid and axes object (`grid`, `ax`)."""

    # Generate coordinate grid of shape [resolution x resolution]
    # and evaluate the model over the entire space
    xrange = np.linspace(model.X[:, 0].min(), model.X[:, 0].max(), resolution)
    yrange = np.linspace(model.X[:, 1].min(),
                         model.X[:, 1].max(), resolution)
    grid = [[decision_function(model.alphas, model.Y,
                               model.kernel, model.X,
                               np.array([xr, yr]), model.b) for xr in xrange] for yr in yrange]
    grid = np.array(grid).reshape(len(xrange), len(yrange))

    # Plot decision contours using grid and
    # make a scatter plot of training data
    ax.contour(xrange, yrange, grid, levels=levels, linewidths=(1, 1, 1),
               linestyles=('--', '-', '--'), colors=colors)
    ax.scatter(model.X[:, 0], model.X[:, 1],
               c=model.Y, cmap=plt.cm.viridis, lw=0, alpha=0.25)

    # Plot support vectors (non-zero alphas)
    # as circled points (linewidth > 0)
    mask = np.round(model.alphas, decimals=2) != 0.0
    ax.scatter(model.X[mask, 0], model.X[mask, 1],
               c=model.Y[mask], cmap=plt.cm.viridis, lw=1, edgecolors='k')

    return grid, ax

As a first step, we define a generic SMO model.

class SMOModel:
    """Container object for the model used for sequential minimal optimization."""

    def __init__(self, X, Y, C, kernel, alphas, b, errors):
        self.X = X               # training data vector
        self.Y = Y               # class label vector
        self.C = C               # regularization parameter
        self.kernel = kernel     # kernel function
        self.alphas = alphas     # lagrange multiplier vector
        self.b = b               # scalar bias term
        self.errors = errors     # error cache used for selection of alphas
        self._obj = []           # record of objective function value
        self.m = len(self.X)     # store size of training set

The next thing we need to define is the kernel. We start with the simplest linear kernel

\[K(x,x') = x^{\top} x' + b.\]

The implementation of the radial basis function

\[K(x,x') = \exp \left\{ - \gamma ||x-x'||_2^2 \right\} \]

is also provided for comparison.

def linear_kernel(x, y, b=1):
    """Returns the linear combination of arrays `x` and `y` with
    the optional bias term `b` (set to 1 by default)."""

    return x @ y.T + b  # Note the @ operator for matrix multiplication

def gaussian_kernel(x, y, gamma=1):
    """Returns the gaussian similarity of arrays `x` and `y` with
    kernel inverse width parameter `gamma` (set to 1 by default)."""
    
    ######################
    # TODO: you might find this helpful: https://jonchar.net/notebooks/SVM/
    
    if np.ndim(x) == 1 and np.ndim(y) == 1:
        result = np.exp(- gamma * (np.linalg.norm(x - y, 2)) ** 2)
    elif (np.ndim(x) > 1 and np.ndim(y) == 1) or (np.ndim(x) == 1 and np.ndim(y) > 1):
        result = np.exp(- gamma * (np.linalg.norm(x - y, 2, axis=1) ** 2))
    elif np.ndim(x) > 1 and np.ndim(y) > 1:
        result = np.exp(- gamma * (np.linalg.norm(x[:, np.newaxis] -
                        y[np.newaxis, :], 2, axis=2) ** 2))
    return result
    #######################
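
As a quick, illustrative shape check (on dummy data, not part of the exercise), both kernels return a full Gram matrix when given two batches of samples:

A = np.random.rand(5, 2)
B = np.random.rand(3, 2)
print(linear_kernel(A, B).shape)       # (5, 3)
print(gaussian_kernel(A, B).shape)     # (5, 3)
print(gaussian_kernel(A[0], B).shape)  # (3,) -- a single sample against a batch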

Now, using the dual problem formulation and a kernel, we define the objective and decision functions. The decision function simply implements \(\omega^{\top} x + b\) by using the kernel trick and the relation \(\omega=\sum_{i=1}^{m} \alpha_{i} y^{(i)} x^{(i)}\). Note that the implementation below follows the reference code and uses the convention \(f(x) = \sum_{i} \alpha_{i} y^{(i)} K(x^{(i)}, x) - b\), i.e. the bias is subtracted.

# Objective function to optimize, i.e. loss function

def objective_function(alphas, target, kernel, X_train):
    """Returns the SVM objective function based in the input model defined by:
    `alphas`: vector of Lagrange multipliers
    `target`: vector of class labels (-1 or 1) for training data
    `kernel`: kernel function
    `X_train`: training data for model."""

    return np.sum(alphas) - 0.5 * np.sum((target[:, None] * target[None, :]) * kernel(X_train, X_train) * (alphas[:, None] * alphas[None, :]))


# Decision function, i.e. forward model evaluation

def decision_function(alphas, target, kernel, X_train, x_test, b):
    """Applies the SVM decision function to the input feature vectors in `x_test`."""

    result = (alphas * target) @ kernel(X_train, x_test) - b
    return result

The SMO algorithm

We are now ready to implement the SMO algorithm as given in Platt’s paper. The implementation is split into three functions: take_step, examine_example, and train.

  • train is the main training loop; it also implements the selection of the first of the two \(\alpha\) values.

  • examine_example implements the selection of the second \(\alpha\) value.

  • take_step optimizes the two \(\alpha\) values, the bias \(b\), and the error cache.

def take_step(i1, i2, model):

    # Skip if chosen alphas are the same
    if i1 == i2:
        return 0, model

    alph1 = model.alphas[i1]
    alph2 = model.alphas[i2]
    y1 = model.Y[i1]
    y2 = model.Y[i2]
    E1 = model.errors[i1]
    E2 = model.errors[i2]
    s = y1 * y2

    # Compute L & H, the bounds on new possible alpha values
    if (y1 != y2):
        L = max(0, alph2 - alph1)
        H = min(model.C, model.C + alph2 - alph1)
    elif (y1 == y2):
        L = max(0, alph1 + alph2 - model.C)
        H = min(model.C, alph1 + alph2)
    if (L == H):
        return 0, model

    # Compute kernel & 2nd derivative eta
    k11 = model.kernel(model.X[i1], model.X[i1])
    k12 = model.kernel(model.X[i1], model.X[i2])
    k22 = model.kernel(model.X[i2], model.X[i2])
    eta = 2 * k12 - k11 - k22

    # Compute new alpha 2 (a2) if eta is negative
    if (eta < 0):
        a2 = alph2 - y2 * (E1 - E2) / eta
        # Clip a2 based on bounds L & H
        if L < a2 < H:
            a2 = a2
        elif (a2 <= L):
            a2 = L
        elif (a2 >= H):
            a2 = H

    # If eta is non-negative, move new a2 to bound with greater objective function value
    else:
        alphas_adj = model.alphas.copy()
        alphas_adj[i2] = L
        # objective function output with a2 = L
        Lobj = objective_function(alphas_adj, model.Y, model.kernel, model.X)
        alphas_adj[i2] = H
        # objective function output with a2 = H
        Hobj = objective_function(alphas_adj, model.Y, model.kernel, model.X)
        if Lobj > (Hobj + eps):
            a2 = L
        elif Lobj < (Hobj - eps):
            a2 = H
        else:
            a2 = alph2

    # Push a2 to 0 or C if very close
    if a2 < 1e-8:
        a2 = 0.0
    elif a2 > (model.C - 1e-8):
        a2 = model.C

    # If examples can't be optimized within epsilon (eps), skip this pair
    if (np.abs(a2 - alph2) < eps * (a2 + alph2 + eps)):
        return 0, model

    # Calculate new alpha 1 (a1)
    a1 = alph1 + s * (alph2 - a2)

    # Update threshold b to reflect newly calculated alphas
    # Calculate both possible thresholds
    b1 = E1 + y1 * (a1 - alph1) * k11 + y2 * (a2 - alph2) * k12 + model.b
    b2 = E2 + y1 * (a1 - alph1) * k12 + y2 * (a2 - alph2) * k22 + model.b

    # Set new threshold based on whether a1 or a2 is at a bound
    # (use model.C rather than the global C so the function is self-contained)
    if 0 < a1 < model.C:
        b_new = b1
    elif 0 < a2 < model.C:
        b_new = b2
    # Average thresholds if both are at a bound
    else:
        b_new = (b1 + b2) * 0.5

    # Update model object with new alphas & threshold
    model.alphas[i1] = a1
    model.alphas[i2] = a2

    # Update error cache
    # Error cache for optimized alphas is set to 0 if they're unbound
    for index, alph in zip([i1, i2], [a1, a2]):
        if 0.0 < alph < model.C:
            model.errors[index] = 0.0

    # Set non-optimized errors based on equation 12.11 in Platt's book
    non_opt = [n for n in range(model.m) if (n != i1 and n != i2)]
    model.errors[non_opt] = model.errors[non_opt] + \
        y1*(a1 - alph1)*model.kernel(model.X[i1], model.X[non_opt]) + \
        y2*(a2 - alph2) * \
        model.kernel(model.X[i2], model.X[non_opt]) + model.b - b_new

    # Update model threshold
    model.b = b_new

    return 1, model
def examine_example(i2, model):

    y2 = model.Y[i2]
    alph2 = model.alphas[i2]
    E2 = model.errors[i2]
    r2 = E2 * y2

    # Proceed if error is within specified tolerance (tol)
    if ((r2 < -tol and alph2 < model.C) or (r2 > tol and alph2 > 0)):

        if len(model.alphas[(model.alphas != 0) & (model.alphas != model.C)]) > 1:
            # Use the 2nd choice heuristic: choose the alpha with maximum difference in error
            if model.errors[i2] > 0:
                i1 = np.argmin(model.errors)
            elif model.errors[i2] <= 0:
                i1 = np.argmax(model.errors)
            step_result, model = take_step(i1, i2, model)
            if step_result:
                return 1, model

        # Loop through non-zero and non-C alphas, starting at a random point
        for i1 in np.roll(np.where((model.alphas != 0) & (model.alphas != model.C))[0],
                          np.random.choice(np.arange(model.m))):
            step_result, model = take_step(i1, i2, model)
            if step_result:
                return 1, model

        # loop through all alphas, starting at a random point
        for i1 in np.roll(np.arange(model.m), np.random.choice(np.arange(model.m))):
            step_result, model = take_step(i1, i2, model)
            if step_result:
                return 1, model

    return 0, model
def train(model):

    numChanged = 0
    examineAll = 1 # loop over each alpha in first round

    while (numChanged > 0) or (examineAll):
        numChanged = 0
        if examineAll:
            # loop over all training examples
            for i in range(model.alphas.shape[0]):
                examine_result, model = examine_example(i, model)
                numChanged += examine_result
                if examine_result:
                    obj_result = objective_function(
                        model.alphas, model.Y, model.kernel, model.X)
                    model._obj.append(obj_result)
        else:
            # loop over examples where alphas are not already at their limits
            for i in np.where((model.alphas != 0) & (model.alphas != model.C))[0]:
                examine_result, model = examine_example(i, model)
                numChanged += examine_result
                if examine_result:
                    obj_result = objective_function(
                        model.alphas, model.Y, model.kernel, model.X)
                    model._obj.append(obj_result)
        if examineAll == 1:
            examineAll = 0
        elif numChanged == 0:
            examineAll = 1

    return model

We are now ready to define the model (after defining some hyperparameters).

# Set model parameters and initial values
C = 10
m = len(X)
initial_alphas = np.zeros(m)
initial_b = 0.0

# Set tolerances
tol = 0.01  # error tolerance
eps = 0.01  # alpha tolerance

# Instantiate model
model = SMOModel(
    X, Y, C, 
    kernel=gaussian_kernel, # TODO: try linear_kernel and  gaussian_kernel
    alphas=initial_alphas,
    b=initial_b,
    errors= np.zeros(m)
)

# Initialize error cache
initial_error = decision_function(model.alphas, model.Y, model.kernel,
                                  model.X, model.X, model.b) - model.Y
model.errors = initial_error
np.random.seed(0)
output = train(model)
fig, ax = plt.subplots()
grid, ax = plot_decision_boundary(output, ax)
(Figure: decision boundary of the SMO model trained with the Gaussian kernel; support vectors are circled.)
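
As a quick, optional check (not part of the original exercise), we can compute the fraction of training points that the trained model classifies correctly via the sign of its decision function:

# training accuracy of the SMO solution
scores = decision_function(output.alphas, output.Y, output.kernel,
                           output.X, output.X, output.b)
print("training accuracy:", np.mean(np.sign(scores) == output.Y))
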
# loss curve
# note: we started with all alphas = 0 and turned some of them on one by one, and then refined.

plt.plot(model._obj)
(Figure: objective function values recorded over the SMO iterations.)

4.3. Multiclass Classification with SVM and SMO#

We look at a problem we have seen before: the classification of the iris dataset. The task is to use two of the measured input features (“petal_length” and “petal_width”) and to build a classifier capable of distinguishing among the three possible flowers, which we index by [0, 1, 2].

Getting the data follows the same process as in the exercise on Linear and Logistic Regression, in the Logistic Regression section.

# get iris dataset
from urllib.request import urlretrieve
iris = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
urlretrieve(iris)
df0 = pd.read_csv(iris, sep=',')

# name columns
attributes = ["sepal_length", "sepal_width",
              "petal_length", "petal_width", "class"]
df0.columns = attributes

# add species index
species = list(df0["class"].unique())
df0["class_idx"] = df0["class"].apply(species.index)

print("Count occurence of each class:")
print(df0["class"].value_counts())

# let's extract two of the features, and the indexed classes [0,1,2]
df = df0[["petal_length", "petal_width", "class_idx"]]

X_train = df[['petal_length', 'petal_width']].to_numpy()
Y_train = df['class_idx'].to_numpy()
print("Training data:")
print(df)
Count occurrences of each class:
class
Iris-versicolor    50
Iris-virginica     50
Iris-setosa        49
Name: count, dtype: int64
Training data:
     petal_length  petal_width  class_idx
0             1.4          0.2          0
1             1.3          0.2          0
2             1.5          0.2          0
3             1.4          0.2          0
4             1.7          0.4          0
..            ...          ...        ...
144           5.2          2.3          2
145           5.0          1.9          2
146           5.2          2.0          2
147           5.4          2.3          2
148           5.1          1.8          2

[149 rows x 3 columns]
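
Before building the classifier, it can help to take a quick (optional) look at the two selected features, analogous to the blob scatter plot above:

plt.figure(figsize=(6, 6))
plt.scatter(x=X_train[:, 0], y=X_train[:, 1], c=Y_train)
plt.xlabel("petal_length")
plt.ylabel("petal_width")
plt.show()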

Exercise

Now, implement an SVM-based multi-class classifier that can be trained using the SMO algorithm. Compare with the solution presented below.

Hint: this repository and the one-vs-all classifier.

####################

class OneVsAll:

    def __init__(self, solver, num_classes, **kwargs):
        self._binary_clf = [solver(i, **kwargs) for i in range(num_classes)]
        self._num_classes = num_classes

    def predict(self, x):
        n = x.shape[0]
        scores = np.zeros((n, self._num_classes))

        for idx in range(self._num_classes):
            model = self._binary_clf[idx]
            scores[:, idx] = decision_function(
                model.alphas,
                model.Y,
                model.kernel,
                model.X,
                x,
                model.b)

        pred = np.argmax(scores, axis=1)
        return pred

    def fit(self):
        np.random.seed(0)

        for idx in range(self._num_classes):
            self._binary_clf[idx] = train(
                self._binary_clf[idx])  # fit(x_train, y_tmp)


def create_binary_clf(class_idx, C, X, Y, kernel):
    # Set initial values; the regularization parameter C is passed in by the
    # caller, and the SMO routines use the global tolerances `tol` and `eps`
    # defined above.
    m = len(X)
    initial_alphas = np.zeros(m)
    initial_b = 0.0

    Y_tmp = 1. * (Y == class_idx) - 1. * (Y != class_idx)

    # Instantiate model
    model = SMOModel(
        X, Y_tmp, C,
        kernel=kernel,
        alphas=initial_alphas,
        b=initial_b,
        errors=np.zeros(m)
    )

    # Initialize error cache
    initial_error = decision_function(model.alphas, model.Y, model.kernel,
                                      model.X, model.X, model.b) - model.Y
    model.errors = initial_error

    return model


def plot_decision_boundary_multiclass(solver, ax, Y, resolution=100):
    """Plots the model's decision boundary on the input axes object.
        Range of decision boundary grid is determined by the training data.
        Returns decision boundary grid and axes object (`grid`, `ax`)."""

    # Generate coordinate grid of shape [resolution x resolution]
    # and evaluate the model over the entire space
    model0 = solver._binary_clf[0]
    xrange = np.linspace(model0.X[:, 0].min(),
                         model0.X[:, 0].max(), resolution)
    yrange = np.linspace(model0.X[:, 1].min(),
                         model0.X[:, 1].max(), resolution)
    x, y = np.meshgrid(xrange, yrange)
    xy = np.array(list(map(np.ravel, [x, y]))).T  # shape=(num_samples, dim)
    grid = solver.predict(xy)
    grid = np.array(grid).reshape(len(xrange), len(yrange))

    # Plot decision contours using grid and
    # make a scatter plot of training data
    ax.contourf(xrange, yrange, grid, alpha=0.5)
    ax.scatter(model0.X[:, 0], model0.X[:, 1], c=Y)

    return grid, ax


solver = OneVsAll(
    solver=create_binary_clf,
    num_classes=3,
    C=10.,  # same value as in the binary SMO example above
    X=X_train,
    Y=Y_train,
    kernel=gaussian_kernel  # linear_kernel vs gaussian kernel
)

solver.fit()

fig, ax = plt.subplots()
grid, ax = plot_decision_boundary_multiclass(solver, ax, Y_train)
plt.show()


####################
(Figure: one-vs-all decision regions for the three iris classes over the two selected features, with the training data overlaid.)
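
As a quick, optional check of the one-vs-all solution, we can compare the predicted class indices with the training labels (the exact accuracy depends on the chosen kernel and \(C\)):

pred = solver.predict(X_train)
print("training accuracy:", np.mean(pred == Y_train))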

4.4. Gradient Descent Optimization of Soft Margin Classifier#

We can also directly solve the primal problem with gradient-based optimization, if we slightly reformulate it. This reformulation requires using the hinge loss:

\[\mathcal{L}_{hinge}(x) = \max(0, 1-x)\]
(Figure: plot of the hinge loss \(\max(0, 1-x)\).)

The input to the hinge loss is the “raw” output of the classifier, e.g. \(out = \omega^{\top} x + b\) for a linear SVM, multiplied by the correct output \(y\). Thus, the hinge loss of a single sample \(i\) for a linear SVM becomes

\[\mathcal{L}_i = \max \left( 0, \; 1 - y^{(i)}(\omega^{\top} x^{(i)} + b) \right).\]

To fully recover the Soft Margin Classifier, we simply add the squared L2 regularization to this loss, thus the total loss becomes

\[\mathcal{L} = \frac{1}{M}\sum_{i=1}^M \max \left( 0, \; 1 - y^{(i)}(\omega^{\top} x^{(i)} + b) \right) + \lambda ||w||_2^2.\]

Here, the parameter \(\lambda\) controls the amount of shrinkage applied to the coefficients to mitigate overfitting. It plays a role closely related, but inverse, to the \(C\) of the Soft Margin Classifier formalism: \(C\) weights the training error against the margin size, whereas \(\lambda\) weights the regularization (margin) term against the training error, so a large \(C\) corresponds to a small \(\lambda\).

If we want to use something like a kernel (although there is no dot product here to kernelize), the best we can do is to directly apply an input feature transformation \(x \to \varphi(x)\), which leads to the loss

\[\mathcal{L} = \frac{1}{M}\sum_{i=1}^M \max \left( 0, \; 1 - y^{(i)}(\omega^{\top} \varphi(x^{(i)}) + b) \right) + \lambda ||w||_2^2.\]

However, as you might remember, there is no practically useful \(\varphi\) corresponding to the RBF kernel; its feature map exists only as an infinite-dimensional expansion. Thus, the hinge loss approach is somewhat restrictive.

Note: there is a similar loss function corresponding to the logistic regression. See this for more details.

Visualization Utils

def visualize_torch(X, Y, model, linear=False):
    """
    based on 
    https://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html#sphx-glr-auto-examples-svm-plot-svm-margin-py
    """

    plt.figure(figsize=(6, 6))
    plt.scatter(x=X[:, 0], y=X[:, 1], c=Y, s=10)

    w = model.linear.weight.squeeze().detach().numpy()
    b = model.linear.bias.squeeze().detach().numpy()
    
    delta = 0.02

    if linear:

        # extend bounds by "delta" to improve the plot
        x_min = X[:, 0].min() - delta
        x_max = X[:, 0].max() + delta

        # solving $w0*x1 + w1*x2 + b = 0$ for $x2$ leads to $x2 = -w0/w1 * x1 - b/w1$
        a = -w[0] / w[1]
        xx = np.linspace(x_min, x_max, 50)
        yy = a * xx - b / w[1]

        # $margin = 1 / ||w||_2$
        # $ax+by+c=0$ is given by $|ax_p+by_p+c|/\sqrt{a^2+b^2}$ for a point $(x_p, y_p)$. As we set the
        # $ax+by+c=0$ is given by $|ax_p+by_p+c|/\sqrt{a^2+b^2}$. As we set the
        # functional margin to 1, i.e. $|ax_i+by_i+c|=1$ for a support vector
        # point, then the total margin becomes $1 / ||w||_2$.
        margin = 1 / np.sqrt(np.sum(w**2))
        yy_up = yy + np.sqrt(1+a**2) * margin
        yy_down = yy - np.sqrt(1+a**2) * margin

        plt.plot(xx, yy, "r-")
        plt.plot(xx, yy_up, "r--")
        plt.plot(xx, yy_down, "r--")
        
    else:
        x = np.arange(X[:, 0].min(), X[:, 0].max(), delta)
        y = np.arange(X[:, 1].min(), X[:, 1].max(), delta)
        x, y = np.meshgrid(x, y)
        xy = list(map(np.ravel, [x, y]))
        xy = torch.tensor(xy, dtype=torch.float32).T
        z = model(xy)
        z = z.detach().numpy().reshape(x.shape)
        
        cs0 = plt.contourf(x, y, z, alpha=0.6)
        plt.contour(cs0, '-', levels=[0], colors='r', linewidths=5)
        plt.plot(np.nan, label='decision boundary', color='r')
        plt.legend()
        
    plt.grid()
    plt.xlim([X[:, 0].min() + delta, X[:, 0].max() - delta])
    plt.ylim([X[:, 1].min() + delta, X[:, 1].max() - delta])
    plt.tight_layout()
    plt.show()

Let’s first define a base SVM class, a linear (identity) feature map, the hinge loss, and the squared L2 regularization over the weights.

class SupportVectorMachine(nn.Module):

    def __init__(self, input_size, phi):
        super().__init__()
        # X_train.shape should be (num_samples, dim x)
        self.input_size = input_size
        self.phi = phi
        self.linear = nn.Linear(self.input_size, 1)

    def forward(self, x):
        phi_x = self.phi(x)
        out = self.linear(phi_x)
        return out


class PhiLinear(nn.Module):
    
    def __init__(self):
        super().__init__()
        
    def forward(self, x):
        return x
    

def hinge_loss(y, out):
    """Hinge loss"""
    return torch.mean(torch.clamp(1 - y * out, min=0))


def sq_l2_reg(model):
    """Squared L2 regularization of weights"""
    return model.linear.weight.square().sum()
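
As a quick, illustrative sanity check of the hinge loss (hand-picked toy values, not part of the exercise): a prediction well beyond the margin contributes nothing, one inside the margin contributes \(1 - y \cdot out\), and a misclassified one contributes more than 1.

y_demo = torch.tensor([1., 1., -1.])
out_demo = torch.tensor([2.0, 0.3, 0.5])
print(hinge_loss(y_demo, out_demo))  # mean of [0.0, 0.7, 1.5] -> tensor(0.7333)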

Exercise

Implement a Radial Basis Function (RBF)-like feature map. After that, use your new feature map implementation to run a training process and visualize the results.

Hint: this reference.

class PhiRBF(nn.Module):
    """Something like a Radial Basis Function feature map
    Lifts the dimension from X.shape[-1] to X.shape[0]"""

    def __init__(self, X_train, gamma):
        super().__init__()
        self.X_train = X_train
        self.gamma = gamma
        
    def forward(self, x):
        ##############################
        # TODO: implement the forward methods
        # The choice of this specific phi is arbitrary and is not the true RBF.
        # We lift the input space from the space of `x` to a space of 
        # dimension equal to the number of training data points.
        out = self.X_train.repeat(x.size(0), 1, 1)
        out = torch.exp(-self.gamma * ((x[:, None] - out) ** 2).sum(dim=2))
        return out
        ##############################
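
A quick, illustrative shape check (dummy tensors, not part of the exercise): the feature map lifts each input from the input dimension to a dimension equal to the number of training points.

phi_demo = PhiRBF(X_train=torch.rand(10, 2), gamma=1.0)
print(phi_demo(torch.rand(4, 2)).shape)  # torch.Size([4, 10])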

Some preparation before we train the model

# prepare the data
X = torch.Tensor(X)
Y = torch.Tensor(Y)
N = len(Y)
torch.manual_seed(42)

# set hyperparameters
learning_rate = 0.1
epochs = 1000
batch_size = 100
reg_lambda = 0.001
phi_type = 'rbf'  # TODO: try both "lin" and "rbf"

if phi_type == 'lin':
  phi = PhiLinear() 
  input_size = X.shape[1]
elif phi_type == 'rbf':
  phi = PhiRBF(X_train=X, gamma=1.0) 
  input_size = X.shape[0]
  
# initialize model
model = SupportVectorMachine(input_size, phi) 
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

model.train()

# print initial parameters
for name, param in model.named_parameters():
  print(name, ": ", param.data)
linear.weight :  tensor([[ 0.0342,  0.0371, -0.0105,  ...,  0.0416, -0.0116, -0.0189]])
(500 randomly initialized weights, one per training point; full printout truncated)
linear.bias :  tensor([-0.0108])

Now, we can train the model.

We iterate over the data by following these two steps in each epoch:

  1. randomly permute all indices up to the number of training data points.

  2. iterate over the batches, picking the samples corresponding to the current slice of indices.

for epoch in range(epochs):
    random_nums = torch.randperm(N)

    # Iterate over the individual batches
    for i in range(0, N, batch_size):
        x = X[random_nums[i:i + batch_size]]
        y = Y[random_nums[i:i + batch_size]]

        optimizer.zero_grad()
        output = model(x)

        loss = hinge_loss(y.unsqueeze(1), output) + \
            reg_lambda * sq_l2_reg(model)
        loss.backward()
        optimizer.step()

    print('epoch {}, loss {}'.format(epoch, loss.item()))
# print final parameters
for name, param in model.named_parameters():
  print(name, ": ", param.data)
linear.weight :  tensor([[-4.3840e-02,  3.7154e-02, -1.8599e-02,  ...,  4.2853e-02, -5.6071e-02]])
(500 trained weights; full printout truncated)
linear.bias :  tensor([-0.3326])

In this formulation, all samples are used to fit the parameters. In contrast to the SMO solution, this might take longer to optimize, but it is more stable because we do not pick out individual samples and ignore all the others.

visualize_torch(X, Y, model=model)
(Figure: decision boundary of the gradient-descent SVM with the RBF-like feature map.)