4. Support Vector Machines#
At the end of this exercise you will know:
How to train an SVM using Sequential Minimal Optimization (SMO)
How to train an SVM using Gradient Descent (GD)
How different SVM Kernels perform
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
Summary of the mathematical formalism
Soft Margin SVM Lagrangian:
\[
\mathcal{L}(w, b, \xi, \alpha, \mu) = \frac{1}{2}\left\|w\right\|^{2} + C \sum_{i=1}^{m} \xi_{i} - \sum_{i=1}^{m} \alpha_{i}\left[y^{(i)}\left(\left\langle w, x^{(i)}\right\rangle + b\right) - 1 + \xi_{i}\right] - \sum_{i=1}^{m} \mu_{i} \xi_{i}
\]
Primal problem:
\[
\min_{w, b, \xi} \; \frac{1}{2}\left\|w\right\|^{2} + C \sum_{i=1}^{m} \xi_{i} \quad \text{s.t.} \quad y^{(i)}\left(\left\langle w, x^{(i)}\right\rangle + b\right) \geq 1 - \xi_{i}, \quad \xi_{i} \geq 0, \quad i=1,\dots,m
\]
Dual problem:
\[
\max_{\alpha} \; \sum_{i=1}^{m} \alpha_{i} - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_{i}\alpha_{j} y^{(i)} y^{(j)} K(x^{(i)}, x^{(j)}) \quad \text{s.t.} \quad 0 \leq \alpha_{i} \leq C, \quad \sum_{i=1}^{m} \alpha_{i} y^{(i)} = 0
\]
In the dual problem, we have used \(w=\sum_{i=1}^{m} \alpha_{i} y^{(i)} x^{(i)}\) and \(\sum_{i=1}^{m} \alpha_{i} y^{(i)}=0\) to eliminate \(w\) and \(b\). To then recover \(b\), we use the heuristic \(b^* = \frac{1}{m_{\Sigma}} \sum_{j=1}^{m_{\Sigma}}\left(y^{(j)}-\sum_{i=1}^{m_{\Sigma}} \alpha_{i}^{*} y^{(i)}K(x^{(i)}, x^{(j)})\right)\), averaged over the \(m_{\Sigma}\) support vectors. Note that only in the linear case does the kernel reduce to the plain inner product \(K(x^{(i)}, x^{(j)}) = \left\langle x^{(i)}, x^{(j)}\right\rangle\). Also note that the kernel appears only in the dual formulation, which is therefore where the kernel trick can be applied.
Note: for \(C \to \infty\) this problem reduces to the Hard Margin SVM.
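As a small illustration of the bias heuristic (not part of the original notebook; it assumes NumPy arrays `alphas`, `Y`, `X` and a `kernel(X1, X2)` function like the ones defined further below), \(b^*\) could be recovered as follows:
def bias_heuristic(alphas, Y, X, kernel):
    """b* averaged over the support vectors, i.e. the points with non-zero alpha."""
    sv = alphas > 1e-8                          # boolean mask of the support vectors
    K = kernel(X[sv], X[sv])                    # kernel matrix among the support vectors
    # b* = mean_j ( y_j - sum_i alpha_i y_i K(x_i, x_j) )
    return np.mean(Y[sv] - (alphas[sv] * Y[sv]) @ K)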
4.1. Artificial Dataset#
We use scikit-learn and make_blobs
to generate a binary dataset with input features \(x\in \mathbb{R}^2\) and labels \(y\in \{-1, +1\}\).
# X as features and Y as labels
X, Y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.6)
# by default the labels are {0, 1}, so we change them to {-1,1}
Y = np.where(Y==0, -1, 1)
# we also center the input data (per dimension) and scale it to unit variance to make training more efficient
X = (X - X.mean(axis=0))/X.std(axis=0)
plt.figure(figsize=(6, 6))
plt.scatter(x=X[:, 0], y=X[:, 1], c=Y)
4.2. Sequential Minimal Optimization (SMO)#
This algorithm was originally developed by John Platt in 1998 specifically for training SVMs. It solves the dual problem in a gradient-free manner: it selects two multipliers \(\alpha_i\) and \(\alpha_j\) and optimizes them analytically while keeping all other \(\alpha\)’s fixed, and then iteratively repeats this procedure over all \(\alpha\)’s. The efficiency lies in the heuristic used for selecting the two \(\alpha\) values, which is based on information from previous iterations. In the end we obtain a vector of \(m\) values of \(\alpha\), one per training data point; most of these values are \(0\), and only the non-zero ones contribute to the predictions made by the model.
We adapt the implementation of the SMO algorithm from this reference code by Jon Charest.
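Before diving into the full implementation, here is a minimal sketch (illustrative only, assuming the typical case \(\eta < 0\); it is not part of the reference code) of the core update for one pair of multipliers. The equality constraint \(\sum_{k} \alpha_k y^{(k)} = 0\) couples the two chosen multipliers, so \(\alpha_j\) is optimized analytically and clipped to its box bounds, and \(\alpha_i\) is adjusted to compensate:
def smo_pair_update(a_i, a_j, y_i, y_j, E_i, E_j, K_ii, K_jj, K_ij, C):
    # box bounds L, H on the new alpha_j implied by 0 <= alpha <= C
    if y_i != y_j:
        L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
    else:
        L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
    eta = 2.0 * K_ij - K_ii - K_jj               # curvature along the constraint line (assumed < 0)
    a_j_new = np.clip(a_j - y_j * (E_i - E_j) / eta, L, H)
    a_i_new = a_i + y_i * y_j * (a_j - a_j_new)  # keeps sum_k alpha_k y_k unchanged
    return a_i_new, a_j_new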
Visualization Utils
def plot_decision_boundary(model, ax, resolution=100, colors=('b', 'k', 'r'), levels=(-1, 0, 1)):
"""Plots the model's decision boundary on the input axes object.
Range of decision boundary grid is determined by the training data.
Returns decision boundary grid and axes object (`grid`, `ax`)."""
# Generate coordinate grid of shape [resolution x resolution]
# and evaluate the model over the entire space
xrange = np.linspace(model.X[:, 0].min(), model.X[:, 0].max(), resolution)
yrange = np.linspace(model.X[:, 1].min(),
model.X[:, 1].max(), resolution)
grid = [[decision_function(model.alphas, model.Y,
model.kernel, model.X,
np.array([xr, yr]), model.b) for xr in xrange] for yr in yrange]
grid = np.array(grid).reshape(len(xrange), len(yrange))
# Plot decision contours using grid and
# make a scatter plot of training data
ax.contour(xrange, yrange, grid, levels=levels, linewidths=(1, 1, 1),
linestyles=('--', '-', '--'), colors=colors)
ax.scatter(model.X[:, 0], model.X[:, 1],
c=model.Y, cmap=plt.cm.viridis, lw=0, alpha=0.25)
# Plot support vectors (non-zero alphas)
# as circled points (linewidth > 0)
mask = np.round(model.alphas, decimals=2) != 0.0
ax.scatter(model.X[mask, 0], model.X[mask, 1],
c=model.Y[mask], cmap=plt.cm.viridis, lw=1, edgecolors='k')
return grid, ax
As a first step, we define a generic SMO model
class SMOModel:
"""Container object for the model used for sequential minimal optimization."""
def __init__(self, X, Y, C, kernel, alphas, b, errors):
self.X = X # training data vector
self.Y = Y # class label vector
self.C = C # regularization parameter
self.kernel = kernel # kernel function
self.alphas = alphas # lagrange multiplier vector
self.b = b # scalar bias term
self.errors = errors # error cache used for selection of alphas
self._obj = [] # record of objective function value
self.m = len(self.X) # store size of training set
The next thing we need to define is the kernel. We start with the simplest one, the linear kernel. An implementation of the radial basis function (Gaussian) kernel is also provided for comparison.
def linear_kernel(x, y, b=1):
"""Returns the linear combination of arrays `x` and `y` with
the optional bias term `b` (set to 1 by default)."""
return x @ y.T + b # Note the @ operator for matrix multiplication
def gaussian_kernel(x, y, gamma=1):
"""Returns the gaussian similarity of arrays `x` and `y` with
kernel inverse width parameter `gamma` (set to 1 by default)."""
######################
# TODO: you might find this helpful: https://jonchar.net/notebooks/SVM/
if np.ndim(x) == 1 and np.ndim(y) == 1:
result = np.exp(- gamma * (np.linalg.norm(x - y, 2)) ** 2)
elif (np.ndim(x) > 1 and np.ndim(y) == 1) or (np.ndim(x) == 1 and np.ndim(y) > 1):
result = np.exp(- gamma * (np.linalg.norm(x - y, 2, axis=1) ** 2))
elif np.ndim(x) > 1 and np.ndim(y) > 1:
result = np.exp(- gamma * (np.linalg.norm(x[:, np.newaxis] -
y[np.newaxis, :], 2, axis=2) ** 2))
return result
#######################
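As a quick sanity check of the kernel shapes (not part of the original notebook; the arrays are arbitrary), both kernels return the full pairwise similarity matrix when given two batches of points:
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))           # 5 points in R^2
B = rng.normal(size=(3, 2))           # 3 points in R^2
print(linear_kernel(A, B).shape)      # (5, 3): pairwise <a_i, b_j> + 1
print(gaussian_kernel(A, B).shape)    # (5, 3): pairwise exp(-gamma * ||a_i - b_j||^2)
print(gaussian_kernel(A[0], B[0]))    # scalar similarity of two single points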
Now, using the dual problem formulation and a kernel, we define the objective and decision functions.
The decision function simply implements \(w^\top x - b\) (the implementation below follows Platt’s sign convention for the threshold \(b\)) by using the kernel trick and the relation \(w=\sum_{i=1}^{m} \alpha_{i} y^{(i)} x^{(i)}\).
# Objective function to optimize, i.e. loss function
def objective_function(alphas, target, kernel, X_train):
"""Returns the SVM objective function based in the input model defined by:
`alphas`: vector of Lagrange multipliers
`target`: vector of class labels (-1 or 1) for training data
`kernel`: kernel function
`X_train`: training data for model."""
return np.sum(alphas) - 0.5 * np.sum((target[:, None] * target[None, :]) * kernel(X_train, X_train) * (alphas[:, None] * alphas[None, :]))
# Decision function, i.e. forward model evaluation
def decision_function(alphas, target, kernel, X_train, x_test, b):
"""Applies the SVM decision function to the input feature vectors in `x_test`."""
result = (alphas * target) @ kernel(X_train, x_test) - b
return result
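For the linear kernel, the decision function can be cross-checked against the explicit weight vector (a small illustrative sketch, not part of the reference code): since \(\sum_i \alpha_i y^{(i)} = 0\), the constant bias term of `linear_kernel` cancels and the kernelized expression reduces to \(w^\top x - b\) with \(w=\sum_{i} \alpha_{i} y^{(i)} x^{(i)}\).
def explicit_weights(alphas, Y, X):
    """w = sum_i alpha_i y_i x_i; only meaningful for the linear kernel."""
    return (alphas * Y) @ X
# For a trained `model` with kernel=linear_kernel and test points `x_test`, these agree:
#   decision_function(model.alphas, model.Y, linear_kernel, model.X, x_test, model.b)
#   explicit_weights(model.alphas, model.Y, model.X) @ x_test.T - model.b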
The SMO algorithm
We are now ready to implement the SMO algorithm as given in Platt’s paper. The implementation is split into three functions: take_step, examine_example, and train.
train is the main training loop and also implements the selection of the first of the two \(\alpha\) values.
examine_example implements the selection of the second \(\alpha\) value.
take_step optimizes the two \(\alpha\) values, the bias \(b\), and the error cache.
def take_step(i1, i2, model):
# Skip if chosen alphas are the same
if i1 == i2:
return 0, model
alph1 = model.alphas[i1]
alph2 = model.alphas[i2]
y1 = model.Y[i1]
y2 = model.Y[i2]
E1 = model.errors[i1]
E2 = model.errors[i2]
s = y1 * y2
# Compute L & H, the bounds on new possible alpha values
if (y1 != y2):
L = max(0, alph2 - alph1)
H = min(model.C, model.C + alph2 - alph1)
elif (y1 == y2):
L = max(0, alph1 + alph2 - model.C)
H = min(model.C, alph1 + alph2)
if (L == H):
return 0, model
# Compute kernel & 2nd derivative eta
k11 = model.kernel(model.X[i1], model.X[i1])
k12 = model.kernel(model.X[i1], model.X[i2])
k22 = model.kernel(model.X[i2], model.X[i2])
eta = 2 * k12 - k11 - k22
# Compute new alpha 2 (a2) if eta is negative
if (eta < 0):
a2 = alph2 - y2 * (E1 - E2) / eta
# Clip a2 based on bounds L & H
if L < a2 < H:
a2 = a2
elif (a2 <= L):
a2 = L
elif (a2 >= H):
a2 = H
# If eta is non-negative, move new a2 to bound with greater objective function value
else:
alphas_adj = model.alphas.copy()
alphas_adj[i2] = L
# objective function output with a2 = L
Lobj = objective_function(alphas_adj, model.Y, model.kernel, model.X)
alphas_adj[i2] = H
# objective function output with a2 = H
Hobj = objective_function(alphas_adj, model.Y, model.kernel, model.X)
if Lobj > (Hobj + eps):
a2 = L
elif Lobj < (Hobj - eps):
a2 = H
else:
a2 = alph2
# Push a2 to 0 or C if very close
if a2 < 1e-8:
a2 = 0.0
elif a2 > (model.C - 1e-8):
a2 = model.C
# If examples can't be optimized within epsilon (eps), skip this pair
if (np.abs(a2 - alph2) < eps * (a2 + alph2 + eps)):
return 0, model
# Calculate new alpha 1 (a1)
a1 = alph1 + s * (alph2 - a2)
# Update threshold b to reflect newly calculated alphas
# Calculate both possible thresholds
b1 = E1 + y1 * (a1 - alph1) * k11 + y2 * (a2 - alph2) * k12 + model.b
b2 = E2 + y1 * (a1 - alph1) * k12 + y2 * (a2 - alph2) * k22 + model.b
# Set new threshold based on if a1 or a2 is bound by L and/or H
    if 0 < a1 < model.C:
        b_new = b1
    elif 0 < a2 < model.C:
        b_new = b2
# Average thresholds if both are bound
else:
b_new = (b1 + b2) * 0.5
# Update model object with new alphas & threshold
model.alphas[i1] = a1
model.alphas[i2] = a2
# Update error cache
# Error cache for optimized alphas is set to 0 if they're unbound
for index, alph in zip([i1, i2], [a1, a2]):
if 0.0 < alph < model.C:
model.errors[index] = 0.0
# Set non-optimized errors based on equation 12.11 in Platt's book
non_opt = [n for n in range(model.m) if (n != i1 and n != i2)]
model.errors[non_opt] = model.errors[non_opt] + \
y1*(a1 - alph1)*model.kernel(model.X[i1], model.X[non_opt]) + \
y2*(a2 - alph2) * \
model.kernel(model.X[i2], model.X[non_opt]) + model.b - b_new
# Update model threshold
model.b = b_new
return 1, model
def examine_example(i2, model):
y2 = model.Y[i2]
alph2 = model.alphas[i2]
E2 = model.errors[i2]
r2 = E2 * y2
# Proceed if error is within specified tolerance (tol)
if ((r2 < -tol and alph2 < model.C) or (r2 > tol and alph2 > 0)):
if len(model.alphas[(model.alphas != 0) & (model.alphas != model.C)]) > 1:
            # Use 2nd choice heuristic: choose the alpha that maximizes the error difference |E1 - E2|
if model.errors[i2] > 0:
i1 = np.argmin(model.errors)
elif model.errors[i2] <= 0:
i1 = np.argmax(model.errors)
step_result, model = take_step(i1, i2, model)
if step_result:
return 1, model
# Loop through non-zero and non-C alphas, starting at a random point
for i1 in np.roll(np.where((model.alphas != 0) & (model.alphas != model.C))[0],
np.random.choice(np.arange(model.m))):
step_result, model = take_step(i1, i2, model)
if step_result:
return 1, model
# loop through all alphas, starting at a random point
for i1 in np.roll(np.arange(model.m), np.random.choice(np.arange(model.m))):
step_result, model = take_step(i1, i2, model)
if step_result:
return 1, model
return 0, model
def train(model):
numChanged = 0
examineAll = 1 # loop over each alpha in first round
while (numChanged > 0) or (examineAll):
numChanged = 0
if examineAll:
# loop over all training examples
for i in range(model.alphas.shape[0]):
examine_result, model = examine_example(i, model)
numChanged += examine_result
if examine_result:
obj_result = objective_function(
model.alphas, model.Y, model.kernel, model.X)
model._obj.append(obj_result)
else:
# loop over examples where alphas are not already at their limits
for i in np.where((model.alphas != 0) & (model.alphas != model.C))[0]:
examine_result, model = examine_example(i, model)
numChanged += examine_result
if examine_result:
obj_result = objective_function(
model.alphas, model.Y, model.kernel, model.X)
model._obj.append(obj_result)
if examineAll == 1:
examineAll = 0
elif numChanged == 0:
examineAll = 1
return model
We are now ready to define the model (after defining some hyperparameters).
# Set model parameters and initial values
C = 10
m = len(X)
initial_alphas = np.zeros(m)
initial_b = 0.0
# Set tolerances
tol = 0.01 # error tolerance
eps = 0.01 # alpha tolerance
# Instantiate model
model = SMOModel(
X, Y, C,
kernel=gaussian_kernel, # TODO: try linear_kernel and gaussian_kernel
alphas=initial_alphas,
b=initial_b,
errors= np.zeros(m)
)
# Initialize error cache
initial_error = decision_function(model.alphas, model.Y, model.kernel,
model.X, model.X, model.b) - model.Y
model.errors = initial_error
np.random.seed(0)
output = train(model)
fig, ax = plt.subplots()
grid, ax = plot_decision_boundary(output, ax)
# loss curve
# note: we started with all alphas = 0 and turned some of them on one by one, and then refined.
plt.plot(model._obj)
4.3. Multiclass Classification with SVM and SMO#
We look at a problem we have seen before: the classification of the iris dataset. The task is to use two of the measured input features (“petal_length” and “petal_width”) and to build a classifier capable of distinguishing among the three possible flowers, which we index by [0, 1, 2].
Getting the data follows the same process as in the exercise on Linear and Logistic Regression, in the Logistic Regression section.
# get iris dataset
from urllib.request import urlretrieve
iris = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
urlretrieve(iris)
df0 = pd.read_csv(iris, sep=',')
# name columns
attributes = ["sepal_length", "sepal_width",
"petal_length", "petal_width", "class"]
df0.columns = attributes
# add species index
species = list(df0["class"].unique())
df0["class_idx"] = df0["class"].apply(species.index)
print("Count occurence of each class:")
print(df0["class"].value_counts())
# let's extract two of the features, and the indexed classes [0,1,2]
df = df0[["petal_length", "petal_width", "class_idx"]]
X_train = df[['petal_length', 'petal_width']].to_numpy()
Y_train = df['class_idx'].to_numpy()
print("Training data:")
print(df)
Count occurrence of each class:
class
Iris-versicolor 50
Iris-virginica 50
Iris-setosa 49
Name: count, dtype: int64
Training data:
petal_length petal_width class_idx
0 1.4 0.2 0
1 1.3 0.2 0
2 1.5 0.2 0
3 1.4 0.2 0
4 1.7 0.4 0
.. ... ... ...
144 5.2 2.3 2
145 5.0 1.9 2
146 5.2 2.0 2
147 5.4 2.3 2
148 5.1 1.8 2
[149 rows x 3 columns]
Exercise
Now, implement an SVM-based multi-class classifier that can be trained using the SMO algorithm. Compare with the solution presented below.
Hint: this repository and the one-vs-all classifier.
####################
class OneVsAll:
def __init__(self, solver, num_classes, **kwargs):
self._binary_clf = [solver(i, **kwargs) for i in range(num_classes)]
self._num_classes = num_classes
def predict(self, x):
n = x.shape[0]
scores = np.zeros((n, self._num_classes))
for idx in range(self._num_classes):
model = self._binary_clf[idx]
scores[:, idx] = decision_function(
model.alphas,
model.Y,
model.kernel,
model.X,
x,
model.b)
pred = np.argmax(scores, axis=1)
return pred
def fit(self):
np.random.seed(0)
for idx in range(self._num_classes):
self._binary_clf[idx] = train(
self._binary_clf[idx]) # fit(x_train, y_tmp)
def create_binary_clf(class_idx, C, X, Y, kernel):
# Set model parameters and initial values
C = 10.
m = len(X)
initial_alphas = np.zeros(m)
initial_b = 0.0
# Set tolerances
tol = 0.01 # error tolerance
eps = 0.01 # alpha tolerance
Y_tmp = 1. * (Y == class_idx) - 1. * (Y != class_idx)
# Instantiate model
model = SMOModel(
X, Y_tmp, C,
kernel=kernel,
alphas=initial_alphas,
b=initial_b,
errors=np.zeros(m)
)
# Initialize error cache
initial_error = decision_function(model.alphas, model.Y, model.kernel,
model.X, model.X, model.b) - model.Y
model.errors = initial_error
return model
def plot_decision_boundary_multiclass(solver, ax, Y, resolution=100):
"""Plots the model's decision boundary on the input axes object.
Range of decision boundary grid is determined by the training data.
Returns decision boundary grid and axes object (`grid`, `ax`)."""
# Generate coordinate grid of shape [resolution x resolution]
# and evaluate the model over the entire space
model0 = solver._binary_clf[0]
xrange = np.linspace(model0.X[:, 0].min(),
model0.X[:, 0].max(), resolution)
yrange = np.linspace(model0.X[:, 1].min(),
model0.X[:, 1].max(), resolution)
x, y = np.meshgrid(xrange, yrange)
xy = np.array(list(map(np.ravel, [x, y]))).T # shape=(num_samples, dim)
grid = solver.predict(xy)
grid = np.array(grid).reshape(len(xrange), len(yrange))
# Plot decision contours using grid and
# make a scatter plot of training data
ax.contourf(xrange, yrange, grid, alpha=0.5)
ax.scatter(model0.X[:, 0], model0.X[:, 1], c=Y)
return grid, ax
solver = OneVsAll(
solver=create_binary_clf,
num_classes=3,
C=1.,
X=X_train,
Y=Y_train,
kernel=gaussian_kernel # linear_kernel vs gaussian kernel
)
solver.fit()
fig, ax = plt.subplots()
grid, ax = plot_decision_boundary_multiclass(solver, ax, Y_train)
plt.show()
####################
4.4. Gradient Descent Optimization of Soft Margin Classifier#
We can also solve the primal problem directly with gradient-based optimization if we slightly reformulate it. This reformulation relies on the hinge loss
\[
\ell_{\text{hinge}}(y, \text{out}) = \max\left(0,\, 1 - y \cdot \text{out}\right).
\]
The input to the hinge loss is the “raw” output of the classifier, e.g. for a linear SVM \(\text{out} = \left\langle w, x\right\rangle + b\), multiplied with the correct label \(y\). Thus, the hinge loss of a single sample \(i\) for a linear SVM becomes
\[
\ell_{i} = \max\left(0,\, 1 - y^{(i)}\left(\left\langle w, x^{(i)}\right\rangle + b\right)\right).
\]
To fully recover the Soft Margin Classifier, we simply add a squared L2 regularization of the weights, so that the total loss becomes
\[
L(w, b) = \frac{1}{m}\sum_{i=1}^{m} \max\left(0,\, 1 - y^{(i)}\left(\left\langle w, x^{(i)}\right\rangle + b\right)\right) + \lambda \left\|w\right\|^{2}.
\]
Here, the parameter \(\lambda\) controls the amount of shrinkage applied to the coefficients to mitigate overfitting. This is fundamentally different from the \(C\) in the Soft Margin Classifier formalism, which is also a regularization parameter but instead controls the trade-off between the margin size and the training error.
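A tiny numeric example of the hinge term (illustrative, with arbitrary margin values): a sample classified correctly and outside the margin contributes nothing, a sample inside the margin contributes a small penalty, and a misclassified sample contributes a large one.
margins = np.array([2.0, 0.5, -1.0])      # values of y * (<w, x> + b) for three samples
print(np.maximum(0.0, 1.0 - margins))     # -> [0.  0.5 2. ]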
If we want to use something like a kernel (although there is no dot product to replace here), the best we can do is to directly apply an input feature transformation \(x \to \varphi(x)\), which leads to the loss
\[
L(w, b) = \frac{1}{m}\sum_{i=1}^{m} \max\left(0,\, 1 - y^{(i)}\left(\left\langle w, \varphi(x^{(i)})\right\rangle + b\right)\right) + \lambda \left\|w\right\|^{2}.
\]
However, as you might remember, there is no practically useful \(\varphi\) corresponding to the RBF kernel: one exists, but it corresponds to an infinitely long expansion. Thus, the hinge loss approach is somewhat restrictive.
Note: there is a similar loss function corresponding to the logistic regression. See this for more details.
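To see the similarity (an illustrative comparison, not part of the original notebook), one can evaluate both losses on a range of margins \(y \cdot \text{out}\): the hinge loss is exactly zero beyond the margin, while the logistic loss \(\log(1+e^{-y\,\text{out}})\) is smooth and only approaches zero.
margin = torch.linspace(-2, 2, 5)
hinge = torch.clamp(1 - margin, min=0)
logistic = torch.nn.functional.softplus(-margin)   # log(1 + exp(-margin))
print(torch.stack([margin, hinge, logistic], dim=1))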
Visualization Utils
def visualize_torch(X, Y, model, linear=False):
"""
based on
https://scikit-learn.org/stable/auto_examples/svm/plot_svm_margin.html#sphx-glr-auto-examples-svm-plot-svm-margin-py
"""
plt.figure(figsize=(6, 6))
plt.scatter(x=X[:, 0], y=X[:, 1], c=Y, s=10)
w = model.linear.weight.squeeze().detach().numpy()
b = model.linear.bias.squeeze().detach().numpy()
delta = 0.02
if linear:
# extend bounds by "delta" to improve the plot
x_min = X[:, 0].min() - delta
x_max = X[:, 0].max() + delta
        # solving $w0*x1 + w1*x2 + b = 0$ for $x2$ leads to $x2 = -(w0/w1)*x1 - b/w1$
a = -w[0] / w[1]
xx = np.linspace(x_min, x_max, 50)
yy = a * xx - b / w[1]
# $margin = 1 / ||w||_2$
        # Why? Recall that the distance between a point (x_p, y_p) and a line
# $ax+by+c=0$ is given by $|ax_p+by_p+c|/\sqrt{a^2+b^2}$. As we set the
# functional margin to 1, i.e. $|ax_i+by_i+c|=1$ for a support vector
# point, then the total margin becomes $1 / ||w||_2$.
margin = 1 / np.sqrt(np.sum(w**2))
yy_up = yy + np.sqrt(1+a**2) * margin
yy_down = yy - np.sqrt(1+a**2) * margin
plt.plot(xx, yy, "r-")
plt.plot(xx, yy_up, "r--")
plt.plot(xx, yy_down, "r--")
else:
x = np.arange(X[:, 0].min(), X[:, 0].max(), delta)
y = np.arange(X[:, 1].min(), X[:, 1].max(), delta)
x, y = np.meshgrid(x, y)
xy = list(map(np.ravel, [x, y]))
xy = torch.tensor(xy, dtype=torch.float32).T
z = model(xy)
z = z.detach().numpy().reshape(x.shape)
cs0 = plt.contourf(x, y, z, alpha=0.6)
        plt.contour(cs0, '-', levels=[0], colors='r', linewidths=5)
plt.plot(np.nan, label='decision boundary', color='r')
plt.legend()
plt.grid()
plt.xlim([X[:, 0].min() + delta, X[:, 0].max() - delta])
plt.ylim([X[:, 1].min() + delta, X[:, 1].max() - delta])
plt.tight_layout()
plt.show()
Let’s first define a base SVM module, an identity (“linear”) feature map, the hinge loss, and the squared L2 regularization over the weights.
class SupportVectorMachine(nn.Module):
def __init__(self, input_size, phi):
super().__init__()
# X_train.shape should be (num_samples, dim x)
self.input_size = input_size
self.phi = phi
self.linear = nn.Linear(self.input_size, 1)
def forward(self, x):
phi_x = self.phi(x)
out = self.linear(phi_x)
return out
class PhiLinear(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return x
def hinge_loss(y, out):
"""Hinge loss"""
return torch.mean(torch.clamp(1 - y * out, min=0))
def sq_l2_reg(model):
"""Squared L2 regularization of weights"""
return model.linear.weight.square().sum()
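As a quick check that these pieces compose into the total loss from above (illustrative; the dummy batch and the regularization weight 0.001 are arbitrary):
dummy_model = SupportVectorMachine(input_size=2, phi=PhiLinear())
x_dummy = torch.randn(4, 2)
y_dummy = torch.tensor([[1.0], [-1.0], [1.0], [-1.0]])
loss_dummy = hinge_loss(y_dummy, dummy_model(x_dummy)) + 0.001 * sq_l2_reg(dummy_model)
print(loss_dummy)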
Exercise
Implement the Radial Basis Function kernel. After that, use your new kernel implementation to run a training process and visualize results.
Hint: this reference.
class PhiRBF(nn.Module):
"""Something like a Radial Basis Function feature map
Lifts the dimension from X.shape[-1] to X.shape[0]"""
def __init__(self, X_train, gamma):
super().__init__()
self.X_train = X_train
self.gamma = gamma
def forward(self, x):
##############################
# TODO: implement the forward methods
# The choice of this specific phi is arbitrary and is not the true RBF.
# We lift the input space from the space of `x` to a space of
# dimension equal to the number of training data points.
out = self.X_train.repeat(x.size(0), 1, 1)
out = torch.exp(-self.gamma * ((x[:, None] - out) ** 2).sum(dim=2))
return out
##############################
Some preparation before we train the model
# prepare the data
X = torch.Tensor(X)
Y = torch.Tensor(Y)
N = len(Y)
torch.manual_seed(42)
# set hyperparameters
learning_rate = 0.1
epochs = 1000
batch_size = 100
reg_lambda = 0.001
phi_type = 'rbf' # TODO: try both "lin" and "rbf"
if phi_type == 'lin':
phi = PhiLinear()
input_size = X.shape[1]
elif phi_type == 'rbf':
phi = PhiRBF(X_train=X, gamma=1.0)
input_size = X.shape[0]
# initialize model
model = SupportVectorMachine(input_size, phi)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model.train()
# print initial parameters
for name, param in model.named_parameters():
print(name, ": ", param.data)
linear.weight : tensor([[ 0.0342,  0.0371, -0.0105,  ..., -0.0116, -0.0189]])
linear.bias : tensor([-0.0108])
Now, we can train the model.
We iterate over the data by following these two steps in each epoch:
randomly permute all indices up to the number of training data points;
iterate over all batches by picking the samples corresponding to the current slice of the permuted indices.
for epoch in range(epochs):
random_nums = torch.randperm(N)
# Iterate over the individual batches
for i in range(0, N, batch_size):
x = X[random_nums[i:i + batch_size]]
y = Y[random_nums[i:i + batch_size]]
optimizer.zero_grad()
output = model(x)
loss = hinge_loss(y.unsqueeze(1), output) + \
reg_lambda * sq_l2_reg(model)
loss.backward()
optimizer.step()
print('epoch {}, loss {}'.format(epoch, loss.item()))
# print final parameters
for name, param in model.named_parameters():
print(name, ": ", param.data)
linear.weight : tensor([[-4.3840e-02,  3.7154e-02, -1.8599e-02,  ...,  4.2853e-02, -5.6071e-02]])
linear.bias : tensor([-0.3326])
In this formulation, all samples are used to fit the parameters. In contrast to the SMO solution, this might take longer to optimize, but it is more stable because we do not update based on individual samples while ignoring all the others.
visualize_torch(X, Y, model=model)
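As a final sanity check (not part of the original notebook), one could also report the training accuracy of the fitted model:
with torch.no_grad():
    preds = torch.sign(model(X)).squeeze()
print("training accuracy:", (preds == Y).float().mean().item())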