9. Multilayer Perceptron

9.1. Limitations of Linear Regression

Going back to the lecture Tricks of Optimization, we discussed that the most general linear model can be any linear combination of \(x\)-values lifted to a predefined basis space \(\varphi(x)\), e.g., a basis of polynomial, exponential, sine, or tanh functions:

(9.1)#\[h(x)= \vartheta^{\top} \varphi(x) = \vartheta_0 + \vartheta_1 \cdot x + \vartheta_2 \cdot x^2 + \vartheta_3 \cdot \exp(x) + \vartheta_4 \cdot \sin(x) + \vartheta_5 \cdot \tanh(x) + \dots\]

We also saw that through hyperparameter tuning, one can find a basis that captures the underlying true relation between inputs and outputs well. In fact, for many practical problems, the approach of manually selecting meaningful basis functions and then training a linear model will be good enough, especially if we know something about the data beforehand.
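As a minimal sketch of such a linear model with hand-picked basis functions, the snippet below lifts scalar inputs to the basis of Eq. (9.1) and fits \(\vartheta\) by least squares. The helper name `lift`, the synthetic target function, and the noise level are assumptions made purely for illustration.

```python
import numpy as np

def lift(x):
    """Map scalar inputs to the basis [1, x, x^2, exp(x), sin(x), tanh(x)] of Eq. (9.1)."""
    x = np.asarray(x, dtype=float)
    return np.stack(
        [np.ones_like(x), x, x**2, np.exp(x), np.sin(x), np.tanh(x)], axis=1
    )

rng = np.random.default_rng(0)
x_train = np.linspace(-2.0, 2.0, 50)
# Hypothetical ground truth that lies in the span of the chosen basis, plus small noise.
y_train = 0.5 * x_train**2 + np.sin(x_train) + 0.01 * rng.standard_normal(x_train.shape)

Phi = lift(x_train)                                     # design matrix, shape (50, 6)
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)   # least-squares fit of theta

def h(x):
    """Linear model in the lifted space: h(x) = theta^T phi(x)."""
    return lift(x) @ theta

train_mse = float(np.mean((h(x_train) - y_train) ** 2))
```

The fit works well here only because the basis happens to contain the terms of the target function, which is exactly the prior knowledge we often lack.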

The problem is that for many tasks, we don’t have such a priori information, and exploring the space of all possible combinations of basis functions leads to an intractable combinatorial problem. Especially if our datasets are large, e.g., ImageNet, it is unrealistic to think that we can manually transform the inputs to a linearly separable space.

../_images/under_overfitting.png

Fig. 9.1 Under- and overfitting (Source: Techniques for handling underfitting and overfitting in Machine Learning)

Q: If linear models are not enough, what else is there?

Deep Learning is the field that tries to systematically explore the space of possible non-linear relations \(h\) between input \(x\) and output \(y\). As a reminder, a relation that is non-linear in its parameters \(\vartheta\) can be, for example,

(9.2)#\[ h(x) = x^{\vartheta_0} + \max\{0, \vartheta_1 \cdot x\} + ... \]

In the scope of this class, we will look at the most popular and successful non-linear building blocks. We will see the Multilayer Perceptron, Convolutional Layer, and Recurrent Neural Network.

9.2. Perceptron

The perceptron is a binary linear classifier and a single-layer neural network.

../_images/perceptron.png

Fig. 9.2 The perceptron (Source: What the Hell is Perceptron?)

The perceptron generalizes the linear hypothesis \(\vartheta^{\top} x\) by subjecting it to a step function \(f\) as

(9.3)#\[h(x) = f(\vartheta^{\top}x).\]

Note: The term \(\vartheta^{\top} x\) is actually an affine transformation, not just a linear one. For brevity, we absorb the bias \(b\) into the parameter vector as \(\vartheta_0 = b\), with a constant 1 prepended to \(x\).

In the case of two-class classification, we use the sign function

(9.4)#\[\begin{split}f(a)= \left\{\begin{array}{l} +1 \quad \text{if } a\ge 0, \\ -1 \quad \text{else.} \end{array}\right.\end{split}\]

\(f(a)\) is called an activation function, as it represents a simple model of how neurons respond to input stimuli. Other common activation functions are the sigmoid, tanh, and ReLU (\(=\max(0,a)\)).

../_images/activation_functions.png

Fig. 9.3 Activation functions (Source: Introduction to Different Activation Functions for Deep Learning)
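The perceptron of Eq. (9.3) and the activation functions mentioned above can be sketched in a few lines of NumPy. The function names and the example parameters (which happen to implement a logical AND on inputs in \(\{0, 1\}\)) are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def sign(a):
    """Step activation from Eq. (9.4): +1 if a >= 0, else -1."""
    return np.where(a >= 0, 1.0, -1.0)

def sigmoid(a):
    """Logistic sigmoid, squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    """Rectified linear unit, max(0, a) applied elementwise."""
    return np.maximum(0.0, a)

def perceptron(x, theta, f=sign):
    """Eq. (9.3): h(x) = f(theta^T x), with the bias stored in theta[0]."""
    x_aug = np.concatenate(([1.0], x))  # prepend a constant 1 so theta[0] acts as the bias
    return f(theta @ x_aug)

# Hypothetical parameters: bias -1.5 and weights (1, 1).
theta = np.array([-1.5, 1.0, 1.0])
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [float(perceptron(np.array(x, dtype=float), theta)) for x in inputs]
# outputs is [-1.0, -1.0, -1.0, 1.0]: only the input (1, 1) lands on the positive side.
```

Swapping `f=sign` for `sigmoid` or `np.tanh` keeps the same affine part and only changes how the pre-activation \(a = \vartheta^{\top} x\) is squashed.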

9.3. Multilayer Perceptron (MLP)

If we stack multiple perceptrons one after another, each with a user-defined dimension of the intermediate (a.k.a. latent or hidden) space, we get a multilayer perceptron.

../_images/mlp.png

Fig. 9.4 Multilayer Perceptron (Source: Training Deep Neural Networks)

We could write down the stack of such layer-to-layer transformations as

(9.5)#\[h(x) = W_4 f_3 ( W_3 f_2(W_2 f_1(W_1 \mathbf{x}))).\]

In the image above, the black line connecting entry \(i\) of the input with entry \(j\) of hidden layer 1 corresponds to the entry in row \(j\) and column \(i\) of the learnable weight matrix \(W_{1}\), because \(W_1\) multiplies \(\mathbf{x}\) from the left. Note that the output layer does not have an activation function: in the case of regression, we typically stop with the linear transformation, and in the case of classification, we typically apply the softmax function to the output vector.
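The forward pass of Eq. (9.5) can be sketched as a loop over weight matrices. The layer sizes, the choice of ReLU for every \(f_i\), and the random initialization below are assumptions for illustration; biases are omitted, as in Eq. (9.5).

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(z):
    """Numerically stable softmax for a 1-D vector of class scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mlp_forward(x, weights, activation=relu):
    """Eq. (9.5): apply W and the activation layer by layer, no activation after the last W."""
    h = x
    for W in weights[:-1]:
        h = activation(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
layer_sizes = [3, 5, 5, 5, 2]  # hypothetical: 3 inputs, three hidden layers of width 5, 2 outputs
# Each W has shape (n_out, n_in), so W[j, i] connects input entry i to output entry j.
weights = [
    rng.standard_normal((n_out, n_in))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

x = rng.standard_normal(3)
y = mlp_forward(x, weights)   # raw output, as used for regression
probs = softmax(y)            # class probabilities, as used for classification
```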

It is crucial to have non-linear activation functions. Why? Simply composing linear functions results in a new linear function! You can see this immediately by removing the activations \(f_i\) in Eq. (9.5): the product \(W_4 W_3 W_2 W_1\) is just a single matrix.
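A small numerical check makes this concrete. The matrices and the input below are hand-picked, hypothetical values; without an activation the two layers collapse into a single matrix, while a ReLU in between changes the result.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 0.0])

# Two linear layers without activation ...
stacked = W2 @ (W1 @ x)
# ... equal one linear layer with the merged matrix W2 W1.
collapsed = (W2 @ W1) @ x

# With a ReLU in between, the network is no longer equivalent to the merged matrix.
with_relu = W2 @ np.maximum(0.0, W1 @ x)

print(stacked, collapsed, with_relu)  # [0.] [0.] [1.]
```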

By the Universal Approximation Theorem, an MLP with a single hidden layer and any “squashing” activation function (i.e., \(h(x)=W_2 f(W_1 x)\)) can approximate essentially any function \(h: x \to y\) to arbitrary accuracy, provided it has enough hidden units. More on that in [Goodfellow et al., 2016], Section 6.4.1. However, empirically we see improved performance when we stack multiple layers, which makes the depth (number of hidden layers) and width (dimension of hidden layers) of a neural network two of its most important hyperparameters.

Exercise: Learning XOR

Find the simplest MLP capable of learning the XOR function, and fit its parameters.

../_images/xor_function.png

Fig. 9.5 XOR function (Source: [Goodfellow et al., 2016], Section 6.1)
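As a starting point for the exercise, and without giving away the solution, the sketch below checks that a purely linear model cannot represent XOR. Using a least-squares fit on the four XOR points is an assumption made for illustration; the fit ends up predicting 0.5 everywhere.

```python
import numpy as np

# The four XOR input points and their labels.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Linear model with bias: prepend a column of ones so theta[0] is the bias.
Phi = np.hstack([np.ones((4, 1)), X])
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

pred = Phi @ theta
print(theta)  # approximately [0.5, 0, 0]: zero weights, bias 0.5
print(pred)   # approximately 0.5 for every input
```

Any MLP that solves the exercise therefore needs at least one hidden layer with a non-linear activation.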

9.4. Further References