Introduction

Contents

  1. AI vs ML vs DL
  2. Layers of a network are matrix-function sandwiches
  3. Neural networks are universal function approximators
  4. Softmax classification
  5. Gradient descent
  6. Backpropagation
  7. Double descent
  8. Conclusion

AI vs ML vs DL

AI vs ML vs DL (source: figshare)

  1. Artificial Intelligence (AI) is any software that makes decisions that are in some sense intelligent. This can be as simple as hard-coded expert heuristics. Simple example: a thermostat.
  2. Machine Learning (ML) is software that improves (is trained) when given data. Expert knowledge is often used to structure the model and choose which features of the data are used. Simple example: a least-squares fit to a linear model.
  3. Deep Learning (DL) is a recent paradigm of machine learning that uses large artificial neural networks with many layers for feature learning. The models are more of a black box that learns features from the raw data.

In the general machine learning setup, you have a model that can be thought of as a parameterized function, $f$, that takes in a vector of input data, $\vec{x}$, and returns a prediction, $y$. Internally, the model is parameterized by weights, $\vec{w}$.

$$ y = f(\vec{x}; \vec{w}) $$

In the setting of supervised learning, we also have a training dataset of pairs: input samples, $\vec{x}$, and the true labels, $\tilde{y}$, of what the prediction should be.
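As a concrete instance of this setup, here is a minimal sketch (in NumPy; the data and shapes are illustrative) of training the simple example mentioned above, a least-squares fit of a linear model:

```python
import numpy as np

# Toy supervised dataset: input samples x and true labels y_tilde,
# generated here from a known linear rule plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # 100 samples, 3 features
w_true = np.array([1.5, -2.0, 0.5])
y_tilde = X @ w_true + 0.1 * rng.normal(size=100)

# Least-squares training: choose w to minimize ||X w - y_tilde||^2.
w, *_ = np.linalg.lstsq(X, y_tilde, rcond=None)

# The trained model is then the parameterized function y = f(x; w).
def f(x, w=w):
    return x @ w

print(w)  # close to w_true
```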

Layers of a network are matrix-function sandwiches

Neural network layer:

$$ y_i = \phi(z_i) = \phi(w_{ij} x_{j} + b_{i}) $$

Example neural net, Bishop, p. 181

A multi-layer neural network is a composition of functions.

$$ f(\vec{x}; \vec{w}) = f_n \circ \cdots \circ f_2 \circ f_1(\vec{x}) = \phi_{n}( \ldots \phi_{2}(W_2 \phi_{1}(W_1 \vec{x} + \vec{b}_1 ) + \vec{b}_2) \ldots ) $$

A multi-layer neural network is a composition of functions. source: towardsdatascience.com

There are a variety of nonlinear functions used in practice.

Example activation functions, Bishop, p. 184

Note that we need the activation functions, $\phi$, to add a nonlinearity, because otherwise, if every layer were just a linear matrix multiplication, the layers could be collapsed into a single one:

$$ \vec{y} = W_{n} \ldots W_{2} W_{1} \vec{x} = W \vec{x} $$
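A minimal NumPy sketch of both points (shapes are illustrative): each layer is a nonlinearity applied to an affine map, and without the nonlinearity two stacked layers reduce to one matrix and one bias:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)

# Two layers, each of the form y = phi(W x + b):
y = relu(W2 @ relu(W1 @ x + b1) + b2)

# Without phi, the two layers collapse to a single affine map:
W, b = W2 @ W1, W2 @ b1 + b2
assert np.allclose(W2 @ (W1 @ x + b1) + b2, W @ x + b)
```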

See also:

Neural networks are universal function approximators

Neural networks need nonlinearities. ReLU is a simple example.

ReLU function from Prince (2023).

A neural network that uses ReLU activations is a piecewise-linear function.

ReLU composition from Prince (2023).

With enough linear pieces, such a piecewise-linear function can approximate any continuous function on a bounded domain to arbitrary accuracy. ⇒ Neural networks are universal function approximators.
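To make the piecewise-linear picture concrete, here is a minimal sketch (the target function and knot placement are illustrative) that builds a one-hidden-layer ReLU approximation of $x^2$ by hand, with one hidden unit per linear piece:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Approximate f(x) = x^2 on [0, 1] with a piecewise-linear
# interpolant built from shifted ReLUs (one hidden unit per knot).
target = lambda x: x**2
knots = np.linspace(0.0, 1.0, 6)

# The coefficient on relu(x - knot) is the change in slope there.
slopes = np.diff(target(knots)) / np.diff(knots)
coeffs = np.diff(slopes, prepend=0.0)

def approx(x):
    return target(knots[0]) + sum(
        c * relu(x - k) for c, k in zip(coeffs, knots[:-1])
    )

x = np.linspace(0.0, 1.0, 101)
print(np.max(np.abs(approx(x) - target(x))))  # small max error
```

Adding more knots (hidden units) shrinks the error, which is the intuition behind the universal approximation property.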

Softmax classification

TODO
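While this section is pending, here is a minimal sketch of the softmax function itself, which maps a vector of real-valued logits to a probability distribution over classes (the max subtraction is a standard numerical-stability trick):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max leaves the result unchanged but
    # avoids overflow in exp for large logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities summing to 1
```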

See also:

Gradient descent

source: 1805.04829

The workhorse algorithm for optimizing (training) model parameters is gradient descent:

$$ \vec{w}[t+1] = \vec{w}[t] - \eta \frac{\partial L}{\partial \vec{w}}[t] $$

In Stochastic Gradient Descent (SGD), you chunk the training data into minibatches (AKA batches), $\vec{x}_{it}$, and take a gradient descent step with each minibatch:

$$ \vec{w}[t+1] = \vec{w}[t] - \frac{\eta}{m} \sum_{i=1}^m \frac{\partial L}{\partial \vec{w}}[\vec{x}_{it}] $$

where

  • $t \in \mathbf{N}$ is the learning step number
  • $\eta$ is the learning rate
  • $m$ is the number of samples in a minibatch, called the batch size
  • $L$ is the loss function
  • $\frac{\partial L}{\partial \vec{w}}$ is the gradient
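A minimal NumPy sketch of this update loop (the model, loss, and data are illustrative stand-ins; the loss here is mean squared error on a linear model):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))                  # training inputs
w_true = np.array([1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)    # labels

w = np.zeros(3)       # weights w[t]
eta, m = 0.1, 32      # learning rate and batch size

for t in range(500):
    idx = rng.choice(len(X), size=m, replace=False)  # minibatch
    Xb, yb = X[idx], y[idx]
    # Gradient of the minibatch loss L = (1/m) sum (x.w - y)^2
    grad = 2.0 / m * Xb.T @ (Xb @ w - yb)
    w -= eta * grad   # w[t+1] = w[t] - eta * grad

print(w)  # approaches w_true
```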

Notice that the gradient will be noisier when the batch size is small. One might think this is bad, but in many cases some noise in the gradient turns out to be helpful, acting as a regularizer. Regularization is basically any technique that helps a model generalize, i.e. achieve better evaluation error. There is a large literature on how to scale the learning rate with the batch size.

There are many additions to SGD that are used in state-of-the-art optimizers:

  • SGD with momentum
  • RMSProp
  • Adam, AdamW
  • ...

Advanced optimizers track additional state per parameter (e.g. momentum or moment estimates), which adds memory overhead.
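For example, here is a minimal sketch of SGD with momentum: the velocity buffer `v` is an extra array the same size as the weights, exactly the kind of optimizer state that costs memory (Adam keeps two such buffers).

```python
import numpy as np

def sgd_momentum_step(w, grad, v, eta=0.01, beta=0.9):
    """One SGD-with-momentum update.

    v is persistent state the same shape as w -- the extra
    memory that plain SGD does not need.
    """
    v = beta * v + grad   # exponentially averaged gradient
    w = w - eta * v
    return w, v

# Usage: v starts at zero and is carried between steps.
w = np.zeros(3)
v = np.zeros_like(w)
grad = np.array([0.5, -1.0, 2.0])  # illustrative gradient
w, v = sgd_momentum_step(w, grad, v)
```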

Backpropagation

How do we calculate $\frac{\partial L}{\partial w_{i}}$?

Chain rule of calculus, case of scalar-valued functions with multiple inputs:

$$ \frac{d}{d x} f\left( g_1(x), \ldots, g_k(x) \right) = \sum_{j=1}^{k} \frac{\partial f}{\partial g_j} \frac{d g_j}{dx} $$

Using this to write the gradient with respect to some parameter:

$$ \frac{\partial L}{\partial w_i} = \sum_{j} \frac{\partial L}{\partial a_j} \frac{\partial a_j}{\partial w_i} = \sum_j \delta_j \frac{\partial a_j}{\partial w_i} $$

where

  • $\frac{\partial L}{\partial w_i}$ is the gradient
  • $a_j$ is the activation at that layer
  • $\delta_j \equiv \frac{\partial L}{\partial a_j}$ is the delta at that layer
  • $\frac{\partial a_j}{\partial w_i}$ is the rest of the derivative through the following layers, expanded recursively with the chain rule
```mermaid
graph LR;
    x1[x1]-->h1((h1))
    x1-->h2((h2))
    x1-->h3((h3))
    x1-->h4((h4))
    x2[x2]-->h1
    x2-->h2
    x2-->h3
    x2-->h4
    h1-->a1((a1))
    h1-->a2((a2))
    h2-->a1
    h2-->a2
    h3-->a1
    h3-->a2
    h4-->a1
    h4-->a2
    a1-->|logit a1|CEL[CrossEntropyLoss]
    a2-->|logit a2|CEL
    T[truth label]-->CEL
    CEL-->L
    L-->|backprop dL/da1|a1
```

In general, training involves:

  1. Forward pass: compute the model's prediction, $y = f(\vec{x}; \vec{w})$, caching the activations, $a_j$, along the way
  2. Loss: $L$ (e.g. cross entropy)
  3. Backward pass: $\delta_j = \frac{\partial L}{\partial a_j}$
  4. Weight update: $w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i}$
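A minimal sketch of these four steps for a tiny one-hidden-layer network (shapes and values are illustrative, and a squared-error loss is used instead of cross entropy for simplicity), with the backward pass written out via the chain rule rather than autodiff:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=4)                     # input
y_tilde = np.array([1.0, 0.0])             # truth label (target)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)
eta = 0.1

# 1. Forward pass, caching activations along the way
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)                   # ReLU
y = W2 @ a1 + b2                           # model output

# 2. Loss (squared error here for simplicity)
L = 0.5 * np.sum((y - y_tilde) ** 2)

# 3. Backward pass: deltas via the chain rule
delta2 = y - y_tilde                       # dL/dy
delta1 = (W2.T @ delta2) * (z1 > 0)        # dL/dz1

# 4. Weight update: w <- w - eta * dL/dw
W2 -= eta * np.outer(delta2, a1)
b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x)
b1 -= eta * delta1
```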

See also:

Double descent

Example training and test loss curves for a single training experiment:

Example training and test loss curves, source: google.

Early stopping is a method of picking the optimally trained model by stopping training when the test loss is at a minimum.
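A minimal sketch of the early-stopping logic over a hypothetical sequence of per-epoch test losses (a toy U-shaped curve; in practice you would snapshot the model weights at each new minimum):

```python
# Toy per-epoch test losses: fall, bottom out, then rise (overfitting).
test_losses = [1.0, 0.6, 0.4, 0.35, 0.37, 0.41, 0.48, 0.6]

best_loss = float("inf")
best_epoch = None
patience, bad_epochs = 2, 0

for epoch, loss in enumerate(test_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch   # snapshot weights here
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # stopped improving
            break

print(best_epoch, best_loss)  # epoch 3 with loss 0.35
```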

Now consider what happens if we run additional experiments with larger models. Note that in the following plot the curves are not losses for a single experiment, but the final losses of many different experiments with increasing model sizes.

In the traditional way of thinking about overfitting, the rising test loss with model complexity is part of the bias-variance tradeoff.

Bias-variance tradeoff (source: Wikipedia).

But what is actually observed with increasing model size is that the test loss eventually goes down again.

Double descent, source: 1912.02292.

Thought of as a function of model size, in the classical machine learning regime ($n_\mathrm{param} \ll n_\mathrm{data}$) there is an optimal model size that minimizes test loss. At larger model sizes, in the critical regime ($n_\mathrm{param} \sim n_\mathrm{data}$), the test loss rises again as part of the high-variance side of the classical bias-variance tradeoff. But at even larger model sizes ($n_\mathrm{param} \gg n_\mathrm{data}$), even past those that achieve zero training loss, larger models show better generalization. This is the success of deep learning.

See also:

Conclusion

  • Deep learning is a paradigm of machine learning that pushes the scale of the model and data.
  • Neural networks are universal function approximators.
  • SGD-like optimizers are the workhorses of deep learning.
  • TODO: Softmax classification.
  • Double descent is the surprising phenomenon that neural networks generalize better with more parameters.

See also

Pedagogy

Classical machine learning textbooks:

Deep learning textbooks:

Online courses: