Introduction

Contents

  1. AI vs ML vs DL
  2. Layers of a network are matrix-function sandwiches
  3. Neural networks are universal function approximators
  4. Softmax classification
  5. Gradient descent
  6. Backpropagation
  7. Double descent
  8. Conclusion

AI vs ML vs DL

AI vs ML vs DL (source: figshare)

  1. Artificial Intelligence (AI) is any software that makes decisions that are in some sense intelligent. This can be as simple as hard-coded expert heuristics. Simple example: a thermostat.
  2. Machine Learning (ML) is software that improves (is trained) when given data. Expert knowledge is often used to structure the model and choose which features of the data are used. Simple example: a least-squares fit to a linear model.
  3. Deep Learning (DL) is a recent paradigm of machine learning that uses large artificial neural networks with many layers for feature learning. The models are more of a black box that learns features from the raw data.

In the general machine learning setup, you have a model that can be thought of as a parameterized function, $f$, that takes in a vector of input data, $\vec{x}$, and returns a prediction, $y$. Internally, the model is parameterized by weights, $\vec{w}$.

$$ y = f(\vec{x}; \vec{w}) $$

In the setting of supervised learning, we also have a training dataset of pairs: input samples, $\vec{x}$, and the true labels, $\tilde{y}$, of what the prediction should be.
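As a concrete instance of this setup, here is a minimal sketch (in NumPy; the data and shapes are illustrative) of training the simple example mentioned above, a least-squares fit of a linear model:

```python
import numpy as np

# Toy supervised dataset: input samples x and true labels y_tilde,
# generated here from a known linear rule plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # 100 samples, 3 features
w_true = np.array([1.5, -2.0, 0.5])
y_tilde = X @ w_true + 0.1 * rng.normal(size=100)

# Least-squares training: choose w to minimize ||X w - y_tilde||^2.
w, *_ = np.linalg.lstsq(X, y_tilde, rcond=None)

# The trained model is then the parameterized function y = f(x; w).
def f(x, w=w):
    return x @ w

print(w)  # close to w_true
```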

Layers of a network are matrix-function sandwiches

Neural network layer:

$$ y_i = \phi(z_i) = \phi(w_{ij} x_{j} + b_{i}) $$

Example neural net, Bishop, p. 181

A multi-layer neural network is a composition of functions.

$$ f(\vec{x}; \vec{w}) = f_n \circ \cdots \circ f_2 \circ f_1(\vec{x}) = \phi_{n}( \ldots \phi_{2}(W_2 \phi_{1}(W_1 \vec{x} + \vec{b}_1 ) + \vec{b}_2) \ldots ) $$

A multi-layer neural network is a composition of functions. source: towardsdatascience.com

There are a variety of nonlinear functions used in practice.

Example activation functions, Bishop, p. 184

Note that we need the activation functions, $\phi$, to add a nonlinearity, because otherwise, if every layer were just a linear matrix multiplication, the layers could be collapsed into a single one:

$$ \vec{y} = W_{n} \ldots W_{2} W_{1} \vec{x} = W \vec{x} $$
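A minimal NumPy sketch of both points (shapes are illustrative): each layer is a nonlinearity applied to an affine map, and without the nonlinearity two stacked layers reduce to one matrix and one bias:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)

# Two layers, each of the form y = phi(W x + b):
y = relu(W2 @ relu(W1 @ x + b1) + b2)

# Without phi, the two layers collapse to a single affine map:
W, b = W2 @ W1, W2 @ b1 + b2
assert np.allclose(W2 @ (W1 @ x + b1) + b2, W @ x + b)
```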

See also:

Neural networks are universal function approximators

Neural networks need nonlinearities. ReLU is a simple example.

ReLU function from Prince (2023).

A neural network that uses ReLU activations is a piecewise-linear function.

ReLU composition from Prince (2023).

With enough linear pieces, such a piecewise-linear function can approximate any continuous function on a bounded domain to arbitrary accuracy. ⇒ Neural networks are universal function approximators.
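To make the piecewise-linear picture concrete, here is a minimal sketch (the target function and knot placement are illustrative) that builds a one-hidden-layer ReLU approximation of $x^2$ by hand, with one hidden unit per linear piece:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Approximate f(x) = x^2 on [0, 1] with a piecewise-linear
# interpolant built from shifted ReLUs (one hidden unit per knot).
target = lambda x: x**2
knots = np.linspace(0.0, 1.0, 6)

# The coefficient on relu(x - knot) is the change in slope there.
slopes = np.diff(target(knots)) / np.diff(knots)
coeffs = np.diff(slopes, prepend=0.0)

def approx(x):
    return target(knots[0]) + sum(
        c * relu(x - k) for c, k in zip(coeffs, knots[:-1])
    )

x = np.linspace(0.0, 1.0, 101)
print(np.max(np.abs(approx(x) - target(x))))  # small max error
```

Adding more knots (hidden units) shrinks the error, which is the intuition behind the universal approximation property.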

Softmax classification

TODO
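While this section is pending, here is a minimal sketch of the softmax function itself, which maps a vector of real-valued logits to a probability distribution over classes (the max subtraction is a standard numerical-stability trick):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max leaves the result unchanged but
    # avoids overflow in exp for large logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities summing to 1
```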

See also:

Gradient descent

source: 1805.04829

The workhorse algorithm for optimizing (training) model parameters is gradient descent:

$$ \vec{w}[t+1] = \vec{w}[t] - \eta \frac{\partial L}{\partial \vec{w}}[t] $$

In Stochastic Gradient Descent (SGD), you chunk the training data into minibatches (AKA batches), $\vec{x}_{it}$, and take a gradient descent step with each minibatch:

$$ \vec{w}[t+1] = \vec{w}[t] - \frac{\eta}{m} \sum_{i=1}^m \frac{\partial L}{\partial \vec{w}}[\vec{x}_{it}] $$

where

  • $t \in \mathbf{N}$ is the learning step number
  • $\eta$ is the learning rate
  • $m$ is the number of samples in a minibatch, called the batch size
  • $L$ is the loss function
  • $\frac{\partial L}{\partial \vec{w}}$ is the gradient
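A minimal NumPy sketch of this update loop (the model, loss, and data are illustrative stand-ins; the loss here is mean squared error on a linear model):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))                  # training inputs
w_true = np.array([1.0, -2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)    # labels

w = np.zeros(3)       # weights w[t]
eta, m = 0.1, 32      # learning rate and batch size

for t in range(500):
    idx = rng.choice(len(X), size=m, replace=False)  # minibatch
    Xb, yb = X[idx], y[idx]
    # Gradient of the minibatch loss L = (1/m) sum (x.w - y)^2
    grad = 2.0 / m * Xb.T @ (Xb @ w - yb)
    w -= eta * grad   # w[t+1] = w[t] - eta * grad

print(w)  # approaches w_true
```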

Notice that the gradient will be noisier when the batch size is small. One might think this is bad, but in many cases some noise in the gradient turns out to be helpful, acting as a regularizer. Regularization is basically any technique that helps a model generalize, i.e. achieve better evaluation error. There is a large literature on how to scale the learning rate with the batch size.

There are many additions to SGD that are used in state-of-the-art optimizers:

  • SGD with momentum
  • RMSProp
  • Adam, AdamW
  • ...

Advanced optimizers track additional state per parameter (e.g. momentum or moment estimates), which adds memory overhead.
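For example, here is a minimal sketch of SGD with momentum: the velocity buffer `v` is an extra array the same size as the weights, exactly the kind of optimizer state that costs memory (Adam keeps two such buffers).

```python
import numpy as np

def sgd_momentum_step(w, grad, v, eta=0.01, beta=0.9):
    """One SGD-with-momentum update.

    v is persistent state the same shape as w -- the extra
    memory that plain SGD does not need.
    """
    v = beta * v + grad   # exponentially averaged gradient
    w = w - eta * v
    return w, v

# Usage: v starts at zero and is carried between steps.
w = np.zeros(3)
v = np.zeros_like(w)
grad = np.array([0.5, -1.0, 2.0])  # illustrative gradient
w, v = sgd_momentum_step(w, grad, v)
```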

Backpropagation

How do we calculate $\frac{\partial L}{\partial w_{i}}$?

Chain rule of calculus, case of scalar-valued functions with multiple inputs:

$$ \frac{d}{d x} f\left( g_1(x), \ldots, g_k(x) \right) = \sum_{j=1}^{k} \frac{\partial f}{\partial g_j} \frac{d g_j}{dx} $$

Using this to write the gradient with respect to some parameter:

$$ \frac{\partial L}{\partial w_i} = \sum_{j} \frac{\partial L}{\partial a_j} \frac{\partial a_j}{\partial w_i} = \sum_j \delta_j \frac{\partial a_j}{\partial w_i} $$

where

  • $\frac{\partial L}{\partial w_i}$ is the gradient
  • $a_j$ is the activation at that layer
  • $\delta_j \equiv \frac{\partial L}{\partial a_j}$ is the delta at that layer
  • $\frac{\partial a_j}{\partial w_i}$ is the rest of the derivative through the following layers, expanded recursively with the chain rule
```mermaid
graph LR;
    x1[x1]-->h1((h1))
    x1-->h2((h2))
    x1-->h3((h3))
    x1-->h4((h4))
    x2[x2]-->h1
    x2-->h2
    x2-->h3
    x2-->h4
    h1-->a1((a1))
    h1-->a2((a2))
    h2-->a1
    h2-->a2
    h3-->a1
    h3-->a2
    h4-->a1
    h4-->a2
    a1-->|logit a1|CEL[CrossEntropyLoss]
    a2-->|logit a2|CEL
    T[truth label]-->CEL
    CEL-->L
    L-->|backprop dL/da1|a1
```

In general, training involves:

  1. Forward pass: compute the model's prediction, $y = f(\vec{x}; \vec{w})$, caching the activations, $a_j$, along the way
  2. Loss: $L$ (e.g. cross entropy)
  3. Backward pass: $\delta_j = \frac{\partial L}{\partial a_j}$
  4. Weight update: $w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i}$
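A minimal sketch of these four steps for a tiny one-hidden-layer network (shapes and values are illustrative, and a squared-error loss is used instead of cross entropy for simplicity), with the backward pass written out via the chain rule rather than autodiff:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=4)                     # input
y_tilde = np.array([1.0, 0.0])             # truth label (target)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)
eta = 0.1

# 1. Forward pass, caching activations along the way
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)                   # ReLU
y = W2 @ a1 + b2                           # model output

# 2. Loss (squared error here for simplicity)
L = 0.5 * np.sum((y - y_tilde) ** 2)

# 3. Backward pass: deltas via the chain rule
delta2 = y - y_tilde                       # dL/dy
delta1 = (W2.T @ delta2) * (z1 > 0)        # dL/dz1

# 4. Weight update: w <- w - eta * dL/dw
W2 -= eta * np.outer(delta2, a1)
b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x)
b1 -= eta * delta1
```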

See also:

Double descent

Example training and test loss curves for a single training experiment:

Example training and test loss curves, source: google.

Early stopping is a method of picking the optimally trained model by stopping training when the test loss is at a minimum.
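A minimal sketch of the early-stopping logic over a hypothetical sequence of per-epoch test losses (a toy U-shaped curve; in practice you would snapshot the model weights at each new minimum):

```python
# Toy per-epoch test losses: fall, bottom out, then rise (overfitting).
test_losses = [1.0, 0.6, 0.4, 0.35, 0.37, 0.41, 0.48, 0.6]

best_loss = float("inf")
best_epoch = None
patience, bad_epochs = 2, 0

for epoch, loss in enumerate(test_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch   # snapshot weights here
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # stopped improving
            break

print(best_epoch, best_loss)  # epoch 3 with loss 0.35
```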

Now consider what happens if we run additional experiments with larger models. Note that in the following plot the curves are not losses for a single experiment, but the final losses of many different experiments with increasing model sizes.

In the traditional way of thinking about overfitting, the rising test loss with model complexity is part of the bias-variance tradeoff.

Bias-variance tradeoff (source: Wikipedia).

But what is actually observed with increasing model size is that the test loss eventually goes down again.

Double descent, source: 1912.02292.

Thought of as a function of model size, in the classical machine learning regime ($n_\mathrm{param} \ll n_\mathrm{data}$) there is an optimal model size that minimizes test loss. At larger model sizes, in the critical regime ($n_\mathrm{param} \sim n_\mathrm{data}$), the test loss rises again as part of the high-variance side of the classical bias-variance tradeoff. But at even larger model sizes ($n_\mathrm{param} \gg n_\mathrm{data}$), even past those that achieve zero training loss, larger models show better generalization. This is the success of deep learning.

See also:

Conclusion

  • Deep learning is a paradigm of machine learning that pushes the scale of the model and data.
  • Neural networks are universal function approximators.
  • SGD-like optimizers are the workhorses of deep learning.
  • TODO: Softmax classification.
  • Double descent is the surprising phenomenon that neural networks generalize better with more parameters.

See also

Pedagogy

Classical machine learning textbooks:

Deep learning textbooks:

Online courses: