- AI vs ML vs DL
- Layers of a network are matrix-function sandwiches
- Neural networks are universal function approximators
- Softmax classification
- Gradient descent
- Backpropagation
- Double descent
- Conclusion
- Artificial Intelligence is any kind of software that makes intelligent decisions in some sense. This could be as simple as hard-coded expert heuristics. Simple example: thermostat.
- Machine Learning is a kind of software that somehow improves (is trained) when given data. Expert knowledge is often used to structure the model and what features of the data are used. Simple example: Least squares fit to a linear model.
- Deep Learning is a recent paradigm of machine learning using large artificial neural networks, with many layers for feature learning. Models are more of a black box that learns features from the raw data.
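To make the "least squares fit to a linear model" example concrete, here is a minimal sketch in plain Python. It assumes a one-dimensional model `y ≈ slope * x + intercept`; the function name `fit_line` and the toy data are made up for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy training data generated from y = 2x + 1 (no noise)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

The two fitted numbers are the parameters that "improve when given data"; the choice of a straight line is the expert-supplied structure.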
In the general machine learning setup, you have a model that can be
thought of as a parameterized function, $y = f(x; w)$.
In the setting of supervised learning, we have a training dataset
of pairs of input samples and their truth labels.
Neural network layer:
A multi-layer neural network is a composition of functions.
There are a variety of nonlinear functions used in practice.
Note that we need the activation functions to be nonlinear: without them, a stack of linear layers collapses into a single linear map.
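A small sketch of why the nonlinearity matters, assuming each layer has the form `phi(W x)` with biases omitted for brevity. The weight values and the helper names (`matvec`, `matmul`, `relu`) are made up for illustration.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x_k for w, x_k in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product A @ B for lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Elementwise ReLU activation."""
    return [max(0.0, v_k) for v_k in v]


W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# Two layers with no activation in between ...
stacked_linear = matvec(W2, matvec(W1, x))
# ... equal one layer whose matrix is the product W2 W1.
collapsed = matvec(matmul(W2, W1), x)
# With ReLU in the middle of the sandwich, the result differs.
with_relu = matvec(W2, relu(matvec(W1, x)))
```

Here `stacked_linear` and `collapsed` agree, so depth adds nothing without an activation, while `with_relu` gives a different output for the same input.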
See also:
- Bishop, C.M. (2024). Deep Learning: Foundations and Concepts.
- Bradley, T.D. (2019). Matrices as tensor network diagrams.
Neural networks need nonlinearities. ReLU is a simple example.
A neural network that uses ReLU activations is a piecewise linear function.
⇒ Neural networks are universal function approximators.
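To illustrate the piecewise-linear claim, here is a sketch of a tiny one-input network with three ReLU hidden units. The weights are hand-picked (an assumption for illustration) so that the output is a "tent" shape; the names `tiny_net` and `slope` are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def tiny_net(x):
    """One hidden layer of three ReLU units, followed by a linear output."""
    hidden = [relu(x), relu(x - 1.0), relu(x - 2.0)]
    output_weights = [1.0, -2.0, 1.0]
    return sum(v * h for v, h in zip(output_weights, hidden))


def slope(x_left, x_right):
    """Slope of the network output between two inputs."""
    return (tiny_net(x_right) - tiny_net(x_left)) / (x_right - x_left)


# The slope is constant between the kinks at x = 0, 1, 2
slopes = [
    slope(-1.0, -0.5),   # flat region left of 0
    slope(0.25, 0.75),   # rising segment
    slope(1.25, 1.75),   # falling segment
    slope(2.5, 3.5),     # flat region right of 2
]
```

Each hidden unit contributes one kink, so adding units adds linear pieces, which is the intuition behind approximating a curve with many short segments.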
- Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators.
- Prince, S.J.D. (2023). Understanding Deep Learning.
TODO
See also:
- Roelants, P. (2019). Softmax classification with cross-entropy.
The workhorse algorithm for optimizing (training) model parameters is gradient descent:
In Stochastic Gradient Descent (SGD), you chunk the training data into minibatches (AKA batches),
where
- $t \in \mathbf{N}$ is the learning step number
- $\eta$ is the learning rate
- $m$ is the number of samples in a minibatch, called the batch size
- $L$ is the loss function
- $\frac{\partial L}{\partial \vec{w}}$ is the gradient
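A minimal sketch of minibatch SGD using the symbols listed above. It assumes the common form of the update, in which the gradient is averaged over the $m$ samples of a minibatch and the weights move by $-\eta$ times that average. The one-parameter model `y ≈ w * x`, the squared-error loss, and the function name `sgd_fit` are assumptions chosen to keep the example short.

```python
import random


def sgd_fit(xs, ys, lr=0.1, batch_size=4, epochs=50, seed=0):
    """Fit y ≈ w * x by minibatch SGD on the loss L = 0.5 * (w * x - y) ** 2."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random minibatches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient dL/dw, averaged over the m samples in the minibatch
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            # Update: w_{t+1} = w_t - eta * (averaged gradient)
            w -= lr * grad
    return w


xs = [i / 10 for i in range(1, 21)]
ys = [3.0 * x for x in xs]  # noise-free toy data from y = 3x
w = sgd_fit(xs, ys)
```

With a smaller `batch_size`, each `grad` is an average over fewer samples and so fluctuates more from step to step.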
Notice that the gradient will be noisier when the batch size is small. One might think this is bad, but in many cases some noise in the gradient helps by acting as a regularizer. Regularization is any technique that helps a model generalize, i.e. achieve lower evaluation error. There is a large body of literature on how to adjust the learning rate with batch size.
There are many additions to SGD that are used in state-of-the-art optimizers:
- SGD with momentum
- RMSProp
- Adam, AdamW
- ...
Advanced optimizers keep extra state for every model parameter (e.g. momentum buffers), which adds to the memory overhead.
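As a sketch of where the extra memory goes, here is SGD with momentum in one common formulation (velocity `v = beta * v + grad`, then `w = w - lr * v`). The function name `momentum_step`, the hyperparameter values, and the toy quadratic loss are assumptions for illustration.

```python
def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """One SGD-with-momentum step; `velocity` is the optimizer's extra state."""
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    w = [w_i - lr * v for w_i, v in zip(w, velocity)]
    return w, velocity


w = [1.0, -2.0]
velocity = [0.0] * len(w)  # one extra float per model parameter
for _ in range(500):
    grad = list(w)  # gradient of the toy loss L = 0.5 * sum(w_i ** 2)
    w, velocity = momentum_step(w, grad, velocity)
```

The `velocity` buffer is the same size as the parameters, so momentum roughly doubles the per-parameter memory; Adam-style optimizers keep two such buffers (first and second moment estimates).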
How do we calculate the gradient, $\frac{\partial L}{\partial \vec{w}}$?
Chain rule of calculus, case of scalar-valued functions with multiple inputs:
Using this to write the gradient with respect to some parameter:
where
- $\frac{\partial L}{\partial w_i}$ is the gradient
- $a_j$ is the activation at that layer
- $\delta_j \equiv \frac{\partial L}{\partial a_j}$ is the delta at that layer
- $\frac{\partial a_j}{\partial w_i}$ is the rest of the derivative in following layers, recursively using the chain rule
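Putting the symbols listed above together, the gradient with respect to a single parameter can be sketched as follows. This assumes the standard multivariable chain rule, with $L$ depending on $w_i$ only through the activations $a_j$; the summation index is an assumption, not taken from the original.

$$
\frac{\partial L}{\partial w_i}
= \sum_j \frac{\partial L}{\partial a_j} \frac{\partial a_j}{\partial w_i}
= \sum_j \delta_j \frac{\partial a_j}{\partial w_i}
$$

The deltas are computed once per layer and reused for every weight feeding that layer, which is what makes backpropagation efficient.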
graph LR;
x1[x1]-->h1((h1))
x1-->h2((h2))
x1-->h3((h3))
x1-->h4((h4))
x2[x2]-->h1
x2-->h2
x2-->h3
x2-->h4
h1-->a1((a1))
h1-->a2((a2))
h2-->a1
h2-->a2
h3-->a1
h3-->a2
h4-->a1
h4-->a2
a1-->|logit a1|CEL[CrossEntropyLoss]
a2-->|logit a2|CEL
T[truth label]-->CEL
CEL-->L
L-->|backprop dL/da1|a1
In general, training involves:
- Forward pass: Calculate the inference of the model: $y = f(x; w)$, caching the activations, $a_j$, along the way
- Loss: $L$ (e.g. cross entropy)
- Backward pass: $\delta_j = \frac{\partial L}{\partial a_j}$
- Weight update: $-\eta \frac{\partial L}{\partial w_i}$
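The four steps above can be sketched for a single training sample in plain Python. This assumes the 2-4-2 network from the diagram, a ReLU hidden layer, and a softmax cross-entropy loss on the two logits. All function names, the initial weights, and the sample `x, label` are made up for illustration.

```python
import math
import random


def forward(params, x):
    """Forward pass; returns the logits and the cached hidden activations."""
    W1, b1, W2, b2 = params
    h = [max(0.0, sum(w * x_k for w, x_k in zip(row, x)) + b) for row, b in zip(W1, b1)]
    a = [sum(w * h_k for w, h_k in zip(row, h)) + b for row, b in zip(W2, b2)]
    return a, h


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])


def backward(params, x, h, logits, label):
    """Backward pass: propagate deltas from the loss back to every weight."""
    W1, b1, W2, b2 = params
    # Delta at the logits: dL/da = softmax(a) - one_hot(label)
    delta_a = softmax(logits)
    delta_a[label] -= 1.0
    grad_W2 = [[d * h_k for h_k in h] for d in delta_a]
    # Delta at the hidden layer: chain rule through W2, then through ReLU
    delta_h = [
        sum(W2[j][k] * delta_a[j] for j in range(len(delta_a))) * (1.0 if h[k] > 0 else 0.0)
        for k in range(len(h))
    ]
    grad_W1 = [[d * x_k for x_k in x] for d in delta_h]
    return grad_W1, list(delta_h), grad_W2, list(delta_a)


def train_step(params, x, label, lr=0.1):
    logits, h = forward(params, x)                    # 1. forward pass
    loss = cross_entropy(logits, label)               # 2. loss
    gW1, gb1, gW2, gb2 = backward(params, x, h, logits, label)  # 3. backward pass
    W1, b1, W2, b2 = params
    new_params = (                                    # 4. weight update
        [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, gW1)],
        [b - lr * g for b, g in zip(b1, gb1)],
        [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W2, gW2)],
        [b - lr * g for b, g in zip(b2, gb2)],
    )
    return new_params, loss


rng = random.Random(0)
params = (
    [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)],  # W1: 4 x 2
    [0.0] * 4,                                                   # b1
    [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)],  # W2: 2 x 4
    [0.0] * 2,                                                   # b2
)
x, label = [0.5, -1.0], 1
losses = []
for _ in range(20):
    params, loss = train_step(params, x, label)
    losses.append(loss)
```

In practice an autograd library generates the `backward` step automatically; writing it by hand here only serves to show where the cached activations and the deltas are used.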
See also:
- Automatic differentiation - "autograd"
- Backpropagation
- Johnson, J. (2017). Backpropagation for a linear layer.
- Parr, T. & Howard, J. (2018). The matrix calculus you need for deep learning.
Example training and test loss curves for a single training experiment:
Early stopping is a method of picking the optimally trained model by stopping training when the test loss is at a minimum.
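A minimal sketch of early stopping over a recorded loss curve. The `patience` parameter (how many epochs without improvement to tolerate before stopping) is an assumption added for illustration, as are the function name and the loss values.

```python
def early_stopping_epoch(test_losses, patience=2):
    """Return the epoch with the lowest loss seen before training is stopped."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(test_losses):
        if loss < best_loss:
            # New minimum: remember this epoch (and, in practice, save a checkpoint)
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training
            break
    return best_epoch


test_losses = [1.0, 0.7, 0.5, 0.45, 0.5, 0.6, 0.4]
best = early_stopping_epoch(test_losses)
```

Training stops at epoch 5 and the model from epoch 3 is kept, so the later, lower loss of 0.4 is never reached. This is the usual tradeoff when choosing how patient to be.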
Now consider what happens if we run additional experiments with larger models. Note that in the following plot, the curves are not losses for single experiments, but the final losses for many different experiments with increasing model sizes.
In the traditional way of thinking about overfitting, the rising test loss with model complexity is part of the bias-variance tradeoff.
But what is actually observed with increasing model size is that the test loss goes down again.
Thought of as a function of the size of the model,
in the classical machine learning regime
(
See also:
- Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off.
- Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt.
- MLU-explain: Bias-variance tradeoff
- MLU-explain: Double descent 1
- MLU-explain: Double descent 2
- Deep learning is a paradigm of machine learning that pushes the scale of the model and data.
- Neural networks are universal function approximators.
- SGD-like optimizers are the workhorses of deep learning.
- TODO: Softmax classification.
- Double descent is the surprising phenomenon that neural networks generalize better with many parameters.
Classical machine learning textbooks:
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.).
Deep learning textbooks:
- Bishop, C.M. (2024). Deep Learning: Foundations and Concepts.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.
- Prince, S.J.D. (2023). Understanding Deep Learning.
Online courses:
- Bekman, S. (2023). Machine Learning Engineering Open Book.
- Chollet, F. (2021). Deep Learning with Python.
- Labonne, M. (2023). Large Language Model Course.
- Microsoft. (2023). Generative AI for Beginners.
- PyTorch. (2024). Official PyTorch Documentary: Powering the AI Revolution.
- Ustyuzhanin, A. (2020). Deep Learning 101.
- Up next: Computer vision