This is a basic handcrafted neural network implementation. It's not much, but the goal was to prove to myself that I understood backpropagation.
We use only dense (linear) layers and sigmoid activation functions. The code is hardcoded to classify MNIST, but you can easily modify it for other purposes. We also take advantage of vectorized operations with numpy; a version with explicit loops is many times slower on my CPU.
The default network shape is (784, 40, 10), the minibatch size is 10, and the learning rate is 3.0. With the default hyperparameters the network achieves 95% accuracy.
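To make the defaults concrete, here is a minimal sketch of how a network with that shape could be set up in numpy. The function name, the Gaussian initialization, and the matrix/vector layout are illustrative assumptions, not necessarily the exact code in this repo.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation, applied element-wise to a whole layer at once.
    return 1.0 / (1.0 + np.exp(-z))

def init_net(shape=(784, 40, 10)):
    # One weight matrix and one bias vector per pair of consecutive layers.
    # Gaussian initialization is an assumption for illustration.
    weights = [np.random.randn(n_out, n_in) for n_in, n_out in zip(shape[:-1], shape[1:])]
    biases = [np.random.randn(n_out, 1) for n_out in shape[1:]]
    return weights, biases

# Default hyperparameters from the text.
MINIBATCH_SIZE = 10
LEARNING_RATE = 3.0
```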
I downloaded the MNIST data from http://yann.lecun.com/exdb/mnist/. The data is also included in this repository, unmodified.
- `init_net` initializes the network with the given shape.
- `forward` takes the input and predicts the label.
- `backward` implements the backpropagation algorithm, which is the heart of the neural network. Gradients are accumulated over a minibatch.
- `optimize` applies the accumulated gradients to the weights and biases.
- `zero_grad` zeroes the gradients between minibatches.
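Roughly, these functions fit together like the self-contained sketch below. It shrinks the network to a single layer and uses random data just to show the call order; the names and signatures are assumptions, not the repo's exact API.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single-layer "network" so the skeleton below actually runs.
W, b = rng.standard_normal((10, 784)), rng.standard_normal((10, 1))
grad_W, grad_b = np.zeros_like(W), np.zeros_like(b)

def zero_grad():
    grad_W[:], grad_b[:] = 0.0, 0.0

def forward(x):
    return sigmoid(W @ x + b)

def backward(x, y):
    # Accumulate gradients for a quadratic cost on this one-layer network.
    a = forward(x)
    dc_dz = (a - y) * a * (1 - a)
    grad_W[:] += dc_dz @ x.T
    grad_b[:] += dc_dz

def optimize(lr, batch_size):
    W[:] -= (lr / batch_size) * grad_W
    b[:] -= (lr / batch_size) * grad_b

# One minibatch of fake data, processed the way the functions are meant to be used.
minibatch = [(rng.standard_normal((784, 1)), rng.standard_normal((10, 1))) for _ in range(10)]
zero_grad()
for x, y in minibatch:
    forward(x)          # predict
    backward(x, y)      # accumulate gradients
optimize(lr=3.0, batch_size=len(minibatch))
```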
I like to think of a neural network as interconnected gears. I made the following picture, which helped me a lot. Note that the superscripts here denote layer numbers. They are in reverse order, starting from 0, which is the output layer.
Here we can easily see how the weights in a layer affect the final gear, i.e. the cost. If you rotate the W0 gear one full turn, how far does the cost gear rotate? That ratio is basically the gradient of the W0 gear.
With backpropagation, all we want to know is how much each weight affects the final cost. Once we know that, we can easily modify the weights to decrease the cost. The naive approach is to calculate ∂C/∂w directly for each weight: modify a weight just a bit, send the input again, and check how much the cost has changed. There is a big problem here. For each weight, we have to send the input again and compute the whole forward pass. That's obviously not practical. Backpropagation solves this problem: with a single forward pass (and one backward pass over the layers) we get the derivatives for every weight at once.
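The naive approach can be written down directly as a finite-difference estimate. Here's a sketch with made-up names (the `cost_fn` argument is assumed to run the full forward pass on the current weights); this is handy for sanity-checking backprop, but far too slow for training:

```python
import numpy as np

def numerical_gradient(cost_fn, W, eps=1e-5):
    # Nudge each weight a bit, rerun the whole forward pass, and see how
    # much the cost moved. One (here: two) full forward passes per weight.
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        cost_plus = cost_fn()
        W[idx] = old - eps
        cost_minus = cost_fn()
        W[idx] = old
        grad[idx] = (cost_plus - cost_minus) / (2 * eps)
    return grad
```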
Here is how it works. We start from the final gear (the cost) and ask ourselves which gear directly affects it. As you can tell from the image, the answer is A0, so we take the derivative ∂C/∂a0 and note it somewhere. Remember that this derivative only tells us how much A0 affects C. Next we ask which gear directly affects A0; it's Z0. We calculate ∂a0/∂z0. Next we ask which gear directly affects Z0. There are actually two, W0 and A-1. Let's continue with W0 first and calculate ∂z0/∂w0.
Now we have three pieces of information: how much W0 affects Z0, how much Z0 affects A0, and how much A0 affects the cost. We multiply these three derivatives, and what we get is how much W0 affects the final cost, i.e. W0's gradient. Check the image; this is the equation in the second group.
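In numpy, the chain for the output layer might look like the following sketch. It assumes a quadratic cost C = ½‖a0 − y‖², sigmoid activations, and the default (784, 40, 10) shape; the variable names mirror the derivatives in the text rather than the repo's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Forward pass, using the text's reverse indexing: layer -1 is the hidden
# layer, layer 0 is the output layer, and the input plays the role of A-2.
x = rng.standard_normal((784, 1))            # input (A-2)
W_m1, b_m1 = rng.standard_normal((40, 784)), rng.standard_normal((40, 1))
W0, b0 = rng.standard_normal((10, 40)), rng.standard_normal((10, 1))
y = np.zeros((10, 1)); y[3] = 1.0            # one-hot target

z_m1 = W_m1 @ x + b_m1                       # Z-1
a_m1 = sigmoid(z_m1)                         # A-1
z0 = W0 @ a_m1 + b0                          # Z0
a0 = sigmoid(z0)                             # A0

# The three derivatives from the text, multiplied together.
dc_da0 = a0 - y                 # ∂C/∂a0 for the quadratic cost
da0_dz0 = a0 * (1 - a0)         # ∂a0/∂z0, the sigmoid derivative
dc_dz0 = dc_da0 * da0_dz0       # how much Z0 affects the cost (the checkpoint)
dc_dw0 = dc_dz0 @ a_m1.T        # ∂z0/∂w0 is A-1, so this is ∂C/∂W0, i.e. W0's gradient
```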
Here's the crucial part: since we already calculated how much Z0 affects the cost, we can treat it as a checkpoint. Instead of calculating how much W-1 affects the cost directly, we can calculate how much it affects Z0, because we already know how much Z0 affects the cost. Finally, we multiply these two derivatives and recover the total effect, i.e. how much W-1 affects the final cost.
So let's calculate the gradient of W-1. Which gears directly affect Z0? They are W0 and A-1, but we already handled W0, and we're trying to get to W-1. Hence, we continue with A-1 and calculate ∂z0/∂a-1. Next, which gear directly affects A-1? It's Z-1. We calculate ∂a-1/∂z-1. Next, which gears directly affect Z-1?
Both A-2 and W-1. We're interested in W-1, so we calculate ∂z-1/∂w-1. Let's put it together: we know how much W-1 affects Z-1, how much Z-1 affects A-1, and how much A-1 affects Z0. We multiply these three derivatives and learn how much W-1 affects Z0. Once that's done, we multiply this by the gradient of Z0 to learn how much W-1 affects the final cost, which is the gradient of W-1. This is the ∂C/∂W-1 equation in the third group of equations.
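Continuing the sketch from the previous block, the same steps in numpy give W-1's gradient by reusing dc_dz0 as the checkpoint:

```python
# ∂z0/∂a-1 is W0, because Z0 = W0·A-1 + B0.
dc_da_m1 = W0.T @ dc_dz0            # how much A-1 affects the cost
da_m1_dz_m1 = a_m1 * (1 - a_m1)     # ∂a-1/∂z-1, the sigmoid derivative again
dc_dz_m1 = dc_da_m1 * da_m1_dz_m1   # the new checkpoint: how much Z-1 affects the cost
dc_dw_m1 = dc_dz_m1 @ x.T           # ∂z-1/∂w-1 is A-2 (the input), so this is ∂C/∂W-1
```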
As you may guess, Z-1 is now our new checkpoint, so we repeat the same process for the rest of the layers. For instance, to calculate the gradient of W-2, we follow exactly the steps of the previous section; only the indices change. Once we have calculated the gradients, the job of backpropagation is done. The optimizer applies those gradients to the actual weights and biases.
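For deeper networks the hand-written steps above become a loop: each iteration turns the current checkpoint into the previous layer's checkpoint and records the weight and bias gradients along the way. Here's a sketch under the same assumptions (quadratic cost, sigmoid everywhere, activations saved during the forward pass); the repo's `backward` may be organized differently.

```python
import numpy as np

def backward(weights, activations, y):
    # Gradients for a single sample. activations[0] is the input and
    # activations[-1] is the network's output, saved during the forward pass.
    grad_w = [np.zeros_like(w) for w in weights]
    grad_b = [np.zeros((w.shape[0], 1)) for w in weights]

    # Output layer: the hand-derived step from the text.
    a_out = activations[-1]
    dc_dz = (a_out - y) * a_out * (1 - a_out)
    grad_w[-1] = dc_dz @ activations[-2].T
    grad_b[-1] = dc_dz

    # Remaining layers: the same process, only the indices change.
    for l in range(2, len(weights) + 1):
        a = activations[-l]
        dc_dz = (weights[-l + 1].T @ dc_dz) * a * (1 - a)   # new checkpoint
        grad_w[-l] = dc_dz @ activations[-l - 1].T
        grad_b[-l] = dc_dz
    return grad_w, grad_b
```

The optimizer step then just subtracts `lr / batch_size` times these accumulated gradients from the weights and biases, as in the training-loop sketch earlier.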
Note that we never mentioned biases in this write-up or in the picture. That's because they are very easy to handle: while we're calculating the gradients of the weights, we implicitly calculate the gradients of the biases as well. Imagine another gear connected to Z0, named B0, just like the weight. How much B0 affects Z0 is a constant; ∂z0/∂b0 is actually 1. So how much B0 affects the cost is the same as how much Z0 affects it, therefore the gradient of B0 is the gradient of Z0, which we already calculated. You can see it in the `backward` function with the assignment `dc_db = dc_dz`.
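In terms of the earlier output-layer sketch, that argument is just one extra line:

```python
dz0_db0 = 1.0                 # ∂z0/∂b0: Z0 = W0·A-1 + B0, so B0 moves Z0 one-to-one
dc_db0 = dc_dz0 * dz0_db0     # the bias gradient equals the checkpoint itself
```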
The main source I used was Grant Sanderson's awesome neural network series, mostly the fourth video. Another helpful source was Casper Hansen's neural network tutorial.