Playing around with gradient descent and its limits.
Following Ali Rahimi's talk at NIPS 2017 (https://www.youtube.com/watch?v=Qi1Yry33TQE), I decided to explore the limits of gradient descent (GD) on ill-conditioned problems.
I take the same use case he mentions: a simple one-hidden-layer neural network with no non-linearities, which is equivalent to two consecutive matrix multiplications.
Data is generated randomly: 600 datapoints of dimension 13, with labels of dimension 4. The (13, 4) matrix mapping datapoints to labels (named A in the code), which we are trying to learn, is also generated randomly from a Gaussian distribution.
The trick used here to play on the condition number is to write A through its singular value decomposition,

A = U D Vᵀ,

where D is diagonal with Gaussian values and U and V are orthogonal. U and V are obtained by taking the first matrix of the QR decomposition of two random matrices.
A parameter named "ev" controls the weight added to the first singular value of this matrix (we add "ev" to the first element of D). This weight controls how ill-conditioned the problem is: the heavier it is, the smaller the learning rate GD needs in order to converge, if it converges at all.
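A minimal numpy sketch of this construction might look as follows; the seed, the variable names (rng, X, Y) and the no-noise assumption on the labels are mine for illustration, not necessarily what the scripts do:

```python
import numpy as np

rng = np.random.default_rng(0)           # seed is arbitrary
n, d_in, d_out = 600, 13, 4              # dimensions from the text
ev = 1.0                                 # weight added to the first singular value

# Orthogonal factors from the QR decomposition of two random Gaussian matrices
U, _ = np.linalg.qr(rng.normal(size=(d_in, d_in)))
V, _ = np.linalg.qr(rng.normal(size=(d_out, d_out)))

# Diagonal D with Gaussian values, "ev" added to its first element
s = rng.normal(size=d_out)
s[0] += ev
D = np.zeros((d_in, d_out))
D[:d_out, :d_out] = np.diag(s)

A = U @ D @ V.T                          # the (13, 4) matrix to recover

X = rng.normal(size=(n, d_in))           # 600 random datapoints of dimension 13
Y = X @ A                                # labels of dimension 4 (no noise assumed here)

sv = np.linalg.svd(A, compute_uv=False)
print("condition number of A:", sv.max() / sv.min())
```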
I compare GD with Newton's method here, with backpropagation implemented in three different ways:
_manually (this is a great exercise; see the sketch below)
_with Pytorch
_with Tensorflow
The 3 scripts are respectively np_manual.py, pytorch.py and tensorf.py.
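For the manual option, a stripped-down sketch of the forward/backward pass and the GD loop could look like this; the hidden dimension, variable names and learning rate are illustrative placeholders, not necessarily those used in np_manual.py:

```python
import numpy as np

def forward_backward(X, Y, W1, W2):
    """One forward/backward pass for the linear network X @ W1 @ W2 with MSE loss."""
    H = X @ W1                    # hidden activations (no non-linearity)
    P = H @ W2                    # predictions
    R = P - Y                     # residuals
    loss = np.mean(R ** 2)

    # Gradients of the mean-squared-error loss, derived by hand
    dP = 2.0 * R / R.size         # dL/dP
    dW2 = H.T @ dP                # dL/dW2
    dH = dP @ W2.T                # dL/dH
    dW1 = X.T @ dH                # dL/dW1
    return loss, dW1, dW2

def train(X, Y, hidden=8, lr=1e-2, epochs=100, seed=0):
    """Plain gradient descent; hidden size and lr are arbitrary placeholder values."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(X.shape[1], hidden))
    W2 = rng.normal(size=(hidden, Y.shape[1]))
    for _ in range(epochs):
        loss, dW1, dW2 = forward_backward(X, Y, W1, W2)
        W1 -= lr * dW1
        W2 -= lr * dW2
    return loss, W1, W2
```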
Since the dimension is not too large here, the Hessian needed for Newton-Raphson can easily be computed and inverted.
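As an illustration of the Newton-Raphson side, here is one possible way to form the full Hessian with PyTorch's torch.autograd.functional helpers and solve for the Newton step; the hidden size of 8 and the damping term are assumptions made for this sketch, and pytorch.py may compute the step differently:

```python
import torch

def newton_step(flat_params, X, Y, shapes, damping=1e-6):
    """One Newton-Raphson update on the flattened parameter vector."""
    def loss_fn(flat):
        # Unpack the flat vector into W1 and W2
        n1 = shapes[0][0] * shapes[0][1]
        W1 = flat[:n1].reshape(shapes[0])
        W2 = flat[n1:].reshape(shapes[1])
        return ((X @ W1 @ W2 - Y) ** 2).mean()

    grad = torch.autograd.functional.jacobian(loss_fn, flat_params)
    H = torch.autograd.functional.hessian(loss_fn, flat_params)
    # A small damping term keeps the linear solve well-posed even when the
    # Hessian of this factorised problem is singular or badly conditioned.
    H = H + damping * torch.eye(H.shape[0])
    step = torch.linalg.solve(H, grad.unsqueeze(1)).squeeze(1)
    return flat_params - step

# Example usage with random data of the stated dimensions
X = torch.randn(600, 13)
Y = torch.randn(600, 4)
shapes = [(13, 8), (8, 4)]                               # hidden size 8 is an assumption
params = torch.randn(sum(a * b for a, b in shapes))
for _ in range(20):
    params = newton_step(params, X, Y, shapes)
```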
Let's take an example with PyTorch and an "ev" value of 1, which is the order of magnitude of the condition number of A (in this case A typically has a condition number of 3 or 4). That is still quite small. The largest order of magnitude of learning rate that does not make gradient descent diverge (found by a line search over these orders of magnitude) is:
This appeared to be close to an ideal learning rate: gradient descent converged in around 20 epochs. However, it converged to a higher loss than Newton's method.
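The order-of-magnitude line search mentioned above can be sketched as a simple sweep over powers of ten. This reuses the hypothetical numpy train helper from the earlier sketch, and treats a non-finite final loss as divergence, which is cruder than a real check:

```python
import numpy as np

def largest_stable_lr(X, Y, exponents=range(0, -8, -1), epochs=50):
    """Try learning rates 1e0, 1e-1, ... and return the largest one whose final
    loss is still finite (a crude stand-in for "does not diverge")."""
    for k in exponents:
        lr = 10.0 ** k
        loss, _, _ = train(X, Y, lr=lr, epochs=epochs)   # `train` from the numpy sketch above
        if np.isfinite(loss):
            return lr, loss
    return None, None

best_lr, best_loss = largest_stable_lr(X, Y)
print(f"largest non-diverging learning rate: {best_lr}, loss: {best_loss}")
```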
Setting "ev", and hence the condition number, to a much higher value leads to gradient descent needing a smaller and smaller learning rate in order not to diverge.
For instance, let's now set "ev" to 10000.
The first learning rate that does not diverge is:
We have to go lower and lower to find learning rates that converge, but even then convergence is slowed down.