Using the Adalite optimizer on nanoGPT; similar to Lilith, but with lower memory usage.
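A quick sketch of what the swap looks like in a nanoGPT-style train script. The Adalite import path and constructor arguments here are assumptions (treating it as a standard torch.optim-style optimizer that takes an `lr` keyword), so check the actual package for the real signature; the model config is roughly the ~10M-parameter shakespeare-char setup.

```python
from model import GPT, GPTConfig   # nanoGPT's model definition
from adalite import Adalite        # assumed import path, not verified

# roughly the ~10M-parameter shakespeare-char config used as the baseline
model = GPT(GPTConfig(n_layer=6, n_head=6, n_embd=384,
                      block_size=256, vocab_size=65, dropout=0.2))

# baseline: nanoGPT's built-in AdamW setup
# optimizer = model.configure_optimizers(weight_decay=1e-1, learning_rate=3e-4,
#                                        betas=(0.9, 0.95), device_type='cuda')

# swap: Adalite at its defaults except lr; 1e-4 is the setting that stayed stable (Test 3)
optimizer = Adalite(model.parameters(), lr=1e-4)
```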
- Test 4: going lower, lr 8e-5 and 1e-5. At 1e-5 the lr appears too low to produce a significant change, though it might reach a lower final loss than the other lrs if allowed to run for more steps?
- Test 3: wait, what? A lower lr of 1e-4 keeps the loss dropping steadily, and val loss is better than AdamW's by about 0.2?
- Test 2: a higher lr doesn't work at all (tried 1e-3/8e-4/5e-4; those go NaN at roughly 400/800/2000 steps respectively). There are strange lr issues in this optimizer (see the NaN-guard sketch after these notes).
- Test 1: Adalite at all the defaults except lr=3e-4; initially used the nanoGPT default of 1e-2, but that went NaN in about 100 steps. The AdamW baseline here is the default Karpathy shakespeare-GPT; both models are ~10M parameters, trained on the tiny-shakespeare dataset. The two take about the same average time, ~100 ms per step for 16k tokens. AdamW is clearly superior here; will have to see whether my hyper-params are bad, whether it's the scheduler, or whether batch size helps.
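Since the higher-lr runs all blow up within a few hundred steps, a cheap guard in the training loop saves time by aborting as soon as the loss goes non-finite. This is not part of nanoGPT's default train.py, just a minimal sketch wrapped around its usual step (`get_batch` and `max_iters` as in the stock script):

```python
import math

for iter_num in range(max_iters):            # max_iters as in nanoGPT's train.py
    X, Y = get_batch('train')                # stock nanoGPT batch helper
    logits, loss = model(X, Y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if not math.isfinite(loss.item()):       # catches both NaN and inf
        print(f"loss diverged at step {iter_num}, stopping this run")
        break
```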