Using the Adalite optimizer on nanoGPT; similar to Lilith, but with lower memory usage.
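A quick sketch of what the swap looks like in a nanoGPT-style train script. The Adalite import path and constructor arguments here are assumptions (treating it as a standard torch.optim-style optimizer that takes an `lr` keyword), so check the actual package for the real signature; the model config is roughly the ~10M-parameter shakespeare-char setup.

```python
from model import GPT, GPTConfig   # nanoGPT's model definition
from adalite import Adalite        # assumed import path, not verified

# roughly the ~10M-parameter shakespeare-char config used as the baseline
model = GPT(GPTConfig(n_layer=6, n_head=6, n_embd=384,
                      block_size=256, vocab_size=65, dropout=0.2))

# baseline: nanoGPT's built-in AdamW setup
# optimizer = model.configure_optimizers(weight_decay=1e-1, learning_rate=3e-4,
#                                        betas=(0.9, 0.95), device_type='cuda')

# swap: Adalite at its defaults except lr; 1e-4 is the setting that stayed stable (Test 3)
optimizer = Adalite(model.parameters(), lr=1e-4)
```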
- Test 4: going lower, lr 8e-5 and 1e-5. At 1e-5 the lr appears too low to produce a significant change, though it might reach a lower final loss than the other lrs if allowed to run for more steps?
- Test 3: wait, what? A lower lr of 1e-4 keeps the loss dropping steadily, and val loss is better than AdamW's by about 0.2?
- Test 2: a higher lr doesn't work at all (tried 1e-3/8e-4/5e-4; those go NaN at roughly 400/800/2000 steps respectively). There are strange lr issues in this optimizer (see the NaN-guard sketch after these notes).
- Test 1: Adalite at all the defaults except lr=3e-4; initially used the nanoGPT default of 1e-2, but that went NaN in about 100 steps. The AdamW baseline here is the default Karpathy shakespeare-GPT; both models are ~10M parameters, trained on the tiny-shakespeare dataset. The two take about the same average time, ~100 ms per step for 16k tokens. AdamW is clearly superior here; will have to see whether my hyper-params are bad, whether it's the scheduler, or whether batch size helps.
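Since the higher-lr runs all blow up within a few hundred steps, a cheap guard in the training loop saves time by aborting as soon as the loss goes non-finite. This is not part of nanoGPT's default train.py, just a minimal sketch wrapped around its usual step (`get_batch` and `max_iters` as in the stock script):

```python
import math

for iter_num in range(max_iters):            # max_iters as in nanoGPT's train.py
    X, Y = get_batch('train')                # stock nanoGPT batch helper
    logits, loss = model(X, Y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if not math.isfinite(loss.item()):       # catches both NaN and inf
        print(f"loss diverged at step {iter_num}, stopping this run")
        break
```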