Remove support for huge models #1021
The offending code is this:
> #1022 (BIT-601: Scaling law on EMA loss) proposes computing the scaling law, and the resultant effective number of model parameters, on the exponentially moving average (EMA) loss for a server, which should greatly improve the definition of the result. Initially the scaling law was computed on per-batch results, partly due to a validation configuration that pre-computed batchwise stats and then averaged them into a final EMA result. This configuration is now overridden so that the scaling law is computed on the EMA'd loss, and the model-size result is itself averaged into an EMA store to allow for a zero push penalty in the case of non-responsiveness. The neural language model scaling law [1] is typically meant to be computed on a loss averaged over the entire training data, asymptotically approaching the natural entropy of text according to [2]. Currently it is computed within-batch only, which frequently sees losses below 1.69 (the natural entropy of text), in which case the clamping is deleterious. From [2]:
>
> [1] (OpenAI scaling laws) Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv:2001.08361 (2020).
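For context, here is a minimal sketch (not the validator's actual code) of the two pieces the quoted proposal describes: an EMA over per-batch losses, and the inverse of the Kaplan et al. [1] scaling law L(N) = (N_c / N)^α_N used to turn a loss into an effective parameter count, with the 1.69 clamp at issue. The constants N_c ≈ 8.8e13 and α_N ≈ 0.076 are the power-law fits reported in [1]; the function names and the EMA coefficient are illustrative assumptions.

```python
import math

LOSS_FLOOR = 1.69   # assumed natural entropy of text (nats/token); the clamp at issue
N_C = 8.8e13        # N_c from Kaplan et al. [1]
ALPHA_N = 0.076     # alpha_N from Kaplan et al. [1]

def ema_update(prev: float, new: float, alpha: float = 0.05) -> float:
    """Exponential moving average used to smooth noisy per-batch losses."""
    return (1.0 - alpha) * prev + alpha * new

def loss_to_effective_params(loss: float) -> float:
    """Invert L(N) = (N_c / N) ** alpha_N to get an effective parameter count.

    The max() below is the clamp: any loss under 1.69 is treated as 1.69,
    so further improvements in loss earn no additional effective parameters.
    """
    clamped_loss = max(loss, LOSS_FLOOR)
    return math.exp(math.log(N_C) - math.log(clamped_loss) / ALPHA_N)
```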
Describe the bug
The scaling law upgrade is actually backfiring on people running fine-tuned miners, because they are hitting a "glass ceiling" on loss. The scaling law seems to favor small miners with low loss.
To Reproduce
Steps to reproduce the behavior:
bittensor/bittensor/_neuron/text/core_validator/__init__.py, line 1000 (commit d532ad5)
Expected behavior
Reproducing the steps above with a better fine-tuned model should return a higher score each time it is run: as the loss decreases, the effective parameter count (and hence the weight) should keep increasing. A loss of 0 is theoretically possible, so results should not plateau at a loss of 1.69.
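One way to read this expectation, reusing the constants and names from the sketch above: without the lower clamp, every further reduction in loss keeps translating into a larger effective parameter count instead of plateauing at the 1.69 floor. This only illustrates the expected monotonic behavior; it is not the fix proposed in this issue.

```python
def loss_to_effective_params_unclamped(loss: float) -> float:
    """Same inverse scaling law, but with no lower clamp: losses below 1.69
    still increase the effective parameter count. Requires loss > 0."""
    return math.exp(math.log(N_C) - math.log(loss) / ALPHA_N)
```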
Environment:
Additional context
A fine-tuned 2.7B receives the same weights as an untuned 6B and an untuned 20B. This triggered an investigation into why that would be the case. It turns out the scaling-law calculation clamps the loss at a minimum of 1.69, a figure that is not mentioned in the corresponding paper and that some consider an incorrect estimate; it is also contradicted in practice, since within-batch losses below 1.69 are frequently observed.
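Using the sketch above, the equal-weights observation is straightforward to reproduce numerically: once each model's within-batch loss reaches or falls below the 1.69 floor, all of them map to the same clamped effective parameter count, no matter how much lower the fine-tuned model's loss actually is. The loss values below are hypothetical, chosen only to illustrate the clamp.

```python
# Hypothetical within-batch losses, all at or below the 1.69 floor.
observations = [("fine-tuned 2.7B", 1.45), ("untuned 6B", 1.66), ("untuned 20B", 1.62)]

for name, loss in observations:
    print(f"{name}: effective params = {loss_to_effective_params(loss):.3e}")
# All three print the same clamped value (about 8.8e10), so the validator
# sees them as identically sized and assigns them the same weights.
```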