
Remove support for huge models #1021

Closed
mrseeker opened this issue Dec 1, 2022 · 2 comments
mrseeker commented Dec 1, 2022

Describe the bug
The scaling law upgrade is actually backfiring on people running fine-tuned miners, because they are hitting a "glass ceiling" when it comes to loss. Scaling laws seem to prefer small miners with low loss.

To Reproduce
Steps to reproduce the behavior:

  1. Call scaling_law_loss_to_params(loss) with a loss of 2.
  2. Call the same function with a loss of 1.69.
  3. Call the same function with a loss of 1.5.
  4. Call the same function with a loss of 1.
  5. Call the same function with a loss of 0.
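The steps above can be sketched in pure Python. This is a hypothetical reconstruction of the clamped scaling-law inversion under discussion, not the exact bittensor implementation; the constants (8.8e13 critical parameter count, 0.076 exponent, 1.69 lower bound) follow the scaling-law literature cited later in this thread.

```python
import math

def scaling_law_loss_to_params(loss: float) -> float:
    """Estimate an effective parameter count from a language-model loss.

    Sketch only: inverts L(N) ~ (N_c / N)^0.076 after clamping the loss
    to an assumed lower bound of 1.69 (the "entropy of natural text").
    """
    clamped = max(loss, 1.69)  # this clamp is the "glass ceiling"
    return math.exp(math.log(8.8e13) - math.log(clamped) / 0.076)

# Every loss at or below 1.69 collapses to the same parameter estimate:
for loss in (2.0, 1.69, 1.5, 1.0, 0.0):
    print(loss, scaling_law_loss_to_params(loss))
```

Running this shows the reported behavior: the estimate grows as loss drops from 2 to 1.69, then flatlines for 1.5, 1, and 0.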

Expected behavior
Each successive call (with a lower loss) should return a higher estimated parameter count. A loss of 0 is theoretically possible.

Environment:

  • OS and Distro: N/A
  • Bittensor Version: 3.5.0

Additional context
A fine-tuned 2.7B model receives the same weights as an untuned 6B and an untuned 20B. This triggered an investigation into why that would be the case. It turns out the scaling law clamps loss to a minimum of 1.69, a bound that is not stated as a hard floor in the corresponding paper and that some consider an incorrect estimate. Observed losses below 1.69 contradict it in practice.

mrseeker commented Dec 1, 2022

The offending code is this:

torch.log(torch.clamp(loss, 1.69)) / 0.076  # loss lower bound 1.69 is entropy of natural text
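The effect of that clamp can be seen in isolation. Here max() is a pure-Python stand-in for torch.clamp(loss, 1.69); the function name is illustrative:

```python
import math

def clamped_term(loss: float) -> float:
    """The offending sub-expression: log of the clamped loss over 0.076."""
    return math.log(max(loss, 1.69)) / 0.076

print(clamped_term(1.69))  # identical output...
print(clamped_term(1.0))   # ...for any loss at or below the bound
```

Because the clamp fires before the log, any improvement in loss below 1.69 is invisible to the rest of the computation.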

opentaco commented Dec 1, 2022

#1022 BIT-601 Scaling law on EMA loss proposes computing the scaling law and the resultant effective number of model parameters on the exponentially moving average loss for a server, which should greatly improve the definition of the result.

Initially the scaling law was computed on per-batch results, partly due to a validation configuration that pre-computed batchwise stats and then averaged them into a final EMA result. This configuration is now overridden so that the scaling law is computed on the EMA'd loss, and the model size result is then itself averaged into an EMA store to allow for zero push penalty in the case of non-responsiveness.
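The change described above, smoothing the loss first and applying the scaling law once to the result, can be sketched as follows. The smoothing factor alpha and the update rule are assumptions for illustration, not the values used in #1022:

```python
def ema_update(prev: float, new: float, alpha: float = 0.1) -> float:
    """Exponential moving average: blend a new observation into the store."""
    return (1 - alpha) * prev + alpha * new

# Per-batch losses; individual batches can dip below 1.69...
losses = [2.1, 1.8, 1.6, 1.7, 1.5]

ema = losses[0]
for loss in losses[1:]:
    ema = ema_update(ema, loss)

# ...but the smoothed loss tracks the long-run average, so the scaling
# law (and its 1.69 clamp) is applied to a far more stable quantity.
print(ema)
```

Applying the clamp to the EMA rather than to each batch avoids the case where noisy within-batch losses below 1.69 are discarded by the clamp.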

The neural language model scaling law [1] is typically meant to be computed on a loss averaged over the entire training data, asymptotically approaching the natural entropy of text according to [2]. Currently it is computed within-batch only, which frequently sees losses below 1.69 (the natural entropy of text), in which case the clamping is deleterious.

From [2]:

The first term 𝐸 captures the loss for an ideal generative process on the data distribution, and should correspond to the entropy of natural text.
The loss comprises three terms: the Bayes risk 𝐸, i.e. the minimal loss achievable for next-token prediction on the full distribution 𝑃, a.k.a the “entropy of natural text.”; ...
Empirically, we find after fitting (2) that 𝐸 = 1.69 ...
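For reference, the parametric fit from [2] that the quoted E belongs to can be written as below; the constants are as reported in the paper's fitted form, quoted here from memory and worth double-checking against the source:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28
```

Here N is the parameter count and D the number of training tokens; E is the irreducible term that the 1.69 clamp hard-codes.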

[1] (OpenAI scaling laws) Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv:2001.08361 (2020)
[2] (DeepMind) Hoffmann, Jordan, et al. "Training Compute-Optimal Large Language Models." arXiv preprint arXiv:2203.15556 (2022).

@mrseeker mrseeker closed this as completed Jan 5, 2023