
Remove support for huge models #1021

Closed
mrseeker opened this issue Dec 1, 2022 · 2 comments
mrseeker commented Dec 1, 2022

Describe the bug
The scaling law upgrade is actually backfiring on people running fine-tuned miners, because they are hitting a "glass ceiling" when it comes to loss. Scaling laws seem to prefer small miners with low loss.

To Reproduce
Steps to reproduce the behavior:

  1. Call scaling_law_loss_to_params(loss) with a loss of 2.
  2. Call the same function with a loss of 1.69.
  3. Call the same function with a loss of 1.5.
  4. Call the same function with a loss of 1.
  5. Call the same function with a loss of 0.
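The steps above can be sketched in pure Python. This is a hypothetical reconstruction of the clamped scaling-law inversion under discussion, not the exact bittensor implementation; the constants (8.8e13 critical parameter count, 0.076 exponent, 1.69 lower bound) follow the scaling-law literature cited later in this thread.

```python
import math

def scaling_law_loss_to_params(loss: float) -> float:
    """Estimate an effective parameter count from a language-model loss.

    Sketch only: inverts L(N) ~ (N_c / N)^0.076 after clamping the loss
    to an assumed lower bound of 1.69 (the "entropy of natural text").
    """
    clamped = max(loss, 1.69)  # this clamp is the "glass ceiling"
    return math.exp(math.log(8.8e13) - math.log(clamped) / 0.076)

# Every loss at or below 1.69 collapses to the same parameter estimate:
for loss in (2.0, 1.69, 1.5, 1.0, 0.0):
    print(loss, scaling_law_loss_to_params(loss))
```

Running this shows the reported behavior: the estimate grows as loss drops from 2 to 1.69, then flatlines for 1.5, 1, and 0.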

Expected behavior
Each successive call (with a lower loss) should return a higher estimated parameter count. A loss of 0 is theoretically possible.

Environment:

  • OS and Distro: N/A
  • Bittensor Version: 3.5.0

Additional context
A fine-tuned 2.7B model receives the same weights as an untuned 6B and an untuned 20B. This triggered an investigation into why that would be the case. It turns out the scaling law clamps loss to a minimum of 1.69, a bound that is not stated as a hard floor in the corresponding paper and that some consider an incorrect estimate. Observed losses below 1.69 contradict it in practice.

mrseeker commented Dec 1, 2022

The offending code is this:

torch.log(torch.clamp(loss, 1.69)) / 0.076  # loss lower bound 1.69 is entropy of natural text
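The effect of that clamp can be seen in isolation. Here max() is a pure-Python stand-in for torch.clamp(loss, 1.69); the function name is illustrative:

```python
import math

def clamped_term(loss: float) -> float:
    """The offending sub-expression: log of the clamped loss over 0.076."""
    return math.log(max(loss, 1.69)) / 0.076

print(clamped_term(1.69))  # identical output...
print(clamped_term(1.0))   # ...for any loss at or below the bound
```

Because the clamp fires before the log, any improvement in loss below 1.69 is invisible to the rest of the computation.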

opentaco commented Dec 1, 2022

#1022 BIT-601 Scaling law on EMA loss proposes computing the scaling law and the resultant effective number of model parameters on the exponentially moving average loss for a server, which should greatly improve the definition of the result.

Initially the scaling law was computed on per-batch results, partly due to a validation configuration that pre-computed batchwise stats and then averaged them into a final EMA result. This configuration is now overridden so that the scaling law is computed on the EMA'd loss, and the model size result is then itself averaged into an EMA store to allow for zero push penalty in the case of non-responsiveness.
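The change described above, smoothing the loss first and applying the scaling law once to the result, can be sketched as follows. The smoothing factor alpha and the update rule are assumptions for illustration, not the values used in #1022:

```python
def ema_update(prev: float, new: float, alpha: float = 0.1) -> float:
    """Exponential moving average: blend a new observation into the store."""
    return (1 - alpha) * prev + alpha * new

# Per-batch losses; individual batches can dip below 1.69...
losses = [2.1, 1.8, 1.6, 1.7, 1.5]

ema = losses[0]
for loss in losses[1:]:
    ema = ema_update(ema, loss)

# ...but the smoothed loss tracks the long-run average, so the scaling
# law (and its 1.69 clamp) is applied to a far more stable quantity.
print(ema)
```

Applying the clamp to the EMA rather than to each batch avoids the case where noisy within-batch losses below 1.69 are discarded by the clamp.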

The neural language model scaling law [1] is typically meant to be computed on a loss averaged over the entire training data, asymptotically approaching the natural entropy of text according to [2]. Currently it is computed within-batch only, which frequently sees losses below 1.69 (the natural entropy of text), in which case the clamping is deleterious.

From [2]:

The first term 𝐸 captures the loss for an ideal generative process on the data distribution, and should correspond to the entropy of natural text.
The loss comprises three terms: the Bayes risk 𝐸, i.e. the minimal loss achievable for next-token prediction on the full distribution 𝑃, a.k.a the “entropy of natural text.”; ...
Empirically, we find after fitting (2) that 𝐸 = 1.69 ...
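For reference, the parametric fit from [2] that the quoted E belongs to can be written as below; the constants are as reported in the paper's fitted form, quoted here from memory and worth double-checking against the source:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28
```

Here N is the parameter count and D the number of training tokens; E is the irreducible term that the 1.69 clamp hard-codes.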

[1] (OpenAI scaling laws) Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv:2001.08361 (2020)
[2] (DeepMind) Hoffmann, Jordan, et al. "Training Compute-Optimal Large Language Models." arXiv preprint arXiv:2203.15556 (2022).

@mrseeker mrseeker closed this as completed Jan 5, 2023