Compute model param count once #204
base: main
Conversation
@TevenLeScao, while we are at it, why do we print:

for the whole model? What's the practical point of the first one? I'm trying to think of what can be done with this information; I surely am missing something... Is this info of any practical use for the pipe stages, when we print the same per process? And I think the 2nd one is misleading, as it's not w/o embeddings; it's without the repeat count of tied params. May I suggest that it says instead:

Thanks. And if agreed for the latter, then we can adjust the constant name accordingly.
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Jake Tae <jaesungtae@gmail.com>
Hey! The number without embeddings is actually just that: the number of non-embedding parameters, not the number of unique parameters. This is the relevant number to estimate the loss against, which is why we're using it in the tensorboards for arch-and-scaling. I wanted to also print the exact model parameters (including embeddings and without double-counts) but haven't managed to get unique counting to work.
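To make the distinction concrete, here is a minimal sketch on a hypothetical toy module (not the repo's actual code). Within one process, unique counting is easy because `nn.Module.parameters()` de-duplicates shared `Parameter` objects; the hard part is that tied copies living on other pipeline ranks are not visible locally:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy GPT-style module with a tied embedding / LM head."""
    def __init__(self, vocab=100, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.block = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying

model = TinyLM()

# parameters() skips duplicate Parameter objects, so the tied
# lm_head/embed weight is counted once -- but only within this process.
total = sum(p.numel() for p in model.parameters())

# "without embeddings": the count relevant for scaling-law loss estimates.
non_embedding = total - model.embed.weight.numel()

print(f"unique params (this process): {total}")  # 1872
print(f"non-embedding params: {non_embedding}")  # 272
```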
OK, then something is very wrong in the reporting, e.g. for the 104B model it prints:

The formula we use comes from https://github.com/bigscience-workshop/bigscience/blob/master/experiments/gpt2-utils.md#calculate-model-size. For:

The difference is 60M params, so embedding params are ignored when doing the math by hand and the simplified formula is used instead. So:

But there is no 125B anywhere in sight, so the
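For reference, a sketch of that simplified formula as I read the linked doc (a transcription to verify against it, not the canonical source; `h` = hidden size, `l` = layers, `s` = sequence length, `v` = vocab size):

```python
def calc_params(l: int, h: int, v: int, s: int) -> int:
    # Per transformer layer: attention (4h^2 + 4h) + MLP (8h^2 + 5h)
    # + two layernorms (4h) ~= 12h^2 + 13h. On top of that: word
    # embeddings (v*h), position embeddings (s*h), final layernorm (2h).
    return l * (12 * h**2 + 13 * h) + v * h + s * h + 2 * h
```

As a usage example with a hypothetical small config, `calc_params(l=8, h=512, v=50257, s=1024)` gives ~51.5M params, roughly half of them in the `v*h` word embeddings.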
@TevenLeScao, I looked deeper and we have wrong counting for PP, e.g. see:

First, let's do some approximate math by hand for this config:
gives:
So we know we are dealing with a 52M model, with about half the params taken by the word embeddings. Now run the start of the training and look at the logs (I filtered out all but the relevant parts). The uppercase logs are from deepspeed's pipe; the last line is ours.
So for gpu=1 and gpus=2/tp=2 all is good, but once pp>1 is involved it's wrong. It reports:

Here is the code that we need to borrow to do it correctly:
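Since the link above is elided, here is a sketch modeled on DeepSpeed's pipe-engine counting (an approximation of the referenced code, not a verbatim copy; it assumes `torch.distributed` is initialized and the module exposes a DeepSpeed-style `tied_comms` dict):

```python
import torch
import torch.distributed as dist

def count_pipe_params(module, mp_group, device):
    # Local count on this rank (one pipeline stage).
    num_params = sum(p.numel() for p in module.parameters())

    # Tied weights (e.g. embeddings) are replicated on every rank that
    # uses them; only the lowest rank in each tie group counts them.
    unique_params = num_params
    for name, d in getattr(module, 'tied_comms', {}).items():
        if dist.get_rank() != min(d['ranks']):
            unique_params -= sum(p.numel() for p in d['module'].parameters())

    # Sum both counts over all ranks holding this model replica
    # (pipeline * tensor parallel ranks).
    counts = torch.LongTensor([num_params, unique_params]).to(device)
    dist.all_reduce(counts, group=mp_group)
    return counts[0].item(), counts[1].item()
```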
@stas00 @TevenLeScao I assume the PP > 1 problem is related to the warning in megatron/utils.py, line 279 (at c9afebc).

And I also suspect this is the last untied knot in #40, which was followed up by #99. Just trying to put the pieces together to add clarity!
Yes, and I pointed to the code that leads to the correct data! The current 20-50% over-reported size gives a very different picture from what is really happening. Do you want to tackle that, @jaketae?
Yes, I'd love to take it. Will take a look at the DS reference code today. Thanks!
And to debug while you're working on it, you may choose to do the same as I did here, that is, tweaking:

to 3 different setups and checking that the deepspeed (uppercase) log matches ours.
@stas00 I'm trying to test the code, but haven't figured out the best way to go about it. Do you modify entries in run.sh?
Both outputs already show up in the same log file. Please see #204 (comment) - I just filtered that information out from the rest of the logged info. You just change the setup in the script, yes. I don't know what run.sh is; I use the following script:

If you want to use it, you will need to adjust the paths, as they are hardcoded to my setup. But the main point I'm trying to convey is that I just tweaked this section of the above script 3 times, as described in the comment I linked to above, and re-ran the script.
Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
This PR fixes #203 by using the `args` global variable holder to save and access the model parameter count during gigaflops counting. This is sensible given that the number of model parameters stays constant throughout training, making it unnecessary to call the model param count function at every iteration.

Additionally fixes #123.
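A minimal sketch of the idea (the helper and attribute names here approximate the repo's megatron/utils.py and are assumptions, not verified signatures):

```python
from megatron import get_args  # Megatron's global-args accessor
from megatron.utils import get_parameters_in_billions  # assumed helper name

def gigaflops_per_second(model, iteration_time, batch_size):
    args = get_args()
    # The parameter count is constant across training, so compute it
    # once and cache it on the global args holder.
    if getattr(args, 'parameters_in_billions', None) is None:
        args.parameters_in_billions = get_parameters_in_billions(model)
    # Rough 6*N*D estimate of training FLOPs for one iteration.
    flops = 6 * args.parameters_in_billions * 1e9 * batch_size * args.seq_length
    return flops / iteration_time / 1e9
```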