Need model size dumped at init #123

Open
stas00 opened this issue Oct 4, 2021 · 5 comments · May be fixed by #204
Labels: Good First Issue (Good for newcomers)

Comments

@stas00 (Contributor) commented Oct 4, 2021

We need a diagnostic dump of the total model size during framework init. Currently we only get a per-rank report, not the total:

 > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (2, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 1986465792
 > number of parameters on (tensor, pipeline) model parallel rank (3, 7): 1986498560

Later on, the ZeRO engine does dump the right number, but it is buried among multiple other figures and repeated on every rank:

[2021-10-02 16:08:53,028] [INFO] [engine.py:134:__init__] RANK=0 STAGE=0 LAYERS=7 [0, 7) STAGE_PARAMS=1986465792 (1986.466M) TOTAL_PARAMS=57778896896 (57778.897M) UNIQUE_PARAMS=56814206976 (56814.207M)

But ideally we just want a print like:

Model size: 57B (57778896896 params)

Just on rank 0.
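
For illustration, something along these lines would produce that output (a minimal sketch, not existing Meg-DS code; assumes torch.distributed is initialized and model is this rank's shard):

# Rough sketch: count this rank's params, sum across ranks, and print once on
# global rank 0. With the NCCL backend the tensor needs to live on the GPU.
# With data parallelism the reduction would have to be limited to a single
# model-parallel group, otherwise every replica gets counted again.
import torch
import torch.distributed as dist

def print_model_size(model, group=None):
    local = sum(p.numel() for p in model.parameters())
    total = torch.tensor(local, dtype=torch.long, device="cuda")
    dist.all_reduce(total, op=dist.ReduceOp.SUM, group=group)
    if dist.get_rank() == 0:
        n = total.item()
        print(f"Model size: {n // 10**9}B ({n} params)")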

Thanks.

stas00 added the Good First Issue (Good for newcomers) label on Oct 4, 2021
@jtboing commented Oct 5, 2021

I think I can take this issue. However, I need to know: what do you do to get this diagnostics dump?

@jtboing commented Oct 5, 2021

Also, does the dump happen when starting the workflows?

@stas00 (Contributor, Author) commented Oct 5, 2021

Thank you for offering to work on this, @jtboing

We, the BS group, haven't added anything for this functionality yet, so it's entirely up to you how you do it - please have a look at the various info logged during Meg-DS startup and add the print where it feels right. Probably the best place is where the model is created, since you can then easily query the params.

I don't think it really matters where, other than that we could easily grep for something like:

grep "Model size" log.txt

Here is my cheatsheet, if it helps:

# calculate the number of parameters:
#
# 1. count all params
sum(p.numel() for p in model.parameters())
#
# 2. avoid double counting shared parameters (only if there is a shared storage(), normal tied vars don't have this issue, as model.parameters() doesn't return shared vars)
sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
#
# 3. count only the trainable parameters:
pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
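
As a quick aside on why snippet 2 matters (a toy example, nothing to do with Meg-DS itself): two distinct Parameter objects backed by the same storage are both returned by model.parameters(), so the plain sum counts that storage twice while the data_ptr() dedup counts it once:

import torch

a = torch.nn.Linear(4, 4, bias=False)
b = torch.nn.Linear(4, 4, bias=False)
b.weight.data = a.weight.data   # two Parameter objects sharing one storage
model = torch.nn.Sequential(a, b)

print(sum(p.numel() for p in model.parameters()))                                  # 32 - storage counted twice
print(sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values()))   # 16 - deduplicated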

@jtboing commented Nov 25, 2021

Hello. Sorry this hasn't been done sooner; I am trying to get through it now. I am looking for the Meg-DS startup script/process. Can you point me to which script/process initiates the framework init?

@stas00 (Contributor, Author) commented Nov 25, 2021

We have already started sorting it out here: #204 (as a side effect of another need).

stas00 linked a pull request (#204) on Nov 25, 2021 that will close this issue