Need model size dumped at init #123

Open
stas00 opened this issue Oct 4, 2021 · 5 comments · May be fixed by #204
Labels: Good First Issue (Good for newcomers)

Comments

@stas00 (Contributor) commented Oct 4, 2021

We need a diagnostic dump of the total model size during framework init. Currently we only get a per-rank report, not the total:

 > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (2, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 1986465792
 > number of parameters on (tensor, pipeline) model parallel rank (3, 7): 1986498560

Later on, the ZeRO engine does dump the right number, but it is buried among multiple other figures and repeated on every rank:

[2021-10-02 16:08:53,028] [INFO] [engine.py:134:__init__] RANK=0 STAGE=0 LAYERS=7 [0, 7) STAGE_PARAMS=1986465792 (1986.466M) TOTAL_PARAMS=57778896896 (57778.897M) UNIQUE_PARAMS=56814206976 (56814.207M)

But ideally we just want a print like:

Model size: 57B (57778896896 params)

Just on rank 0.
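
For illustration, something along these lines would produce that output (a minimal sketch, not existing Meg-DS code; assumes torch.distributed is initialized and model is this rank's shard):

# Rough sketch: count this rank's params, sum across ranks, and print once on
# global rank 0. With the NCCL backend the tensor needs to live on the GPU.
# With data parallelism the reduction would have to be limited to a single
# model-parallel group, otherwise every replica gets counted again.
import torch
import torch.distributed as dist

def print_model_size(model, group=None):
    local = sum(p.numel() for p in model.parameters())
    total = torch.tensor(local, dtype=torch.long, device="cuda")
    dist.all_reduce(total, op=dist.ReduceOp.SUM, group=group)
    if dist.get_rank() == 0:
        n = total.item()
        print(f"Model size: {n // 10**9}B ({n} params)")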

Thanks.

stas00 added the Good First Issue (Good for newcomers) label on Oct 4, 2021
@jtboing commented Oct 5, 2021

I think I can take this issue. However, I need to know: what do you do to get this diagnostics dump?

@jtboing commented Oct 5, 2021

Also, does the dump happen when starting the workflows?

@stas00 (Contributor, Author) commented Oct 5, 2021

Thank you for offering to work on this, @jtboing

We, the BS group, haven't added anything for this functionality yet, so it's entirely up to you how you do it - please have a look at the various info logged during Meg-DS startup and add the print where it feels right. Probably the best place is where the model is created, since you can then easily query the params.

I don't think it really matters where, other than that we could easily grep for something like:

grep "Model size" log.txt

Here is my cheatsheet, if it helps:

# calculate the number of parameters:
#
# 1. count all params
sum(p.numel() for p in model.parameters())
#
# 2. avoid double counting shared parameters (only if there is a shared storage(), normal tied vars don't have this issue, as model.parameters() doesn't return shared vars)
sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
#
# 3. count only the trainable parameters:
pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
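
As a quick aside on why snippet 2 matters (a toy example, nothing to do with Meg-DS itself): two distinct Parameter objects backed by the same storage are both returned by model.parameters(), so the plain sum counts that storage twice while the data_ptr() dedup counts it once:

import torch

a = torch.nn.Linear(4, 4, bias=False)
b = torch.nn.Linear(4, 4, bias=False)
b.weight.data = a.weight.data   # two Parameter objects sharing one storage
model = torch.nn.Sequential(a, b)

print(sum(p.numel() for p in model.parameters()))                                  # 32 - storage counted twice
print(sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values()))   # 16 - deduplicated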

@jtboing commented Nov 25, 2021

Hello. Sorry this hasn't been done sooner; I am trying to get through it now. I am looking for the Meg-DS startup script/process. Can you point me to which script/process initiates the framework init?

@stas00 (Contributor, Author) commented Nov 25, 2021

We have already started sorting it out here: #204 (as a side effect of another need).

stas00 linked a pull request (#204) on Nov 25, 2021 that will close this issue