
fix meta device initialization for very large models #54

Closed
wants to merge 1 commit into from

Conversation

mayank31398

I am trying to replicate the current code in my training codebase.
It seems to get stuck.
Can you see if it's the same for you?
I am also opening this PR with a fix that works for me.

@lchu-ibm
Contributor

@mayank31398

In short, we currently need a post-init to match a true init, which is done with reset_parameters(). The approach you proposed here (which is actually what we did before) would require that post-init to run after the FSDP wrapping call, and we haven't had the bandwidth to address that. See #15

For details, see #6
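The meta-device pattern plus the reset_parameters() post-init mentioned above can be sketched roughly as follows. This is a minimal illustration with a toy nn.Sequential, not the FMS code; it assumes PyTorch 2.0+, where torch.device("meta") works as a constructor context manager:

```python
import torch
import torch.nn as nn

# Build the model on the meta device: parameters carry only shape/dtype,
# so no real memory is allocated even for a very large model.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(128, 128), nn.Linear(128, 128))
assert model[0].weight.is_meta

# Post-init: materialize real (uninitialized) storage, then call each
# module's own reset_parameters() so the weights match a true __init__.
model.to_empty(device="cpu")
for module in model.modules():
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()

assert not model[0].weight.is_meta
```

The key invariant is the one @lchu-ibm describes: after to_empty() the storage is uninitialized garbage, so the post-init must reproduce exactly what a normal init would have produced, which is why it leans on each module's reset_parameters().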

@lchu-ibm
Contributor

> It seems to get stuck.

For a large model like 70B, this will take a decent amount of time (10-20 minutes, or even a little more), but it should work.

In the future, we will revert to the old implementation (the one you proposed here) once the above issue is fixed.

@mayank31398
Author

@lchu-ibm hmm, weird. I am not seeing #15 on my end, but I am not using FMS, so that could be a factor; maybe there is some difference in the modeling code?
I didn't understand #6, though. I am not familiar with torch's TP implementation.
I can try my own TP implementation with FSDP.
Feel free to close this issue though :)

@lchu-ibm
Contributor

@mayank31398

Yes, it is a little complicated. But in short:

What you proposed is definitely correct, and we used it that way in the past. But because of a certain issue we have, we need to make sure at least one full, correct copy of the model is present before the FSDP call, so we use the rank == 0 trick. This reduces the CPU model copies from world_size copies to only one, but it is still less efficient than meta-device init on all ranks.

Once that issue is fixed, we will revert to the old implementation (the one you have here).
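The rank == 0 trick described above can be sketched like this. It is a hypothetical wiring with a toy nn.Linear standing in for the real model; in the real setup the FSDP wrapper would be passed sync_module_states=True so rank 0's full copy is broadcast to the other ranks during wrapping:

```python
import torch
import torch.nn as nn

def build_model(rank: int) -> nn.Module:
    # Only rank 0 builds the one full, correctly initialized CPU copy;
    # every other rank builds a shape-only skeleton on the meta device.
    if rank == 0:
        return nn.Linear(64, 64)
    with torch.device("meta"):
        return nn.Linear(64, 64)

# Under FSDP (illustrative only; requires torch.distributed to be set up):
#   model = FSDP(build_model(rank), sync_module_states=True, ...)
# sync_module_states=True broadcasts rank 0's weights while wrapping, so
# only one CPU copy exists instead of world_size copies.

rank0_model = build_model(0)
other_model = build_model(1)
assert not rank0_model.weight.is_meta
assert other_model.weight.is_meta
```

This is the trade-off @lchu-ibm mentions: one full CPU copy (rank 0) instead of world_size copies, but still more than the zero-copy meta-device-on-all-ranks approach proposed in this PR.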

@mayank31398
Author

Makes sense.
Closing the issue.

@lchu-ibm
Contributor

@mayank31398 this should happen soon: #64
