fsdp with custom process groups #2006
Conversation
This also allows the user to specify a per-module sharding strategy as well?
#2006 (comment)
LGTM, will approve when we add docs
I'm going to hold this PR until we get multiple approval here just because it looks scary and might break release, so we should be extra careful. Will lift once we have sign-offs from all parties. This is basically in lieu of codeowners for this part of the repo, which isn't configured.
I think this looks good to me; I left a few minor comments. Could you also test that a mosaicgpt-125m run completes and the loss curve looks reasonable? (i.e., make sure this PR doesn't break stuff unrelated to it)
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
@bcui19 added docs. @dakinggg made the suggested updates; suggested runs from different branches are here (using the tutel_moe branch in the examples repo since it points to this branch).
LGTM
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
LGTM!
LGTM.
What does this PR do?
This PR enables FSDP `_auto_wrap` to accept and propagate custom arguments; most notably, it enables the user to propagate distributed process_groups.
Custom process_groups let users instantiate FSDP modules which shard / synchronize parameters over only a subset of accelerators. This is useful for MoE expert layers and TensorParallel layers.
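To make the idea concrete, here is a minimal sketch in plain PyTorch (not the exact Composer config surface added by this PR) of what wrapping a module with a custom process_group means: ranks are partitioned into subgroups, and an FSDP-wrapped expert shards and synchronizes its parameters only within its subgroup. The 8-GPUs-per-node value and the per-node grouping are assumptions for illustration.

```python
# Sketch only: shard/sync an "expert" layer within each node's ranks instead
# of the default (global) process group.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend='nccl')
rank, world_size = dist.get_rank(), dist.get_world_size()

local_world_size = 8  # assumption: 8 GPUs per node
torch.cuda.set_device(rank % local_world_size)

# new_group is a collective: every rank must call it for every subgroup,
# even ranks that are not members; keep the group containing this rank.
expert_pg = None
for node in range(world_size // local_world_size):
    ranks = list(range(node * local_world_size, (node + 1) * local_world_size))
    pg = dist.new_group(ranks=ranks)
    if rank in ranks:
        expert_pg = pg

# With a custom process_group, the expert's parameters are sharded and its
# gradients are synced only across this node's ranks, which is the behavior
# MoE expert layers (and similarly TP layers) need.
expert = torch.nn.Linear(1024, 4096).cuda()
sharded_expert = FSDP(expert, process_group=expert_pg)
```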
What issue(s) does this change relate to?
Mixture of Experts (MoE)
https://mosaicml.atlassian.net/browse/RESEARCH-351
https://mosaicml.atlassian.net/browse/CO-1716
https://mosaicml.atlassian.net/browse/CO-1715
https://mosaicml.atlassian.net/browse/CO-1714
https://mosaicml.atlassian.net/browse/CO-1712
TensorParallel (TP)
https://mosaicml.atlassian.net/browse/CO-1635
https://mosaicml.atlassian.net/browse/RESEARCH-442
In general this doesn't fix a specific issue, but it enables a feature that TP and MoE models require.
mosaicml/examples#180
also see: https://github.com/mosaicml/tutel/pull/1
btw I ran this "test" on 16 GPUs with different process_group configurations to validate that the different configurations run. For the MoE setup with pg 'self' or 'set1', I do test that the different configurations result in the expert weights not being synced. I do not check whether params / grads are synced appropriately in the TP setting; since TP params are sharded, the params shouldn't be equal, so I'm not sure what to check. The fact that the params are gathered, have the appropriate shape, and the model runs might be test enough.
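For readers wondering what the MoE check could look like, here is a hedged sketch: verify that expert weights are NOT synced across different process groups by comparing a scalar fingerprint of each rank's expert parameters. The `group_id` argument (which expert group this rank belongs to) and the fingerprint approach are illustrative assumptions, not the actual test code from this PR.

```python
# Sketch of a "weights differ across groups" check for FSDP-wrapped experts.
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def expert_weights_differ_across_groups(fsdp_expert: FSDP, group_id: int) -> bool:
    # Materialize the full (unsharded) expert weights on this rank and reduce
    # them to a single scalar fingerprint.
    with FSDP.summon_full_params(fsdp_expert):
        fingerprint = sum(p.detach().float().sum().item()
                          for p in fsdp_expert.parameters())

    # Gather (group_id, fingerprint) pairs from every rank.
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    dist.all_gather_object(gathered, (group_id, fingerprint))

    # After a few training steps, ranks in different groups should have taken
    # different updates, so their fingerprints should differ.
    for gi, fi in gathered:
        for gj, fj in gathered:
            if gi != gj and abs(fi - fj) < 1e-6:
                return False
    return True
```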
Before submitting
- Did you run pre-commit on your change? (see the pre-commit section of prerequisites)
- Tests: this PR enables "super-users" to use advanced features and defaults to previous behavior. All former tests passing is test enough.