
fsdp with custom process groups #2006

Merged: 36 commits into mosaicml:dev on Mar 1, 2023

Conversation

vchiley
Contributor

@vchiley vchiley commented Feb 26, 2023

What does this PR do?

This PR enables FSDP `_auto_wrap` to accept and propagate custom arguments; most notably, it lets the user propagate distributed process groups.
Custom process groups enable users to instantiate FSDP modules which shard / synchronize parameters over only a subset of accelerators. This is useful for MoE expert layers and TensorParallel layers.
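For reference, this is roughly what the feature boils down to at the PyTorch level (a minimal sketch using plain torch FSDP, not Composer's config interface; the 16-GPU / two-group split is just an example):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend='nccl')
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# All ranks must create every group collectively, then pick the one they belong to.
group_a = dist.new_group(ranks=list(range(0, 8)))
group_b = dist.new_group(ranks=list(range(8, 16)))
expert_group = group_a if rank < 8 else group_b

# FSDP shards / synchronizes this module only over `expert_group`;
# ranks in the other group hold and update their own independent expert.
expert = torch.nn.Linear(1024, 4096).cuda()
fsdp_expert = FSDP(expert, process_group=expert_group)
```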

What issue(s) does this change relate to?

Mixture of Experts (MoE)
https://mosaicml.atlassian.net/browse/RESEARCH-351
https://mosaicml.atlassian.net/browse/CO-1716
https://mosaicml.atlassian.net/browse/CO-1715
https://mosaicml.atlassian.net/browse/CO-1714
https://mosaicml.atlassian.net/browse/CO-1712

TensorParallel (TP)
https://mosaicml.atlassian.net/browse/CO-1635
https://mosaicml.atlassian.net/browse/RESEARCH-442

In general this doesn't fix an issue; it enables a feature which makes TP and MoE models possible:
mosaicml/examples#180
also see: https://github.com/mosaicml/tutel/pull/1

btw I ran this "test" on 16 GPUs with different process_group configurations to validate that the different configurations run. For the MoE setup with process group 'self' or 'set1', I do test that the different configurations leave the expert weights un-synced (a rough sketch of such a check is below). I do not check whether params / grads are appropriately synced in the TP setting; since TP params are sharded, the params shouldn't be equal, so I'm not sure what to check. The fact that the params are gathered, have the appropriate shape, and run might be test enough.
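As a rough illustration (a hypothetical helper, not the actual test code in this PR), a check that expert weights are not kept in sync could gather the fully materialized parameter from every rank and verify that at least two copies differ:

```python
import torch
import torch.distributed as dist

def experts_are_unsynced(param: torch.Tensor) -> bool:
    """Gather a fully materialized parameter from every rank and return True
    if any rank's copy differs from rank 0's, i.e. the experts are not synced."""
    world_size = dist.get_world_size()
    flat = param.detach().flatten().clone()
    gathered = [torch.zeros_like(flat) for _ in range(world_size)]
    dist.all_gather(gathered, flat)
    return any(not torch.equal(gathered[0], other) for other in gathered[1:])
```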

Before submitting

  • Have you read the contributor guidelines?
  • Did you update any related docs and document your change?
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

Tests: this PR enables "super-users" to use advanced features and defaults to the previous behavior. All existing tests passing is test enough.

@vchiley vchiley self-assigned this Feb 26, 2023
@dskhudia dskhudia self-requested a review February 26, 2023 18:31
@dskhudia
Contributor

Does this also allow the user to specify a per-module sharding strategy?

@vchiley
Contributor Author

vchiley commented Feb 27, 2023

#2006 (comment)
Yes, this enables the user to pass custom args to FSDP for every FSDP-wrapped module; we include some of the MosaicML tooling for easily setting the configs (see the sketch after this comment).
cc @dskhudia
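To sketch the shape of the per-module customization (the hook name `fsdp_wrap_fn` and the module classes here are illustrative, not necessarily the exact Composer API): a callable inspects each submodule and returns either a bool (wrap with the default FSDP settings) or a dict of per-module FSDP kwargs, such as a process group.

```python
import torch.nn as nn

class ExpertLayer(nn.Linear):
    """Stand-in for an MoE expert layer (hypothetical class for illustration)."""

def fsdp_wrap_fn(module: nn.Module):
    # Dict -> wrap with these FSDP kwargs; True -> wrap with defaults; False -> don't wrap.
    if isinstance(module, ExpertLayer):
        # 'self' is one of the process-group settings exercised in the test above:
        # each rank keeps (and updates) its own un-synced copy of the expert.
        return {'process_group': 'self'}
    if isinstance(module, nn.TransformerEncoderLayer):
        return True
    return False
```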

@bcui19
Contributor

bcui19 commented Feb 27, 2023

LGTM, will approve when we add docs

Contributor

@mvpatel2000 mvpatel2000 left a comment


I'm going to hold this PR until we get multiple approvals here, just because it looks scary and might break the release, so we should be extra careful. I'll lift the hold once we have sign-offs from all parties. This is basically in lieu of codeowners for this part of the repo, which isn't configured.

Requiring approvals from @bcui19 @dakinggg and myself.

Contributor

@dakinggg dakinggg left a comment


This looks good to me; I left a few minor comments. Could you also test that a mosaicgpt-125m runs and the loss curve looks reasonable? (i.e., make sure this PR doesn't break stuff unrelated to this PR)

Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
@vchiley vchiley force-pushed the vchil/fsdp_process_group branch 3 times, most recently from 58f210b to a348607 Compare February 28, 2023 00:58
@vchiley
Contributor Author

vchiley commented Feb 28, 2023

@bcui19 added docs

@dakinggg made the suggested updates; suggested runs from the different branches are here (using the tutel_moe branch in the examples repo since it points to this branch).
Update: those lines are so similar that I had to go verify that they were running from the correct branches 😅. We're good.
[Screenshot 2023-02-27: loss curves from the two branches, nearly identical]
Are there two loss curves in the room with us now? 👻

Contributor

@dakinggg dakinggg left a comment


LGTM

vchiley and others added 3 commits February 27, 2023 20:20
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
Contributor

@bcui19 bcui19 left a comment


LGTM!

Contributor

@mvpatel2000 mvpatel2000 left a comment


LGTM.

@vchiley vchiley enabled auto-merge (squash) February 28, 2023 19:04
@mvpatel2000 mvpatel2000 enabled auto-merge (squash) February 28, 2023 20:00
@mvpatel2000 mvpatel2000 merged commit 964bb05 into mosaicml:dev Mar 1, 2023
@vchiley vchiley mentioned this pull request Jul 29, 2023