fsdp with custom process groups #2006
Conversation
This also allows the user to specify a per-module sharding strategy as well?
#2006 (comment)
LGTM, will approve when we add docs
I'm going to hold this PR until we get multiple approval here just because it looks scary and might break release, so we should be extra careful. Will lift once we have sign-offs from all parties. This is basically in lieu of codeowners for this part of the repo, which isn't configured.
I think this looks good to me; I left a few minor comments. Could you also test that a mosaicgpt-125m run completes and the loss curve looks reasonable? (i.e., make sure this PR doesn't break stuff unrelated to it)
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
@bcui19 added docs. @dakinggg made the suggested updates; suggested runs from different branches are here (using the tutel_moe branch in the examples repo since it points to this branch).
LGTM
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
LGTM!
LGTM.
What does this PR do?
This PR enables FSDP `_auto_wrap` to accept and propagate custom arguments; most notably, it enables the user to propagate distributed process_groups.
Custom process_groups let users instantiate FSDP modules which shard / synchronize parameters over only a subset of accelerators. This is useful for MoE expert layers and TensorParallel layers.
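To make the idea concrete, here is a minimal sketch in plain PyTorch (not the exact Composer config surface added by this PR) of what wrapping a module with a custom process_group means: ranks are partitioned into subgroups, and an FSDP-wrapped expert shards and synchronizes its parameters only within its subgroup. The 8-GPUs-per-node value and the per-node grouping are assumptions for illustration.

```python
# Sketch only: shard/sync an "expert" layer within each node's ranks instead
# of the default (global) process group.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend='nccl')
rank, world_size = dist.get_rank(), dist.get_world_size()

local_world_size = 8  # assumption: 8 GPUs per node
torch.cuda.set_device(rank % local_world_size)

# new_group is a collective: every rank must call it for every subgroup,
# even ranks that are not members; keep the group containing this rank.
expert_pg = None
for node in range(world_size // local_world_size):
    ranks = list(range(node * local_world_size, (node + 1) * local_world_size))
    pg = dist.new_group(ranks=ranks)
    if rank in ranks:
        expert_pg = pg

# With a custom process_group, the expert's parameters are sharded and its
# gradients are synced only across this node's ranks, which is the behavior
# MoE expert layers (and similarly TP layers) need.
expert = torch.nn.Linear(1024, 4096).cuda()
sharded_expert = FSDP(expert, process_group=expert_pg)
```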
What issue(s) does this change relate to?
Mixture of Experts (MoE)
https://mosaicml.atlassian.net/browse/RESEARCH-351
https://mosaicml.atlassian.net/browse/CO-1716
https://mosaicml.atlassian.net/browse/CO-1715
https://mosaicml.atlassian.net/browse/CO-1714
https://mosaicml.atlassian.net/browse/CO-1712
TensorParallel (TP)
https://mosaicml.atlassian.net/browse/CO-1635
https://mosaicml.atlassian.net/browse/RESEARCH-442
In general this doesn't fix a specific issue, but it enables a feature that TP and MoE models require.
mosaicml/examples#180
also see: https://github.com/mosaicml/tutel/pull/1
btw I ran this "test" on 16 GPUs with different process_group configurations to validate that the different configurations run. For the MoE setup with pg 'self' or 'set1', I do test that the different configurations result in the expert weights not being synced. I do not check whether params / grads are synced appropriately in the TP setting; since TP params are sharded, the params shouldn't be equal, so I'm not sure what to check. The fact that the params are gathered, have the appropriate shape, and the model runs might be test enough.
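For readers wondering what the MoE check could look like, here is a hedged sketch: verify that expert weights are NOT synced across different process groups by comparing a scalar fingerprint of each rank's expert parameters. The `group_id` argument (which expert group this rank belongs to) and the fingerprint approach are illustrative assumptions, not the actual test code from this PR.

```python
# Sketch of a "weights differ across groups" check for FSDP-wrapped experts.
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def expert_weights_differ_across_groups(fsdp_expert: FSDP, group_id: int) -> bool:
    # Materialize the full (unsharded) expert weights on this rank and reduce
    # them to a single scalar fingerprint.
    with FSDP.summon_full_params(fsdp_expert):
        fingerprint = sum(p.detach().float().sum().item()
                          for p in fsdp_expert.parameters())

    # Gather (group_id, fingerprint) pairs from every rank.
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    dist.all_gather_object(gathered, (group_id, fingerprint))

    # After a few training steps, ranks in different groups should have taken
    # different updates, so their fingerprints should differ.
    for gi, fi in gathered:
        for gj, fj in gathered:
            if gi != gj and abs(fi - fj) < 1e-6:
                return False
    return True
```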
Before submitting
- Did you run pre-commit on your change? (see the pre-commit section of prerequisites)
- Tests: this PR enables "super-users" to use advanced features and defaults to previous behavior. All former tests passing is test enough.