
[Question] FSDP vs Deepspeed ZeRO3 / ZeRO++ #66

Closed
jeromeku opened this issue Sep 1, 2024 · 2 comments

jeromeku commented Sep 1, 2024

@stas00

Many thanks for this invaluable resource and your generosity in sharing your knowledge.

Was hoping you could lend some insight on FSDP vs DeepSpeed ZeRO-3:

  1. Partitioning granularity:

    • When using ZeRO-3, do you know if there is an equivalent of torch FSDP's auto-wrap policy? That policy lets users specify the bounds of each gathered unit, i.e., one can specify that transformer blocks are treated as a single unit so that during the forward / backward passes an entire transformer block is gathered at a time (see the FSDP sketch after this list).
    • Reading the DeepSpeed source partition_parameters.py, my understanding is that each parameter is partitioned into a ds_tensor which represents each GPU's "horizontal" slice of the param. What determines how many of these params are gathered at a time "vertically"?
      • E.g., if my model has 4 layers, with sum(layer1.params + layer2.params + layer3.params) < layer4.params, how can I gather layer{1,2,3} together as a unit and layer4 as another unit during forward / backward?
  2. HSDP vs ZeRO++ hpZ

    • These are mentioned in your section on ZeRO with multiple replicas. How do these compare in your experience?
    • From the ZeRO++ paper (specifically Figure 4), it seems the model's primary params are still fully partitioned across the entire cluster, and intra-node partitioning (the secondary params) happens only in the backward pass. That differs from HSDP (hybrid shard) as I understand it, where the model is replicated across nodes and partitioned only within each node.
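
For concreteness, here is a minimal sketch of the FSDP auto-wrap behaviour referenced in item 1 (the Block class is a hypothetical stand-in for a real transformer layer, and it assumes torch.distributed has already been initialized, e.g. via torchrun):

```python
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class Block(nn.Module):  # hypothetical stand-in for the model's transformer layer
    def __init__(self, dim=1024):
        super().__init__()
        self.attn = nn.Linear(dim, dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        return self.mlp(self.attn(x))

model = nn.Sequential(*[Block() for _ in range(4)])

# Treat every Block as one FSDP unit: its params are all-gathered together right
# before that block's forward/backward and re-sharded right after.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Block},
)
fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy)
```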

I've posted these same questions to the DeepSpeed repo, but would greatly appreciate your thoughts as well.

jeromeku changed the title [Question] FSDP vs Deepspeed → [Question] FSDP vs Deepspeed ZeRO3 / ZeRO++ on Sep 1, 2024
stas00 (Owner) commented Sep 3, 2024

On the hybrid question I don't have a good answer, as I have only tried it once and currently have no need for it.

re: granularity: DeepSpeed's intention is ease of use - the user shouldn't need to mess with low-level details specific to each model. It determines which weights are needed for the next forward and prefetches them. It uses the stage3_prefetch_bucket_size setting to control how much to prefetch, so you can tune your setup to be network-efficient (a low setting means many less efficient collective calls). It then uses stage3_param_persistence_threshold to keep some smaller params unsharded. So if you set stage3_prefetch_bucket_size to the size of a transformer block, you will get the same outcome as FSDP's per-block wrapping.

In other words, DeepSpeed slices the performance optimization differently: it has a buffer-centric view rather than a layer-centric view.
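
To illustrate, both knobs live under zero_optimization in the ds_config; here is a minimal sketch (the numbers are placeholders to show the mechanism, not recommendations):

```python
# Minimal ZeRO-3 config sketch; values are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # number of elements to prefetch for upcoming forwards/backwards; raising it
        # towards the size of one transformer block approximates FSDP's per-block gathering
        "stage3_prefetch_bucket_size": 500_000_000,
        # params with fewer elements than this stay unsharded (persisted) on each GPU
        "stage3_param_persistence_threshold": 100_000,
    },
}

# then passed to DeepSpeed at startup, e.g.:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```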

jeromeku (Author) commented Sep 7, 2024

@stas00

Many thanks for taking the time to respond.

Regarding partitioning granularity, I just discovered that DeepSpeed introduced a way to group parameters at the module level -- see here for discussion.
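
If I'm reading it right, the relevant API looks like deepspeed.utils.set_z3_leaf_modules (treat the exact name and signature here as my assumption); a rough sketch:

```python
import torch.nn as nn
from deepspeed.utils import set_z3_leaf_modules  # assumed import location

class MyTransformerBlock(nn.Module):  # hypothetical stand-in for the real block class
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

model = nn.Sequential(*[MyTransformerBlock() for _ in range(4)])

# Mark the block class as a ZeRO-3 "leaf": parameters under each instance are then
# fetched and released as one unit rather than parameter by parameter.
set_z3_leaf_modules(model, [MyTransformerBlock])
```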

@jeromeku jeromeku closed this as completed Sep 7, 2024