
[Question] FSDP vs Deepspeed ZeRO3 / ZeRO++ #66

Closed
jeromeku opened this issue Sep 1, 2024 · 2 comments

jeromeku commented Sep 1, 2024

@stas00

Many thanks for this invaluable resource and your generosity in sharing your knowledge.

Was hoping you could lend some insight on FSDP vs DeepSpeed ZeRO-3:

  1. Partitioning granularity:

    • When using ZeRO-3, do you know if there is an equivalent of torch FSDP's auto-wrap policy? That policy lets users specify the bounds of each gathered unit, i.e., one can specify that transformer blocks are treated as a single unit so that during the forward / backward passes an entire transformer block is gathered at a time (see the FSDP sketch after this list).
    • Reading the DeepSpeed source partition_parameters.py, my understanding is that each parameter is partitioned into a ds_tensor which represents each GPU's "horizontal" slice of the param. What determines how many of these params are gathered at a time "vertically"?
      • E.g., if my model has 4 layers, with sum(layer1.params + layer2.params + layer3.params) < layer4.params, how can I gather layer{1,2,3} together as a unit and layer4 as another unit during forward / backward?
  2. HSDP vs ZeRO++ hpZ

    • These are mentioned in your section on ZeRO with multiple replicas. How do these compare in your experience?
    • From the ZeRO++ paper (specifically Figure 4), it seems the model's primary params are still fully partitioned across the entire cluster, and intra-node partitioning (the secondary params) happens only in the backward pass. That differs from HSDP (hybrid shard) as I understand it, where the model is replicated across nodes and partitioned only within each node.
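
For concreteness, here is a minimal sketch of the FSDP auto-wrap behaviour referenced in item 1 (the Block class is a hypothetical stand-in for a real transformer layer, and it assumes torch.distributed has already been initialized, e.g. via torchrun):

```python
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class Block(nn.Module):  # hypothetical stand-in for the model's transformer layer
    def __init__(self, dim=1024):
        super().__init__()
        self.attn = nn.Linear(dim, dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        return self.mlp(self.attn(x))

model = nn.Sequential(*[Block() for _ in range(4)])

# Treat every Block as one FSDP unit: its params are all-gathered together right
# before that block's forward/backward and re-sharded right after.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={Block},
)
fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy)
```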

I've posted these same questions to the DeepSpeed repo, but would greatly appreciate your thoughts as well.

jeromeku changed the title [Question] FSDP vs Deepspeed → [Question] FSDP vs Deepspeed ZeRO3 / ZeRO++ on Sep 1, 2024
stas00 (Owner) commented Sep 3, 2024

On the hybrid question I don't have a good answer, as I have only tried it once and currently have no need for it.

re: granularity: DeepSpeed's intention is ease of use - the user shouldn't need to mess with low-level details specific to each model. It determines which weights are needed for the next forward and prefetches them. It uses the stage3_prefetch_bucket_size setting to control how much to prefetch, so you can tune your setup to be network-efficient (a low setting means many less efficient collective calls). It then uses stage3_param_persistence_threshold to keep some smaller params unsharded. So if you set stage3_prefetch_bucket_size to the size of a transformer block, you will get the same outcome as FSDP's per-block wrapping.

In other words, DeepSpeed slices the performance optimization differently: it has a buffer-centric view rather than a layer-centric view.
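
To illustrate, both knobs live under zero_optimization in the ds_config; here is a minimal sketch (the numbers are placeholders to show the mechanism, not recommendations):

```python
# Minimal ZeRO-3 config sketch; values are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # number of elements to prefetch for upcoming forwards/backwards; raising it
        # towards the size of one transformer block approximates FSDP's per-block gathering
        "stage3_prefetch_bucket_size": 500_000_000,
        # params with fewer elements than this stay unsharded (persisted) on each GPU
        "stage3_param_persistence_threshold": 100_000,
    },
}

# then passed to DeepSpeed at startup, e.g.:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```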

jeromeku (Author) commented Sep 7, 2024

@stas00

Many thanks for taking the time to respond.

Regarding partitioning granularity, I just discovered that DeepSpeed introduced a way to group parameters at the module level -- see here for discussion.
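
If I'm reading it right, the relevant API looks like deepspeed.utils.set_z3_leaf_modules (treat the exact name and signature here as my assumption); a rough sketch:

```python
import torch.nn as nn
from deepspeed.utils import set_z3_leaf_modules  # assumed import location

class MyTransformerBlock(nn.Module):  # hypothetical stand-in for the real block class
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

model = nn.Sequential(*[MyTransformerBlock() for _ in range(4)])

# Mark the block class as a ZeRO-3 "leaf": parameters under each instance are then
# fetched and released as one unit rather than parameter by parameter.
set_z3_leaf_modules(model, [MyTransformerBlock])
```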

@jeromeku jeromeku closed this as completed Sep 7, 2024