Many thanks for this invaluable resource and your generosity in sharing your knowledge.
I was hoping you could lend some insight on FSDP vs DeepSpeed ZeRO-3:
Partitioning granularity:
When using ZeRO-3, do you know if there is an equivalent of torch FSDP's auto-wrap policy? This policy lets users specify the bounds of each gathered unit, i.e., one can specify that transformer blocks are treated as a single unit so that during the forward / backward passes an entire transformer block is gathered at a time.
Reading the DeepSpeed source (partition_parameters.py), my understanding is that each parameter is partitioned into a ds_tensor which represents each GPU's "horizontal" slice of the param. What determines how many of these params are gathered at a time "vertically"?
E.g., if my model has 4 layers, with sum(layer1.params + layer2.params + layer3.params) < layer4.params, how can I gather layer{1,2,3} together as a unit and layer4 as another unit during forward / backward?
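For reference, a minimal sketch of the FSDP auto-wrap mechanism described above; `MyTransformerBlock` is a hypothetical stand-in for whatever block class the model actually uses:

```python
# Sketch of FSDP's transformer auto-wrap policy (the mechanism referenced above).
# `MyTransformerBlock` is a hypothetical stand-in for the model's real block class.
import functools

import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class MyTransformerBlock(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.fc1 = nn.Linear(dim, 4 * dim)
        self.fc2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


# Treat every MyTransformerBlock as one FSDP unit: its parameters are
# all-gathered together for forward/backward and re-sharded afterwards.
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MyTransformerBlock},
)

# Requires an initialized process group (torch.distributed.init_process_group):
# model = FSDP(model, auto_wrap_policy=auto_wrap_policy)
```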
HSDP vs ZeRO++ hpZ
These are mentioned in your section on ZeRO with multiple replicas. How do these compare in your experience?
From the ZeRO++ paper, specifically Figure 4, it seems the model (primary params) is still fully partitioned across the entire cluster, and intra-node partitioning (secondary params) happens only in the backward pass. This differs from HSDP (Hybrid Shard) as I understand it, where the model is replicated across nodes and partitioned only within each node.
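For context, a sketch of how each of the two hybrid modes is typically enabled (the keys and arguments below are my assumptions and may differ across versions):

```python
# Sketch only; keys/arguments are assumptions and may differ across versions.

# FSDP HSDP (Hybrid Shard): shard within a node, replicate across nodes.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD, auto_wrap_policy=...)

# DeepSpeed ZeRO++ hpZ: primary params stay partitioned across the whole cluster;
# a secondary intra-node partition serves the backward all-gathers.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # typically set to the number of GPUs per node
        "zero_hpz_partition_size": 8,
    },
}
```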
I've posted these same questions to the DeepSpeed repo, but would greatly appreciate your thoughts as well.
jeromeku changed the title from "[Question] FSDP vs Deepspeed" to "[Question] FSDP vs Deepspeed ZeRO3 / ZeRO++" on Sep 1, 2024
Re: the hybrid question: I don't have a good understanding of it, as I have only tried it once and currently have no need for it.
re: granularity: since DeepSpeed's intention is ease of use, the user doesn't need to mess with low-level details specific to each model. It determines which weights are needed for the next forward and prefetches them. It uses the stage3_prefetch_bucket_size setting to control how much to prefetch, so that you can tune your setup to be network-efficient (a low setting would mean lots of less efficient collective trips). Then it uses stage3_param_persistence_threshold to keep some smaller params unsharded. So if you set stage3_prefetch_bucket_size to the size of a transformer block, you will get the same outcome as FSDP's auto-wrap policy.
In other words, DeepSpeed slices the performance optimization differently: it takes a buffer-centric view rather than a layer-centric view.
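For illustration, a minimal ZeRO-3 config sketch with the knobs mentioned above (the numbers are placeholders, not tuned recommendations):

```python
# Illustrative ZeRO-3 config fragment; values are placeholders, not recommendations.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # roughly how much parameter data to prefetch ahead of the upcoming
        # forward/backward; setting it to ~ the size of a transformer block
        # approximates FSDP's per-block gathering
        "stage3_prefetch_bucket_size": 5e7,
        # params smaller than this (in elements) are kept unpartitioned
        "stage3_param_persistence_threshold": 1e5,
        # upper bound on how many full parameters may be live at once
        "stage3_max_live_parameters": 1e9,
    },
}
```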