How would you shard a scalar weight using ZeRO Stage 3? #6986

ggoggam · 2025-01-31T07:12:03Z

I need a scalar weight tensor during training. The model is too large to train without sharding, so I would like to use ZeRO Stage 3 to shard the model weights. However, that does not seem possible, because a scalar weight cannot be split across multiple GPUs.

How should one train a model with a scalar weight in this case?

One approach I considered is to store a vector of shape (1 x number of GPUs) and mean-reduce it before use. If there are better or cleverer tricks, please let me know!
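To make the vector idea concrete, here is a minimal pure-Python sketch of the arithmetic only. It does not call DeepSpeed or PyTorch. The helper names (`init_vector`, `effective_scalar`, `sgd_step`) and the numbers are hypothetical, chosen for illustration. In a real model the vector would presumably be an `nn.Parameter` of shape `(num_gpus,)` reduced with `.mean()`, but that mapping is an assumption and not something tested here.

```python
from typing import List

# Illustrative only: emulate a scalar weight stored as one element per GPU.
# All names below are hypothetical helpers, not part of any library.

def init_vector(scalar_init: float, num_gpus: int) -> List[float]:
    # Every element starts at the desired scalar value,
    # so the mean equals that value at initialization.
    return [scalar_init] * num_gpus

def effective_scalar(vec: List[float]) -> float:
    # The value the model actually uses: the mean of the vector.
    return sum(vec) / len(vec)

def sgd_step(vec: List[float], grad_wrt_scalar: float, lr: float) -> List[float]:
    # d(mean)/d(vec[i]) = 1/n, so each element receives grad / n.
    n = len(vec)
    return [v - lr * grad_wrt_scalar / n for v in vec]

num_gpus = 4
vec = init_vector(0.5, num_gpus)
before = effective_scalar(vec)

# One plain SGD step with an assumed upstream gradient of 2.0.
vec = sgd_step(vec, grad_wrt_scalar=2.0, lr=0.1)
after = effective_scalar(vec)

# A true scalar would have moved by lr * grad = 0.2;
# the mean-reduced vector moves by lr * grad / num_gpus = 0.05.
print(before, after)
```

One thing the sketch makes visible: under plain SGD, the mean-reduced value moves by `lr * grad / num_gpus` per step, so the effective learning rate of the scalar shrinks by a factor equal to the number of GPUs. That may need compensating for, for example with a separate learning rate for this parameter. Adaptive optimizers normalize their updates and may behave differently.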
