Fix moe cpu offload (microsoft#5220)
The MoE parameter gradient norms do not need to be averaged when they are created on CPU and only 1-way data parallelism is used. However, when there is data parallelism over the MoE parameters together with CPU offload, I move the tensor back to the GPU so the all-reduce can compute the average there, and then return it to its original device.

This PR addresses microsoft#5203
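
For reference, a minimal self-contained sketch of the same idea (the function name average_expert_grad_norm and its arguments are illustrative, not the DeepSpeed API): collectives on GPU-backed process groups such as NCCL only accept accelerator tensors, so a norm that lives on CPU because of optimizer offload is staged on the GPU for the all-reduce and moved back afterwards.

import torch
import torch.distributed as dist
from deepspeed.accelerator import get_accelerator

def average_expert_grad_norm(norm: torch.Tensor, group, device: str) -> torch.Tensor:
    # Scale by the expert group's data-parallel world size so that the
    # sum all-reduce below yields the average norm across ranks.
    scaled = norm * 1.0 / dist.get_world_size(group=group)
    if device == 'cpu':
        # NCCL cannot reduce CPU tensors; stage the value on the accelerator.
        scaled = scaled.to(get_accelerator().current_device_name())
    dist.all_reduce(scaled, group=group)
    # Hand the result back on the original device (CPU when offloading).
    return scaled.to(device)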

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
2 people authored and rraminen committed May 9, 2024
1 parent 8efb351 commit 3111f4d
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion deepspeed/runtime/zero/stage_1_and_2.py
@@ -1946,8 +1946,10 @@ def _average_expert_grad_norms(self, norm_groups):
         for i, norm in enumerate(norm_groups):
             if self.is_moe_param_group[i]:
                 scaled_norm_tensor = norm * 1.0 / dist.get_world_size(group=self.real_dp_process_group[i])
+                if self.device == 'cpu':
+                    scaled_norm_tensor = scaled_norm_tensor.to(get_accelerator().current_device_name())
                 dist.all_reduce(scaled_norm_tensor, group=self.real_dp_process_group[i])
-                norm_groups[i] = scaled_norm_tensor
+                norm_groups[i] = scaled_norm_tensor.to(self.device)
 
     def unscale_and_clip_grads(self, grad_groups_flat, total_norm):
         # compute combined scale factor for this group
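
For context, a hedged sketch of a DeepSpeed configuration that reaches this code path: ZeRO stage 2 with optimizer state offloaded to CPU. The values are illustrative only, and the MoE/expert-parallel setup itself happens on the model side (e.g. via deepspeed.moe.layer.MoE), not in this config.

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        # Optimizer state (and the gradient norms built from it) lives on CPU,
        # which is the case the patched _average_expert_grad_norms handles.
        "offload_optimizer": {"device": "cpu", "pin_memory": True}
    }
}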
