[BUG] Deepspeed Crashes when using MoE, Stage 2 Offload with DeepSpeedCPUAdam #5203

Open
KyleMylonakisProtopia opened this issue Feb 27, 2024 · 3 comments
Labels: bug, training


@KyleMylonakisProtopia

Describe the bug
When training a model that contains Mixture of Experts (MoE) layers using ZeRO stage 2 with CPU offload and the DeepSpeedCPUAdam optimizer, the following runtime error is thrown during the parameter update step.

/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/lightning/fabric/wrappers.py:92 in step
    output = self._strategy.optimizer_step(
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/lightning/fabric/strategies/strategy.py:206 in optimizer_step
    return self.precision.optimizer_step(optimizer, **kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/lightning/fabric/plugins/precision/deepspeed.py:100 in optimizer_step
    return optimizer.step(**kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1842 in step
    scaled_global_grad_norm = self.scaled_global_norm()
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1786 in scaled_global_norm
    self._average_expert_grad_norms(norm_groups)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1949 in _average_expert_grad_norms
    dist.all_reduce(scaled_norm_tensor, group=self.real_dp_process_group[i])
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/comm/comm.py:117 in log_wrapper
    return func(*args, **kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/comm/comm.py:496 in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:489 in _fn
    return fn(*args, **kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/comm/torch.py:159 in all_reduce
    return torch.distributed.all_reduce(tensor=tensor, op=op, group=group, async_op=
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:72 in wrapper
    return func(*args, **kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:1992 in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: No backend type associated with device type cpu

When ep_size=1 is used for the expert layers, the call to self._average_expert_grad_norms(norm_groups) is unnecessary, and commenting it out resolves the issue. This is of course not a general solution for ep_size > 1, but in my case it is sufficient to continue my work.
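
For reference, a rough sketch of this workaround as a monkey-patch applied before training starts. This is only an illustration of the idea, not an upstream fix; DeepSpeedZeroOptimizer, _average_expert_grad_norms, and real_dp_process_group are the names visible in the traceback above, and the guard condition is an assumption.

    # Hypothetical monkey-patch of the workaround described above: skip the
    # expert-norm averaging entirely when every expert data-parallel group has
    # a single rank (ep_size=1), so no CPU all-reduce is attempted.
    import deepspeed.comm as dist
    from deepspeed.runtime.zero.stage_1_and_2 import DeepSpeedZeroOptimizer

    _original_average = DeepSpeedZeroOptimizer._average_expert_grad_norms

    def _patched_average(self, norm_groups):
        # real_dp_process_group is the attribute used in the traceback above.
        if all(dist.get_world_size(group=g) == 1 for g in self.real_dp_process_group):
            return  # single-rank groups: nothing to average across ranks
        return _original_average(self, norm_groups)

    DeepSpeedZeroOptimizer._average_expert_grad_norms = _patched_average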

To Reproduce
Steps to reproduce the behavior:

  1. Create a model that contains a mixture of experts (MoE) layer.
  2. Use a DeepSpeed config with ZeRO stage 2 and CPU offload, such as:
     config = {
         "zero_optimization": {
             "stage": 2,
             "offload_optimizer": {"device": "cpu", "pin_memory": True},
             "overlap_comm": True,
             "contiguous_gradients": True,
             "allgather_bucket_size": 5e8,
             "reduce_bucket_size": 5e8,
             "allgather_partitions": True,
             "reduce_scatter": True,
             # "round_robin_gradients": True,
         }
     }
  3. Use the DeepSpeedCPUAdam optimizer for efficient CPU offload.
  4. Train the model and perform an update step (a standalone reproduction sketch follows this list).
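
For completeness, a minimal standalone reproduction sketch outside of Lightning, under the assumption that a toy MoE layer plus deepspeed.initialize hits the same optimizer-step code path; the model shape, batch size, loss, and fp16 setting are illustrative and not taken from the original run.

    # Repro sketch: MoE layer + ZeRO stage 2 CPU offload + DeepSpeedCPUAdam.
    # Launch with the deepspeed launcher, e.g. `deepspeed --num_gpus=1 repro.py`.
    import torch
    import deepspeed
    from deepspeed.moe.layer import MoE
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    class ToyMoEModel(torch.nn.Module):
        def __init__(self, hidden=64):
            super().__init__()
            self.proj = torch.nn.Linear(hidden, hidden)
            self.moe = MoE(hidden_size=hidden,
                           expert=torch.nn.Linear(hidden, hidden),
                           num_experts=2, ep_size=1, k=1)

        def forward(self, x):
            out, _, _ = self.moe(self.proj(x))  # MoE returns (output, l_aux, exp_counts)
            return out.float().pow(2).mean()

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,   # required by deepspeed.initialize
        "gradient_accumulation_steps": 1,
        "fp16": {"enabled": True},             # assumption: any reduced-precision mode that
                                               # routes through the ZeRO stage 1/2 optimizer
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
    }

    model = ToyMoEModel()
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-4)
    engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                   optimizer=optimizer,
                                                   config=ds_config)

    x = torch.randn(8, 64, device=engine.device, dtype=torch.half)
    loss = engine(x)
    engine.backward(loss)
    engine.step()  # fails in _average_expert_grad_norms without the fix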

Expected behavior
Model training should proceed without errors being thrown.

ds_report output

[2024-02-27 09:16:20,976] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/torch']
torch version .................... 2.2.0+cu121
deepspeed install path ........... ['/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.13.3, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 125.77 GB

Screenshots
N/A

System info (please complete the following information):

  • OS: Ubuntu 20.04.6 LTS
  • GPU count and types: 4x RTX A6000
  • Python version: 3.11

Launcher context
PyTorch Lightning

Docker context
Bare metal.

Additional context
I have ep_size=1 for my mixture of experts layers, so this bug is entirely avoidable by simply skipping the all-reduce step.

@RezaYazdaniAminabadi
Contributor

Hi @KyleMylonakisProtopia
Please give this PR a try, hopefully it resolves the issue.
Best,
Reza

@KyleMylonakisProtopia
Author

That PR seems to resolve the issue. Thanks for looking at it!

@RezaYazdaniAminabadi
Contributor

@tjruwase, let's please close this and merge the PR :)

github-merge-queue bot pushed a commit that referenced this issue Mar 4, 2024
The MoE parameter gradient norms don't need to be averaged when they are created on CPU and only 1-way data parallelism is used. However, the tensor is now moved back to GPU to compute the average when data parallelism is applied to the MoE parameters together with CPU offload.

This PR addresses #5203

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
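
In code terms, the approach described in the commit message amounts to something like the following sketch; it is a paraphrase of the idea, not the literal upstream diff, and the helper name is made up.

    # Sketch of the fix described above: create the expert-norm tensor on the
    # accelerator device so the all-reduce runs on the GPU backend even when
    # the optimizer state is offloaded to CPU, and skip it for 1-DP training.
    import torch
    import deepspeed.comm as dist
    from deepspeed.accelerator import get_accelerator

    def average_expert_grad_norm(norm, group):
        world_size = dist.get_world_size(group=group)
        if world_size == 1:
            return norm  # nothing to average across a single-rank group
        device = get_accelerator().current_device_name()  # e.g. "cuda:0"
        scaled = torch.tensor(norm / world_size, device=device, dtype=torch.float)
        dist.all_reduce(scaled, group=group)
        return scaled.item()
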
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this issue Mar 11, 2024 (same commit message as above; addresses microsoft#5203).
rraminen pushed a commit to ROCm/DeepSpeed that referenced this issue May 9, 2024 (same commit message as above; addresses microsoft#5203).