[BUG] Deepspeed Crashes when using MoE, Stage 2 Offload with DeepSpeedCPUAdam #5203

Open
KyleMylonakisProtopia opened this issue Feb 27, 2024 · 3 comments
Labels: bug, training


@KyleMylonakisProtopia

Describe the bug
When training a model that contains Mixture of Experts (MoE) layers using ZeRO stage 2 with CPU offload and the DeepSpeedCPUAdam optimizer, the following runtime error is thrown during the parameter update step.

/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/lightning/fabric/wrappers.py:92 in step
    output = self._strategy.optimizer_step(
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/lightning/fabric/strategies/strategy.py:206 in optimizer_step
    return self.precision.optimizer_step(optimizer, **kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/lightning/fabric/plugins/precision/deepspeed.py:100 in optimizer_step
    return optimizer.step(**kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1842 in step
    scaled_global_grad_norm = self.scaled_global_norm()
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1786 in scaled_global_norm
    self._average_expert_grad_norms(norm_groups)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1949 in _average_expert_grad_norms
    dist.all_reduce(scaled_norm_tensor, group=self.real_dp_process_group[i])
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/comm/comm.py:117 in log_wrapper
    return func(*args, **kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/comm/comm.py:496 in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:489 in _fn
    return fn(*args, **kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed/comm/torch.py:159 in all_reduce
    return torch.distributed.all_reduce(tensor=tensor, op=op, group=group, async_op=
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:72 in wrapper
    return func(*args, **kwargs)
/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:1992 in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: No backend type associated with device type cpu

When ep_size=1 is used for the expert layers, the call to self._average_expert_grad_norms(norm_groups) is unnecessary, and commenting it out resolves the issue. This is of course not a general solution for ep_size > 1, but in my case it is sufficient to continue my work.
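
For reference, a rough sketch of this workaround as a monkey-patch applied before training starts. This is only an illustration of the idea, not an upstream fix; DeepSpeedZeroOptimizer, _average_expert_grad_norms, and real_dp_process_group are the names visible in the traceback above, and the guard condition is an assumption.

    # Hypothetical monkey-patch of the workaround described above: skip the
    # expert-norm averaging entirely when every expert data-parallel group has
    # a single rank (ep_size=1), so no CPU all-reduce is attempted.
    import deepspeed.comm as dist
    from deepspeed.runtime.zero.stage_1_and_2 import DeepSpeedZeroOptimizer

    _original_average = DeepSpeedZeroOptimizer._average_expert_grad_norms

    def _patched_average(self, norm_groups):
        # real_dp_process_group is the attribute used in the traceback above.
        if all(dist.get_world_size(group=g) == 1 for g in self.real_dp_process_group):
            return  # single-rank groups: nothing to average across ranks
        return _original_average(self, norm_groups)

    DeepSpeedZeroOptimizer._average_expert_grad_norms = _patched_average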

To Reproduce
Steps to reproduce the behavior:

  1. Create a model that contains a mixture of experts (MoE) layer.
  2. Use a DeepSpeed config with ZeRO stage 2 and CPU offload, such as:
     config = {
         "zero_optimization": {
             "stage": 2,
             "offload_optimizer": {"device": "cpu", "pin_memory": True},
             "overlap_comm": True,
             "contiguous_gradients": True,
             "allgather_bucket_size": 5e8,
             "reduce_bucket_size": 5e8,
             "allgather_partitions": True,
             "reduce_scatter": True,
             # "round_robin_gradients": True,
         }
     }
  3. Use the DeepSpeedCPUAdam optimizer for efficient CPU offload.
  4. Train the model and perform an update step (a standalone reproduction sketch follows this list).
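
For completeness, a minimal standalone reproduction sketch outside of Lightning, under the assumption that a toy MoE layer plus deepspeed.initialize hits the same optimizer-step code path; the model shape, batch size, loss, and fp16 setting are illustrative and not taken from the original run.

    # Repro sketch: MoE layer + ZeRO stage 2 CPU offload + DeepSpeedCPUAdam.
    # Launch with the deepspeed launcher, e.g. `deepspeed --num_gpus=1 repro.py`.
    import torch
    import deepspeed
    from deepspeed.moe.layer import MoE
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    class ToyMoEModel(torch.nn.Module):
        def __init__(self, hidden=64):
            super().__init__()
            self.proj = torch.nn.Linear(hidden, hidden)
            self.moe = MoE(hidden_size=hidden,
                           expert=torch.nn.Linear(hidden, hidden),
                           num_experts=2, ep_size=1, k=1)

        def forward(self, x):
            out, _, _ = self.moe(self.proj(x))  # MoE returns (output, l_aux, exp_counts)
            return out.float().pow(2).mean()

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,   # required by deepspeed.initialize
        "gradient_accumulation_steps": 1,
        "fp16": {"enabled": True},             # assumption: any reduced-precision mode that
                                               # routes through the ZeRO stage 1/2 optimizer
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
    }

    model = ToyMoEModel()
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-4)
    engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                   optimizer=optimizer,
                                                   config=ds_config)

    x = torch.randn(8, 64, device=engine.device, dtype=torch.half)
    loss = engine(x)
    engine.backward(loss)
    engine.step()  # fails in _average_expert_grad_norms without the fix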

Expected behavior
Model training should proceed without errors being thrown.

ds_report output

[2024-02-27 09:16:20,976] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/torch']
torch version .................... 2.2.0+cu121
deepspeed install path ........... ['/home/kyle/.conda/envs/llama2-chat/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.13.3, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 125.77 GB

Screenshots
N/A

System info (please complete the following information):

  • OS: Ubuntu 20.04.6 LTS
  • GPU count and types: 4x RTX A6000
  • Python version: 3.11

Launcher context
PyTorch Lightning

Docker context
Bare metal.

Additional context
I have ep_size=1 for my mixture of experts layers, so this bug is entirely avoidable by simply skipping the all-reduce step.

@RezaYazdaniAminabadi
Contributor

Hi @KyleMylonakisProtopia
Please give this PR a try, hopefully it resolves the issue.
Best,
Reza

@KyleMylonakisProtopia
Author

That PR seems to resolve the issue. Thanks for looking at it!

@RezaYazdaniAminabadi
Contributor

@tjruwase, let's please close this and merge the PR :)

github-merge-queue bot pushed a commit that referenced this issue Mar 4, 2024
The MoE parameter gradient norms don't need to be averaged when they are created on CPU and only 1-way data parallelism is used. However, the tensor is now moved back to GPU to compute the average when data parallelism is applied to the MoE parameters together with CPU offload.

This PR addresses #5203

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
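
In code terms, the approach described in the commit message amounts to something like the following sketch; it is a paraphrase of the idea, not the literal upstream diff, and the helper name is made up.

    # Sketch of the fix described above: create the expert-norm tensor on the
    # accelerator device so the all-reduce runs on the GPU backend even when
    # the optimizer state is offloaded to CPU, and skip it for 1-DP training.
    import torch
    import deepspeed.comm as dist
    from deepspeed.accelerator import get_accelerator

    def average_expert_grad_norm(norm, group):
        world_size = dist.get_world_size(group=group)
        if world_size == 1:
            return norm  # nothing to average across a single-rank group
        device = get_accelerator().current_device_name()  # e.g. "cuda:0"
        scaled = torch.tensor(norm / world_size, device=device, dtype=torch.float)
        dist.all_reduce(scaled, group=group)
        return scaled.item()
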
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this issue Mar 11, 2024 (same commit message as above; addresses microsoft#5203).
rraminen pushed a commit to ROCm/DeepSpeed that referenced this issue May 9, 2024 (same commit message as above; addresses microsoft#5203).