
Add fp16 support of Qwen1.5MoE models (A2.7B) to DeepSpeed-FastGen #5403

Merged · 6 commits into microsoft:master on Aug 1, 2024

Conversation

@ZonePG (Contributor) commented Apr 12, 2024:

This PR adds support for Qwen1.5-MoE-A2.7B models.

Supports microsoft/DeepSpeed-MII#457.

Test Code

For the MII pipeline:

import mii

pipe = mii.pipeline("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
responses = pipe("DeepSpeed is", max_new_tokens=128, do_sample=False)
if pipe.is_rank_0:
    print(responses[0])
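
To reproduce the 8-way sharded FastGen output shown below, the same pipeline script is launched with the deepspeed launcher; a minimal sketch (the script name is hypothetical, and in MII tensor parallelism picks up the launched world size by default):

# Save as pipeline_example.py (hypothetical name), then launch with:
#   deepspeed --num_gpus 8 pipeline_example.py
import mii

pipe = mii.pipeline("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
responses = pipe("DeepSpeed is", max_new_tokens=128, do_sample=False)
if pipe.is_rank_0:  # gathered responses live on rank 0
    print(responses[0])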

For Hugging Face:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
tokenizer = AutoTokenizer.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
model = AutoModelForCausalLM.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True).eval()
print(model)
inputs = tokenizer('DeepSpeed is', return_tensors='pt')
inputs = inputs.to(model.device)
# Greedy decoding (do_sample=False) so the output is comparable to FastGen.
pred = model.generate(**inputs, max_new_tokens=128, do_sample=False, repetition_penalty=1.0)
text = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(text)

Qwen1.5-MoE-A2.7B

Hugging Face output with prompt "DeepSpeed is":

 a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.

DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.

One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the

DeepSpeed-FastGen output with prompt "DeepSpeed is":

 a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.

DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.

One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the

DeepSpeed-FastGen output with prompt "DeepSpeed is" with 8-way sharding (identical to the Hugging Face baseline above):

 a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.

DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.

One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the

Review comment on the shared-expert gating code:

shared_expert_output = self.shared_expert_mlp_2(shared_expert_output, cur_params.shared_moe_mlp_2, b=None)
shared_expert_gate_output = self.shared_expert_gate(hidden_states, cur_params.shared_moe_gate, b=None)[..., :1]
# shared_expert_gate_output shape[-1] is 1
shared_expert_output.mul_(torch.sigmoid(shared_expert_gate_output))
@ZonePG (Contributor, Author) commented Apr 12, 2024:
I am not sure whether using torch.sigmoid directly here will affect performance.
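
For what it's worth, a minimal micro-benchmark sketch of that concern; the batch size, hidden size, and dtype below are assumptions, not values from the model config:

import torch

def bench(fn, iters=1000):
    # Time a CUDA op with events; returns average ms per call.
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Shapes mimic shared_expert_gate_output (batch, 1) scaling a (batch, hidden) tensor.
gate = torch.randn(256, 1, device="cuda", dtype=torch.float16)
out = torch.randn(256, 2048, device="cuda", dtype=torch.float16)
print("sigmoid + mul_:", bench(lambda: out.mul_(torch.sigmoid(gate))), "ms")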

@heiseon commented Jul 5, 2024:

When I use your source code to build DeepSpeed and run the "for mii pipeline" code, the process blocks with no error. How should I identify the problem? I am using a 4090 GPU with transformers 4.41.0.dev0, torch 2.2.1, and CUDA 11.8.
BTW, the transformers code runs fine, just very slowly.

@ZonePG (Contributor, Author) commented Jul 5, 2024:

Hi @heiseon, I just created a new conda environment and built from my DeepSpeed code and the official DeepSpeed-MII source code, and it works without any issues.

Maybe you can delete ~/.cache/torch_extensions/pyxxx_cuxxx and try again.

My path is /data/zonepg/.cache/torch_extensions/py311_cu121.
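
A minimal sketch of that cleanup in Python; the pyxxx_cuxxx directory name depends on your Python and CUDA versions, so the glob below is an assumption:

import pathlib
import shutil

# Remove stale JIT-built torch extensions so DeepSpeed rebuilds them on the next run.
cache = pathlib.Path.home() / ".cache" / "torch_extensions"
for build_dir in cache.glob("py*_cu*"):
    print(f"removing {build_dir}")
    shutil.rmtree(build_dir)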

@heiseon commented Jul 10, 2024:


Deleting ~/.cache/torch_extensions/pyxxx_cuxxx worked for me.

I have another question: when using the Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 quantized version, an error occurs: 'Could not find a mapping for dependency "mlp.experts.18.gate_proj.bias"'. Does this mean the GPTQ-quantized version of the model is not supported?

@ZonePG (Contributor, Author) commented Jul 10, 2024:

Hi @heiseon, quantized Qwen models are not currently supported. Supporting them would likely require a significant effort, so it may not be considered in the short term.
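
For anyone digging into the mapping error above, a hypothetical inspection sketch; the shard file name is an assumption, and GPTQ checkpoints typically store qweight/qzeros/scales tensors that the fp16 parameter mapping does not cover:

from safetensors import safe_open

# List the expert tensors in one checkpoint shard to compare against
# the parameter names FastGen's container expects.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        if "experts.18" in name:
            print(name)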

@HeyangQin HeyangQin enabled auto-merge July 15, 2024 23:43
@loadams loadams disabled auto-merge July 16, 2024 17:26
@loadams loadams requested review from lekurile and removed request for mrwyattii July 16, 2024 17:27
@loadams loadams added this pull request to the merge queue Jul 16, 2024
xslingcn added a commit to xslingcn/DeepSpeed that referenced this pull request Jul 17, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 18, 2024
@loadams loadams merged commit 249c1db into microsoft:master Aug 1, 2024
7 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Aug 22, 2024
Based on PR #5403 (Qwen1.5-MoE) and #5219 (Qwen1.5), this adds support for the Qwen2 series models, including the 0.5B, 1.5B, 7B, 57B-A14B, and 72B models.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>