Support FSDP worker and vLLM Ascend #332

Open · wants to merge 4 commits into main from vllm-0.7-npu

Conversation


@as12138 commented Feb 21, 2025

This PR adds support for the Ascend NPU backend.
Co-authored-by: Chendong98 chendong136@huawei.com
Co-authored-by: zheliuyu 15750543867@163.com
In this PR, we add the capability to detect the NPU device type, and we also add a new script for training on NPU.

The changes are listed below:

  1. examples/grpo_trainer/run_qwen2-7b_npu.sh a new script for training on NPU
  2. examples/grpo_trainer/run_qwen2-7b_npu.sh remove the pinned vLLM version
  3. requirements-npu.txt requirements for NPU
  4. verl/bert_padding.py Adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
  5. verl/single_controller/ray/base.py
  6. verl/third_party/vllm/vllm_spmd/dtensor_weight_loaders.py
  7. verl/trainer/fsdp_sft_trainer.py
  8. verl/utils/flops_counter.py
  9. verl/utils/fsdp_utils.py
  10. verl/workers/actor/dp_actor.py
  11. verl/workers/critic/dp_critic.py
  12. verl/workers/fsdp_workers.py
  13. verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py
  14. verl/workers/sharding_manager/fsdp_vllm.py
  15. verl/utils/device.py get the device type for different devices (see the sketch after this list)
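
For context, here is a minimal sketch of what device-type detection might look like; the function names and logic are illustrative assumptions, not the actual contents of verl/utils/device.py in this PR. It assumes the torch_npu adapter, once imported, exposes torch.npu.is_available().

```python
# Illustrative sketch only -- not the actual verl/utils/device.py from this PR.
import torch


def is_npu_available() -> bool:
    """Return True when the Ascend adapter (torch_npu) is installed and an NPU is visible."""
    try:
        import torch_npu  # noqa: F401  # importing it patches torch with the torch.npu namespace
    except ImportError:
        return False
    return torch.npu.is_available()


def get_device_name() -> str:
    """Return the accelerator backend string: 'cuda' on GPU, 'npu' on Ascend, otherwise 'cpu'."""
    if torch.cuda.is_available():
        return "cuda"
    if is_npu_available():
        return "npu"
    return "cpu"
```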

Here is our roadmap:

  • sft
  • ppo
  • grpo

News

[2025.02.21] The GRPO algorithm is supported for training on NPU with the FSDP backend.

Requirements
We tested this PR on both Ascend NPU and GPU to ensure that the same code runs on different devices. The hardware was 8× Atlas 800T A2 and 8× A100. Other software versions are shown in the following table.

| Software | Version |
| --- | --- |
| transformers | 4.47.1 |
| accelerate | 1.3.0 |
| torch_npu | 2.5.1.rc1 |
| CANN | 8.1.RC1 (not released) |
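
As a quick sanity check (not part of the PR), the Python package versions from the table can be printed with a short snippet like the one below; the CANN toolkit version is omitted because it is not exposed as a Python package.

```python
# Print installed versions of the Python packages listed above (illustrative helper, not from the PR).
import importlib

for pkg in ("transformers", "accelerate", "torch_npu"):
    try:
        module = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(module, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg}: not installed")
```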

About mean error
Due to differences in hardware architecture, we cannot guarantee that the loss on Ascend NPU is exactly the same as on GPU. In our experience, a loss difference of less than 2% is acceptable. If the loss difference is greater than 2%, we will try to fix it. The calculation formula is as follows.
(formula image: loss_comparison)

N represents the number of training steps. For more information, please refer to
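
The formula image is not reproduced above; written out, one plausible reading of the described check (an average relative loss difference over the N training steps, compared against the 2% threshold) is the following, which is an assumption rather than the exact formula from the PR:

```latex
\text{mean error} = \frac{1}{N}\sum_{i=1}^{N}
  \frac{\left|\mathrm{loss}^{\mathrm{NPU}}_{i} - \mathrm{loss}^{\mathrm{GPU}}_{i}\right|}
       {\mathrm{loss}^{\mathrm{GPU}}_{i}} \le 2\%
```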

@as12138 changed the title from "support ASCEND NPU" to "[WIP] support ASCEND NPU" on Feb 21, 2025
@huangk10

Does this PR work on multiple nodes?

@as12138 (Author) commented Feb 21, 2025

> Does this PR work on multiple nodes?

I am currently testing on a single node only and will follow up with multi-node test results.

@as12138 force-pushed the vllm-0.7-npu branch 2 times, most recently from 0afd136 to d496b70 on February 21, 2025 07:59
@as12138 changed the title from "[WIP] support ASCEND NPU" to "Support FSDP worker and vLLM Ascend" on Feb 21, 2025
@as12138 force-pushed the vllm-0.7-npu branch 9 times, most recently from bcdb340 to 8b1b207 on February 22, 2025 02:06
Co-authored-by: Chendong98 <chendong136@huawei.com>