Support FSDP worker and vLLM Ascend #332

Open · wants to merge 4 commits into main from vllm-0.7-npu

Conversation


@as12138 commented Feb 21, 2025

This PR adds support for the Ascend NPU backend.
Co-authored-by: Chendong98 chendong136@huawei.com
Co-authored-by: zheliuyu 15750543867@163.com
In this PR, we add the capability to detect the NPU device type, and we also add a new script for training on NPU.

The changes are listed below:

  1. examples/grpo_trainer/run_qwen2-7b_npu.sh a new script for training on NPU
  2. examples/grpo_trainer/run_qwen2-7b_npu.sh remove the pinned vLLM version
  3. requirements-npu.txt requirements for NPU
  4. verl/bert_padding.py Adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
  5. verl/single_controller/ray/base.py
  6. verl/third_party/vllm/vllm_spmd/dtensor_weight_loaders.py
  7. verl/trainer/fsdp_sft_trainer.py
  8. verl/utils/flops_counter.py
  9. verl/utils/fsdp_utils.py
  10. verl/workers/actor/dp_actor.py
  11. verl/workers/critic/dp_critic.py
  12. verl/workers/fsdp_workers.py
  13. verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py
  14. verl/workers/sharding_manager/fsdp_vllm.py
  15. verl/utils/device.py get the device type for different devices (see the sketch after this list)
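
For context, here is a minimal sketch of what device-type detection might look like; the function names and logic are illustrative assumptions, not the actual contents of verl/utils/device.py in this PR. It assumes the torch_npu adapter, once imported, exposes torch.npu.is_available().

```python
# Illustrative sketch only -- not the actual verl/utils/device.py from this PR.
import torch


def is_npu_available() -> bool:
    """Return True when the Ascend adapter (torch_npu) is installed and an NPU is visible."""
    try:
        import torch_npu  # noqa: F401  # importing it patches torch with the torch.npu namespace
    except ImportError:
        return False
    return torch.npu.is_available()


def get_device_name() -> str:
    """Return the accelerator backend string: 'cuda' on GPU, 'npu' on Ascend, otherwise 'cpu'."""
    if torch.cuda.is_available():
        return "cuda"
    if is_npu_available():
        return "npu"
    return "cpu"
```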

Here is our roadmap:

  • sft
  • ppo
  • grpo

News

[2025.02.21] The GRPO algorithm is supported for training on NPU with the FSDP backend.

Requirements
We tested this PR on both Ascend NPU and GPU to ensure that the same code runs on different devices. The hardware was 8× Atlas 800T A2 and 8× A100. Other software versions are shown in the following table.

| Software | Version |
| --- | --- |
| transformers | 4.47.1 |
| accelerate | 1.3.0 |
| torch_npu | 2.5.1.rc1 |
| CANN | 8.1.RC1 (not released) |
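
As a quick sanity check (not part of the PR), the Python package versions from the table can be printed with a short snippet like the one below; the CANN toolkit version is omitted because it is not exposed as a Python package.

```python
# Print installed versions of the Python packages listed above (illustrative helper, not from the PR).
import importlib

for pkg in ("transformers", "accelerate", "torch_npu"):
    try:
        module = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(module, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg}: not installed")
```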

About mean error
Due to differences in hardware architecture, we cannot guarantee that the loss on Ascend NPU is exactly the same as on GPU. In our experience, a loss difference of less than 2% is acceptable. If the loss difference is greater than 2%, we will try to fix it. The calculation formula is as follows.
(formula image: loss_comparison)

N represents the number of training steps. For more information, please refer to
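
The formula image is not reproduced above; written out, one plausible reading of the described check (an average relative loss difference over the N training steps, compared against the 2% threshold) is the following, which is an assumption rather than the exact formula from the PR:

```latex
\text{mean error} = \frac{1}{N}\sum_{i=1}^{N}
  \frac{\left|\mathrm{loss}^{\mathrm{NPU}}_{i} - \mathrm{loss}^{\mathrm{GPU}}_{i}\right|}
       {\mathrm{loss}^{\mathrm{GPU}}_{i}} \le 2\%
```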

@as12138 changed the title from "support ASCEND NPU" to "[WIP] support ASCEND NPU" on Feb 21, 2025
@huangk10

Does this PR work on multiple nodes?

@as12138 (Author) commented Feb 21, 2025

> Does this PR work on multiple nodes?

I am currently testing on a single node only and will follow up with multi-node test results.

@as12138 force-pushed the vllm-0.7-npu branch 2 times, most recently from 0afd136 to d496b70 on February 21, 2025 07:59
@as12138 changed the title from "[WIP] support ASCEND NPU" to "Support FSDP worker and vLLM Ascend" on Feb 21, 2025
@as12138 force-pushed the vllm-0.7-npu branch 9 times, most recently from bcdb340 to 8b1b207 on February 22, 2025 02:06
Co-authored-by: Chendong98 <chendong136@huawei.com>