Support FSDP worker and vLLM Ascend #332
Open
as12138 wants to merge 4 commits into volcengine:main from as12138:vllm-0.7-npu
+560
−97
Conversation
Force-pushed from 7510441 to 8e1637e
Does this PR work on multiple nodes?
I am currently testing on a single node only, and will add multi-node testing results later.
Force-pushed from 0afd136 to d496b70
Force-pushed from bcdb340 to 8b1b207
Co-authored-by: Chendong98 <chendong136@huawei.com>
Force-pushed from 8b1b207 to 0b7e274
celestialli reviewed on Feb 22, 2025
[fix] npu does not support torch.distributed.ReduceOp.AVG
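The usual workaround for this limitation is to replace the unsupported average reduction with a SUM all-reduce followed by a division by the world size. The sketch below illustrates that pattern; it is not necessarily the exact change in this commit, and the helper name `all_reduce_mean` is illustrative.

```python
import torch
import torch.distributed as dist


def all_reduce_mean(tensor: torch.Tensor) -> torch.Tensor:
    """Average a tensor across ranks without using ReduceOp.AVG.

    HCCL on Ascend NPU does not implement ReduceOp.AVG, so we emulate it
    with a SUM all-reduce followed by a division by the world size.
    """
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    tensor.div_(dist.get_world_size())
    return tensor
```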
Force-pushed from b0e07ca to 62af61c
This PR adds support for the Ascend NPU backend.
Co-authored-by: Chendong98 chendong136@huawei.com
Co-authored-by: zheliuyu 15750543867@163.com
In this PR, we add the capability to detect the NPU device type, and we also add a new script for training on NPU.
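As a rough illustration of the device-detection part, the sketch below shows one common way to check for an Ascend NPU via `torch_npu` and fall back to CUDA or CPU. The function names here are illustrative, not necessarily the ones added in this PR.

```python
import torch


def is_npu_available() -> bool:
    """Return True if torch_npu is installed and an Ascend NPU is visible."""
    try:
        import torch_npu  # noqa: F401  # importing registers the "npu" device with torch
    except ImportError:
        return False
    return torch.npu.is_available()


def get_device_name() -> str:
    """Return the device type string to use: "npu", "cuda", or "cpu"."""
    if is_npu_available():
        return "npu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```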
Here is the change list:
Here is our roadmap: RoadMap
News
[2025.02.21] The GRPO algorithm is supported for training on NPU with the FSDP backend.
Requirements
We used this PR to test on both Ascend NPU and GPU, to ensure that the same code runs on different devices. The hardware is 8 Atlas 800T A2 and 8 A100. Other software information is shown in the following table.
About mean error

Due to differences in hardware architecture, we cannot guarantee that the loss on Ascend NPU is exactly the same as on GPU. In our experience, a loss difference of less than 2% is acceptable; if the difference is greater than 2%, we will try to fix it. The calculation formula is as follows.
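The original formula image is not reproduced here; a plausible reconstruction, assuming the metric is the mean relative difference of the per-step losses, is:

$$
\text{mean error} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| \mathrm{loss}^{\mathrm{NPU}}_{i} - \mathrm{loss}^{\mathrm{GPU}}_{i} \right|}{\mathrm{loss}^{\mathrm{GPU}}_{i}} \times 100\%
$$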
N represents the number of training steps. For more information, please refer to