[fix] Improve the params template for generation #351

Open: wants to merge 2 commits into base: main


BearBiscuit05

Fixes issue #331.

@vermouth1992
Collaborator

Could you help add a test of Qwen 0.5B generation to protect this functionality?

@BearBiscuit05
Author

Sure. I tested with Qwen 0.5B on a single machine. Which directory under "test" should the new test go in?

@vermouth1992
Collaborator

Could you create a new folder under test named "generation"? In that folder, create a new bash script that runs Qwen 0.5B for generation. Then call the generation script from https://github.com/volcengine/verl/blob/main/.github/workflows/vllm.yml#L49 by creating a new test item. Thanks!

@BearBiscuit05
Author

Running with 1 GPU works normally, but with nproc_per_node > 1 it fails with the error "Duplicate GPU detected: rank 0 and rank 1 both on CUDA device 31000". I'm not sure whether this comes from the parameter configuration or from a hardware problem. Could you help me identify the root cause?
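
Editorial note for readers who hit the same message: it indicates that two ranks in one process group resolved to the same physical GPU. The sketch below only illustrates that condition in plain Python. The function name `find_duplicate_devices` and the sample rank-to-device mappings are hypothetical and are not part of verl, Ray, or any GPU library; the device identifiers are treated as opaque strings.

```python
from collections import defaultdict


def find_duplicate_devices(rank_to_device):
    """Return {device: [ranks]} for every device claimed by more than one rank.

    `rank_to_device` maps a rank number to an opaque device identifier.
    """
    by_device = defaultdict(list)
    for rank, device in sorted(rank_to_device.items()):
        by_device[device].append(rank)
    # Keep only the devices that are shared between ranks.
    return {dev: ranks for dev, ranks in by_device.items() if len(ranks) > 1}


# Hypothetical mapping matching the reported symptom: both ranks share a device.
broken = {0: "31000", 1: "31000"}
# Hypothetical healthy mapping: each rank has its own device.
healthy = {0: "31000", 1: "4b000"}

print(find_duplicate_devices(broken))
print(find_duplicate_devices(healthy))
```

An empty result means every rank has its own device. A non-empty result points at the ranks whose device assignment (for example, the visible-device settings of each worker) should be inspected first.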

@vermouth1992
Collaborator

vermouth1992 commented Feb 23, 2025

Could you check the version of ray? And could you successfully run normal PPO training?

@BearBiscuit05
Author

The Ray version is 2.10, and PPO ran successfully on 2 × A100, so I think it may be a parameter problem. I will check it tomorrow.

@vermouth1992
Collaborator

You can either set max_colocate_count to 1 (https://github.com/volcengine/verl/blob/main/verl/single_controller/ray/base.py#L55) or upgrade Ray to the latest version to resolve this problem.
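
Editorial note: when deciding between the two options, compare version strings numerically, because a plain string comparison ranks "2.10" below "2.9". A minimal sketch under stated assumptions: the helper names are hypothetical, and the thread does not say which Ray release first contains the fix, so `first_fixed` is a placeholder the caller must supply.

```python
def parse_version(text):
    """Turn '2.10.0' into (2, 10, 0); a non-numeric suffix such as 'rc1' is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first component with no leading digits
        parts.append(int(digits))
    return tuple(parts)


def needs_colocate_workaround(installed, first_fixed):
    """True if `installed` is older than `first_fixed` (a caller-supplied assumption)."""
    return parse_version(installed) < parse_version(first_fixed)


# "2.99" is a made-up placeholder, not a real Ray release number.
print(needs_colocate_workaround("2.10", first_fixed="2.99"))
```

If the check returns True, the max_colocate_count setting mentioned above is the option that does not require changing the installed Ray.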

@BearBiscuit05
Author

That's great! Generation now runs successfully with multiple GPUs and TP>1. Should the test script also set TP>1?


2 participants