[FIX]avoid initialize process group when using a single GPU #2496

Closed

Conversation

jeejeelee
Contributor

@jeejeelee jeejeelee commented Jan 18, 2024

Refer to:
#117
#244
#565
#654

@simon-mo simon-mo self-assigned this Jan 18, 2024
@simon-mo
Collaborator

This is great! Please add a test and let me know when it is ready for review!

@jeejeelee
Contributor Author

@simon-mo, I have completed the coding and tested it with an example similar to the one below. Could you please take some time to review this PR?

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm0 = LLM(model="facebook/opt-125m", trust_remote_code=True, gpu_memory_utilization=0.3)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm0.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Create an LLM.
llm1 = LLM(model="facebook/opt-125m", trust_remote_code=True, gpu_memory_utilization=0.5)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm1.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

P.S.: on my local machine, the model I actually tested with was chatglm3.

@jeejeelee jeejeelee changed the title [WIP]avoid initialize process group when using a single GPU [FIX]avoid initialize process group when using a single GPU Jan 19, 2024
@simon-mo simon-mo self-requested a review January 19, 2024 16:44
@jeejeelee
Contributor Author

@simon-mo I apologize for disturbing you; I would like to know what I should do next.

@simon-mo
Collaborator

Thanks for the ping. Hmm, the final approach is not what I had in mind, though. Is monkey patching necessary here? It looks like we can just change the functions inside vLLM to achieve what you wanted.

@jeejeelee
Contributor Author

@simon-mo Thanks for your review. In my approach, monkey patching is necessary.

The reason for using this approach instead of changing the functions inside vLLM is to avoid modifying too much of the original code (such as Worker._init_distributed_environment and get_tensor_model_parallel_rank in parallel_state.py). By using monkey patching, the main modification can be restricted to _init_single_gpu_config.

Of course, this is just my personal understanding. You may well have better methods or deeper considerations in mind; please let me know.
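
To make the idea concrete, the patching step could look roughly like the sketch below (the names follow the wording above, but this is only an illustration, not the exact PR code):

import types

def _init_single_gpu_config(parallel_state):
    # Hypothetical patch point: replace the tensor-parallel helpers with
    # single-GPU constants so that no torch.distributed process group is
    # created on the single-GPU path.
    parallel_state.get_tensor_model_parallel_rank = lambda: 0
    parallel_state.get_tensor_model_parallel_world_size = lambda: 1

# Illustrative usage against a stand-in for vLLM's parallel_state module:
fake_parallel_state = types.SimpleNamespace()
_init_single_gpu_config(fake_parallel_state)
assert fake_parallel_state.get_tensor_model_parallel_world_size() == 1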

@simon-mo
Collaborator

Please modify the vLLM code directly so that the end result is simple and maintainable.

@jeejeelee
Contributor Author

@simon-mo Thanks for your review. Since you don't agree with my approach, I will close this PR for now and think about how to modify the vLLM code directly.
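
For example (just an illustrative sketch, not vLLM's actual function), the direct change could be as simple as guarding the process-group setup on the world size:

import torch
import torch.distributed

def init_distributed_environment(world_size: int, rank: int, init_method: str) -> None:
    # Illustrative guard: with a single GPU there is nothing to synchronize,
    # so skip creating a process group entirely.
    if world_size == 1:
        return
    torch.distributed.init_process_group(
        backend="nccl",
        world_size=world_size,
        rank=rank,
        init_method=init_method,
    )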

@jeejeelee jeejeelee closed this Jan 24, 2024
@simon-mo
Collaborator

Thank you for understanding!
