[core] move parallel sampling out from vllm core #9302
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
The OpenAI API behavior is: every sequence in parallel sampling will be assigned a unique index. This is the test script:

from openai import OpenAI

api_key = ''
client = OpenAI(
    api_key=api_key,
)
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Repeat after me: apple."}],
    stream=True,
    max_tokens=5,
    n=1,
)
for chunk in stream:
    print(chunk)

and the output:

when I use
I get two tokens from sequence 0 at first, and then two tokens from sequence 1.
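As an illustration only (not part of this PR), here is a minimal sketch of how a client could group the interleaved stream by choice index when n > 1, reusing the same OpenAI client setup as the script above; the n=2 value is an assumption for demonstration:

from collections import defaultdict
from openai import OpenAI

client = OpenAI(api_key='')  # placeholder key, as in the script above
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Repeat after me: apple."}],
    stream=True,
    max_tokens=5,
    n=2,  # two parallel sequences, for illustration
)

# Each streamed chunk tags its choices with an index; accumulate text per sequence.
texts = defaultdict(str)
for chunk in stream:
    for choice in chunk.choices:
        if choice.delta.content:
            texts[choice.index] += choice.delta.content

for index, text in sorted(texts.items()):
    print(f"sequence {index}: {text!r}")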
@robertgshaw2-neuralmagic can you help take a look? I met a strange error:
will fail in this implementation. It is surprising that CI actually passes ... it errors on my local dev machine.
@youkaichao how would this PR impact
@afeldman-nm
Could you explain the benefit of doing so? It seems that with this change, the scheduler can no longer make decisions based on the number of sequences within a sequence group.
Yes, the scheduler will only process single sequences in the future, to keep the core code simple.
This modification makes the "fork" mechanism of vLLM completely unused. Previously, for a request with n > 1, its prompt was prefilled only once, and then the sequence was "forked" into n sequences to avoid redundant computation. After this modification, a request with n > 1 has to prefill its prompt n times.

from vllm import LLM, SamplingParams
import time

# Sample prompts.
prompts = [
    "Once upon a time, there was a king.",
]
# Create a sampling params object.
sampling_params = SamplingParams(seed=42, temperature=0.1, max_tokens=1, n=100)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
# warm up
outputs = llm.generate(prompts, sampling_params)

begin_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()
print(f"{end_time - begin_time}s")

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Yes, this is intended. Please use prefix caching to speed up and share the prefill. Sharing will no longer be hardcoded in the scheduler; it will only happen through prefix caching. I'm not sure if prefix caching currently supports sharing in the same batch. If you want optimal performance, I would suggest running a
Thank you for clarifying. Prefix caching does support sharing in the same batch, though the performance gain is not as large as with the "fork" mechanism.
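For reference, a rough sketch of the suggestion above applied to the earlier benchmark, assuming the enable_prefix_caching argument of vllm.LLM (the flag name may differ across vLLM versions):

from vllm import LLM, SamplingParams
import time

prompts = ["Once upon a time, there was a king."]
sampling_params = SamplingParams(seed=42, temperature=0.1, max_tokens=1, n=100)
# enable_prefix_caching lets repeated prompt prefixes reuse cached KV blocks.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enable_prefix_caching=True)

llm.generate(prompts, sampling_params)  # warm up (also populates the prefix cache)
begin_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print(f"{time.time() - begin_time}s")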
Try to hide the sequence group from the core by handling parallel sampling in the LLM engine.
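A rough, illustrative sketch of the idea (hypothetical names, not the PR's actual classes): the engine fans a request with n > 1 out into n single-sequence requests before they reach the core, and merges the outputs back under the original request id.

from dataclasses import dataclass, field

@dataclass
class ParentRequest:
    # Hypothetical holder for a parallel-sampling request kept outside the core.
    request_id: str
    prompt: str
    n: int
    child_outputs: dict = field(default_factory=dict)

def fan_out(parent: ParentRequest):
    # Each child gets a derived id; the core sees them as independent single-sequence requests.
    return [(f"{parent.request_id}-{i}", parent.prompt) for i in range(parent.n)]

def collect(parent: ParentRequest, child_id: str, text: str):
    # Merge a finished child back under its parent; return all n outputs once complete.
    index = int(child_id.rsplit("-", 1)[1])
    parent.child_outputs[index] = text
    if len(parent.child_outputs) == parent.n:
        return [parent.child_outputs[i] for i in range(parent.n)]
    return None

# Usage: fan out, pretend the core finished each child, then aggregate.
parent = ParentRequest(request_id="req-0", prompt="Once upon a time", n=3)
results = None
for child_id, prompt in fan_out(parent):
    results = collect(parent, child_id, f"<completion for {child_id}>") or results
print(results)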