[core] move parallel sampling out from vllm core #9302
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
The OpenAI API behavior is: every sequence in parallel sampling will be assigned a unique index. This is the test script:

from openai import OpenAI

api_key = ''
client = OpenAI(
    api_key=api_key,
)
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Repeat after me: apple."}],
    stream=True,
    max_tokens=5,
    n=1,
)
for chunk in stream:
    print(chunk)

and the output:

when I use
I get two tokens from sequence 0 at first, and then two tokens from sequence 1.
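As an illustration only (not part of this PR), here is a minimal sketch of how a client could group the interleaved stream by choice index when n > 1, reusing the same OpenAI client setup as the script above; the n=2 value is an assumption for demonstration:

from collections import defaultdict
from openai import OpenAI

client = OpenAI(api_key='')  # placeholder key, as in the script above
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Repeat after me: apple."}],
    stream=True,
    max_tokens=5,
    n=2,  # two parallel sequences, for illustration
)

# Each streamed chunk tags its choices with an index; accumulate text per sequence.
texts = defaultdict(str)
for chunk in stream:
    for choice in chunk.choices:
        if choice.delta.content:
            texts[choice.index] += choice.delta.content

for index, text in sorted(texts.items()):
    print(f"sequence {index}: {text!r}")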
@robertgshaw2-neuralmagic can you help take a look? I met a strange error:
will fail in this implementation. It is surprising that CI actually passes ... it errors on my local dev machine.
@youkaichao how would this PR impact
@afeldman-nm
Could you explain the benefit of doing so? It seems that with this change, the scheduler can no longer make decisions based on the number of sequences within a sequence group.
Yes, the scheduler will only process single sequences in the future, to keep the core code simple.
This modification makes the "fork" mechanism of vLLM completely unused. Previously, for a request with n > 1, its prompt was prefilled only once, and then the sequence was "forked" into n sequences to avoid redundant computation. After this modification, a request with n > 1 has to prefill its prompt n times.

from vllm import LLM, SamplingParams
import time

# Sample prompts.
prompts = [
    "Once upon a time, there was a king.",
]
# Create a sampling params object.
sampling_params = SamplingParams(seed=42, temperature=0.1, max_tokens=1, n=100)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
# warm up
outputs = llm.generate(prompts, sampling_params)

begin_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()
print(f"{end_time - begin_time}s")

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Yes, this is intended. Please use prefix caching to speed up and share the prefill. Sharing will no longer be hardcoded in the scheduler; it will only happen through prefix caching. I'm not sure if prefix caching currently supports sharing in the same batch. If you want optimal performance, I would suggest running a
Thank you for clarifying. Prefix caching does support sharing in the same batch, though the performance gain is not as large as with the "fork" mechanism.
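For reference, a rough sketch of the suggestion above applied to the earlier benchmark, assuming the enable_prefix_caching argument of vllm.LLM (the flag name may differ across vLLM versions):

from vllm import LLM, SamplingParams
import time

prompts = ["Once upon a time, there was a king."]
sampling_params = SamplingParams(seed=42, temperature=0.1, max_tokens=1, n=100)
# enable_prefix_caching lets repeated prompt prefixes reuse cached KV blocks.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enable_prefix_caching=True)

llm.generate(prompts, sampling_params)  # warm up (also populates the prefix cache)
begin_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print(f"{time.time() - begin_time}s")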
Try to hide the sequence group from the core by handling parallel sampling in the LLM engine.
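A rough, illustrative sketch of the idea (hypothetical names, not the PR's actual classes): the engine fans a request with n > 1 out into n single-sequence requests before they reach the core, and merges the outputs back under the original request id.

from dataclasses import dataclass, field

@dataclass
class ParentRequest:
    # Hypothetical holder for a parallel-sampling request kept outside the core.
    request_id: str
    prompt: str
    n: int
    child_outputs: dict = field(default_factory=dict)

def fan_out(parent: ParentRequest):
    # Each child gets a derived id; the core sees them as independent single-sequence requests.
    return [(f"{parent.request_id}-{i}", parent.prompt) for i in range(parent.n)]

def collect(parent: ParentRequest, child_id: str, text: str):
    # Merge a finished child back under its parent; return all n outputs once complete.
    index = int(child_id.rsplit("-", 1)[1])
    parent.child_outputs[index] = text
    if len(parent.child_outputs) == parent.n:
        return [parent.child_outputs[i] for i in range(parent.n)]
    return None

# Usage: fan out, pretend the core finished each child, then aggregate.
parent = ParentRequest(request_id="req-0", prompt="Once upon a time", n=3)
results = None
for child_id, prompt in fan_out(parent):
    results = collect(parent, child_id, f"<completion for {child_id}>") or results
print(results)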