[BugFix] Fix use of per-request seed with pipeline parallel #6698

Merged: 9 commits merged into vllm-project:main from njhill:fix-pp-seed on Jul 30, 2024

Conversation

njhill (Member) commented on Jul 23, 2024

The current per-request seed implementation assumes that sampling happens in the same process as the driver worker, since the SequenceGroup objects used by the sampler are where the torch.Generators are hung to maintain state between steps.

This assumption is no longer always true, and in particular it is why seeds are currently broken with pipeline parallel.

This PR moves the per-sequence-group generator state to a dict in the final-rank PP worker, where the sampler resides. Now that finished_requests_ids are passed in the execute_model calls, they can be used to clean out the state for completed requests.

For speculative decoding, the generators from the scorer model worker are used by the rejection sampler.

Changes are also included to simplify/optimize the seeded spec decoding paths, and a seeded mlp speculator CI test is added.
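For illustration, a minimal sketch of the mechanism described above, with simplified names (RequestSeedState and its methods are placeholders, not the actual vLLM classes): the rank that owns the sampler keeps one torch.Generator per seeded request and prunes them using the finished request ids passed with each execute_model call.

```python
from typing import Dict, List, Optional

import torch


class RequestSeedState:
    """Illustrative per-request generator state kept on the last-rank PP worker."""

    def __init__(self, device: str = "cpu"):
        self.device = device
        # request_id -> torch.Generator; lives only where sampling happens.
        self.generators: Dict[str, torch.Generator] = {}

    def get_generator(self, request_id: str,
                      seed: Optional[int]) -> Optional[torch.Generator]:
        # Unseeded requests use the default RNG; seeded requests keep a
        # dedicated generator so its state carries across decoding steps.
        if seed is None:
            return None
        gen = self.generators.get(request_id)
        if gen is None:
            gen = torch.Generator(device=self.device).manual_seed(seed)
            self.generators[request_id] = gen
        return gen

    def cleanup(self, finished_request_ids: List[str]) -> None:
        # Called with the finished_requests_ids passed into execute_model.
        for request_id in finished_request_ids:
            self.generators.pop(request_id, None)
```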

Fixes #6449

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, since it is required for merging (or just use auto-merge).

To run the full CI, you can do one of the following:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

njhill (Member, Author) commented on Jul 23, 2024

Draft because I'm still making sure this works properly with spec decoding.

```python
model_input.query_lens,
self.device,
self.pin_memory)
if get_pp_group().is_last_rank:
```
njhill (Member, Author):
@andoorve I noticed this small optimization - only need to prepare the sampling metadata tensors in the last rank

andoorve (Collaborator):
Thanks! There's a lot of these optimizations we didn't include last time. Just from what I can tell, I think this one in particular might not help E2E (will just increase the bubble size on the other workers) but doesn't hurt to have

njhill (Member, Author):
Yeah, I figured that, but it doesn't hurt to skip the redundant work I guess

andoorve (Collaborator):
Yup for sure
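For illustration, the shape of the optimization discussed in this thread, as a self-contained mock (the function name and the SamplingMetadata stand-in are hypothetical, not the vLLM code): only the last pipeline-parallel rank samples, so only it needs to build the sampling metadata tensors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SamplingMetadata:  # stand-in for vLLM's real sampling metadata
    query_lens: List[int]


def prepare_sampling_metadata(query_lens: List[int],
                              is_last_rank: bool) -> Optional[SamplingMetadata]:
    # Intermediate PP ranks never run the sampler, so skip the work there.
    if not is_last_rank:
        return None
    return SamplingMetadata(query_lens=query_lens)


assert prepare_sampling_metadata([3, 5], is_last_rank=False) is None
assert prepare_sampling_metadata([3, 5], is_last_rank=True) is not None
```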

@njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 27, 2024
@njhill njhill marked this pull request as ready for review July 27, 2024 01:26
@simon-mo simon-mo requested a review from andoorve July 27, 2024 01:48
andoorve (Collaborator) commented:
Will take a look today!

andoorve (Collaborator) left a review:

As long as the above is satisfied, I don't have an issue with it from my POV. It would also be good to check that it works with the settings below from @aurickq, which previously did not work with PP:

This can be triggered using PP=2 and the OpenAI server with the following repro:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server running locally
# (base_url and api_key here are illustrative).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

kwargs = {
    'model': 'meta-llama/Meta-Llama-3-70B',
    'prompt': [
        [14924, 25, 14693, 39710, 374, 264],
        [14924, 25, 14693, 39710, 374, 264],
    ],
    'echo': True,
    'max_tokens': 0,
    'temperature': 0.0,
    'logprobs': 1,
    'seed': 1234,
}

completion = client.completions.create(**kwargs)
```

Other than that, I would recommend that someone look at the spec decoding part.

"""

# Clean up generators from completed requests
if finished_request_ids:
andoorve (Collaborator):
Could it be possible that we run into the same abort issue we had previously? I.e. could we abort, get rid of a generator on another "virtual engine" and then find that we're missing the right generator when we get to the part where we use it?

njhill (Member, Author):
@andoorve I think this is fine, since the dict storing the generators is indexed by request id and shared between all of the virtual engines. In fact, in this case the finished_request_ids come from the VE-specific scheduler, so the VE tasks will only clean up their own entries; but even if that weren't the case, it would still work fine.
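To illustrate the point (with made-up request ids, not actual vLLM code): because the dict is shared and keyed by request id, a cleanup triggered by one virtual engine cannot remove another VE's generator.

```python
import torch

# Illustrative only: one dict shared by all virtual engines, keyed by request id.
generators = {
    "req-ve0-1": torch.Generator().manual_seed(1234),  # scheduled by VE 0
    "req-ve1-7": torch.Generator().manual_seed(5678),  # scheduled by VE 1
}

# Each VE's scheduler reports only its own finished requests to execute_model...
for request_id in ["req-ve0-1"]:  # finished_requests_ids from VE 0
    generators.pop(request_id, None)

# ...so an abort handled by one VE cannot remove another VE's generator.
assert "req-ve1-7" in generators
```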

njhill (Member, Author) commented on Jul 29, 2024

Thanks @andoorve! I should also point out that this PP seeded generation is now tested in the CI, with this addition to the compare_two_settings method used by test_pipeline_parallel.py.

andoorve (Collaborator) commented:

> Thanks @andoorve! I should also point out that this PP seeded generation is now tested in the CI, with this addition to the compare_two_settings method used by test_pipeline_parallel.py.

Oh, thanks for pointing that out! I missed it. Should we add a test with seed and batched prompts as well? Apologies if I missed it. I think the condition that @aurickq triggered only happens with multiple prompts.

njhill (Member, Author) commented on Jul 30, 2024

Thanks @andoorve, I've now added a comparison test with seed and multiple prompts.
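For illustration only, a standalone version of such a check might look like the following; this is a simplified stand-in, not the actual compare_two_settings helper, and the two server URLs (one deployment without PP, one with PP=2) are assumptions.

```python
from openai import OpenAI

PROMPTS = ["Solar wind is a", "Solar wind is a"]  # batched prompts on purpose
KWARGS = dict(
    model="meta-llama/Meta-Llama-3-70B",
    prompt=PROMPTS,
    max_tokens=16,
    temperature=1.0,
    seed=1234,
)


def sample(base_url: str):
    # Assumes a vLLM OpenAI-compatible server is listening at base_url.
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    completion = client.completions.create(**KWARGS)
    return [choice.text for choice in completion.choices]


# With the same seed, the PP deployment must produce the same completions
# as the reference (non-PP) deployment.
assert sample("http://localhost:8000/v1") == sample("http://localhost:8001/v1")
```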

njhill (Member, Author) commented on Jul 30, 2024

@tdoublep would you mind looking over the spec decoding related changes?

youkaichao (Member) left a review:

I gave you approval to unblock you. Please feel free to merge when you think it is ready @njhill @andoorve

tdoublep (Member) left a review:

LGTM

@simon-mo simon-mo merged commit 5cf9254 into vllm-project:main Jul 30, 2024
14 of 16 checks passed
@njhill njhill deleted the fix-pp-seed branch July 30, 2024 17:40
njhill (Member, Author) commented on Jul 30, 2024

Thanks @tdoublep @andoorve!

tjohnson31415 added a commit to tjohnson31415/vllm that referenced this pull request Jul 30, 2024
* upstream/main:
  [Build] Temporarily Disable Kernels and LoRA tests (vllm-project#6961)
  [core][misc] improve free_finished_seq_groups (vllm-project#6865)
  [Kernel] Remove scaled_fp8_quant kernel padding footgun (vllm-project#6842)
  [Bugfix] Fix tensorizer memory profiling bug during testing (vllm-project#6881)
  [OpenVINO] Updated OpenVINO requirements and build docs (vllm-project#6948)
  [Kernel] Squash a few more warnings (vllm-project#6914)
  [BugFix] Fix use of per-request seed with pipeline parallel (vllm-project#6698)
  [Doc] Super tiny fix doc typo (vllm-project#6949)
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues: [Bug]: Seed issue with Pipeline Parallel

5 participants