[BugFix] Fix use of per-request seed with pipeline parallel #6698

Merged: 9 commits merged into vllm-project:main from njhill:fix-pp-seed on Jul 30, 2024

Conversation

njhill (Member) commented on Jul 23, 2024

The current per-request seed implementation assumes that sampling happens in the same process as the driver worker, since the SequenceGroup objects used by the sampler are where the torch.Generators are hung to maintain state between steps.

This assumption is no longer always true, and in particular it is why seeds are currently broken with pipeline parallel.

This PR moves the per-sequence-group generator state to a dict in the final-rank PP worker, where the sampler resides. Now that finished_requests_ids are passed in the execute_model calls, they can be used to clean out the state for completed requests.

For speculative decoding, the generators from the scorer model worker are used by the rejection sampler.

Changes are also included to simplify/optimize the seeded spec decoding paths, and a seeded mlp speculator CI test is added.
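For illustration, a minimal sketch of the mechanism described above, with simplified names (RequestSeedState and its methods are placeholders, not the actual vLLM classes): the rank that owns the sampler keeps one torch.Generator per seeded request and prunes them using the finished request ids passed with each execute_model call.

```python
from typing import Dict, List, Optional

import torch


class RequestSeedState:
    """Illustrative per-request generator state kept on the last-rank PP worker."""

    def __init__(self, device: str = "cpu"):
        self.device = device
        # request_id -> torch.Generator; lives only where sampling happens.
        self.generators: Dict[str, torch.Generator] = {}

    def get_generator(self, request_id: str,
                      seed: Optional[int]) -> Optional[torch.Generator]:
        # Unseeded requests use the default RNG; seeded requests keep a
        # dedicated generator so its state carries across decoding steps.
        if seed is None:
            return None
        gen = self.generators.get(request_id)
        if gen is None:
            gen = torch.Generator(device=self.device).manual_seed(seed)
            self.generators[request_id] = gen
        return gen

    def cleanup(self, finished_request_ids: List[str]) -> None:
        # Called with the finished_requests_ids passed into execute_model.
        for request_id in finished_request_ids:
            self.generators.pop(request_id, None)
```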

Fixes #6449

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, since it is required for merging (or just use auto-merge).

To run the full CI, you can do one of the following:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

njhill (Member, Author) commented on Jul 23, 2024

Draft because I'm still making sure this works properly with spec decoding.

```python
model_input.query_lens,
self.device,
self.pin_memory)
if get_pp_group().is_last_rank:
```
njhill (Member, Author):
@andoorve I noticed this small optimization - only need to prepare the sampling metadata tensors in the last rank

andoorve (Collaborator):
Thanks! There's a lot of these optimizations we didn't include last time. Just from what I can tell, I think this one in particular might not help E2E (will just increase the bubble size on the other workers) but doesn't hurt to have

njhill (Member, Author):
Yeah, I figured that, but it doesn't hurt to skip the redundant work I guess

andoorve (Collaborator):
Yup for sure
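For illustration, the shape of the optimization discussed in this thread, as a self-contained mock (the function name and the SamplingMetadata stand-in are hypothetical, not the vLLM code): only the last pipeline-parallel rank samples, so only it needs to build the sampling metadata tensors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SamplingMetadata:  # stand-in for vLLM's real sampling metadata
    query_lens: List[int]


def prepare_sampling_metadata(query_lens: List[int],
                              is_last_rank: bool) -> Optional[SamplingMetadata]:
    # Intermediate PP ranks never run the sampler, so skip the work there.
    if not is_last_rank:
        return None
    return SamplingMetadata(query_lens=query_lens)


assert prepare_sampling_metadata([3, 5], is_last_rank=False) is None
assert prepare_sampling_metadata([3, 5], is_last_rank=True) is not None
```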

@njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 27, 2024
@njhill njhill marked this pull request as ready for review July 27, 2024 01:26
@simon-mo simon-mo requested a review from andoorve July 27, 2024 01:48
andoorve (Collaborator) commented:
Will take a look today!

andoorve (Collaborator) left a review:

As long as the above is satisfied, I don't have an issue with it from my POV. It would also be good to check that it works with the settings below from @aurickq, which previously did not work with PP:

This can be triggered using PP=2 and the OpenAI server with the following repro:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server running locally
# (base_url and api_key here are illustrative).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

kwargs = {
    'model': 'meta-llama/Meta-Llama-3-70B',
    'prompt': [
        [14924, 25, 14693, 39710, 374, 264],
        [14924, 25, 14693, 39710, 374, 264],
    ],
    'echo': True,
    'max_tokens': 0,
    'temperature': 0.0,
    'logprobs': 1,
    'seed': 1234,
}

completion = client.completions.create(**kwargs)
```

Other than that, I would recommend that someone look at the spec decoding part.

"""

# Clean up generators from completed requests
if finished_request_ids:
andoorve (Collaborator):
Could it be possible that we run into the same abort issue we had previously? I.e. could we abort, get rid of a generator on another "virtual engine" and then find that we're missing the right generator when we get to the part where we use it?

njhill (Member, Author):
@andoorve I think this is fine, since the dict storing the generators is indexed by request id and shared between all of the virtual engines. In fact, in this case the finished_request_ids come from the VE-specific scheduler, so the VE tasks will only clean up their own entries; but even if that weren't the case, it would still work fine.
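To illustrate the point (with made-up request ids, not actual vLLM code): because the dict is shared and keyed by request id, a cleanup triggered by one virtual engine cannot remove another VE's generator.

```python
import torch

# Illustrative only: one dict shared by all virtual engines, keyed by request id.
generators = {
    "req-ve0-1": torch.Generator().manual_seed(1234),  # scheduled by VE 0
    "req-ve1-7": torch.Generator().manual_seed(5678),  # scheduled by VE 1
}

# Each VE's scheduler reports only its own finished requests to execute_model...
for request_id in ["req-ve0-1"]:  # finished_requests_ids from VE 0
    generators.pop(request_id, None)

# ...so an abort handled by one VE cannot remove another VE's generator.
assert "req-ve1-7" in generators
```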

njhill (Member, Author) commented on Jul 29, 2024

Thanks @andoorve! I should also point out that this PP seeded generation is now tested in the CI, with this addition to the compare_two_settings method used by test_pipeline_parallel.py.

andoorve (Collaborator) commented:

> Thanks @andoorve! I should also point out that this PP seeded generation is now tested in the CI, with this addition to the compare_two_settings method used by test_pipeline_parallel.py.

Oh, thanks for pointing that out! I missed it. Should we add a test with seed and batched prompts as well? Apologies if I missed it. I think the condition that @aurickq triggered only happens with multiple prompts.

njhill (Member, Author) commented on Jul 30, 2024

Thanks @andoorve, I've now added a comparison test with seed and multiple prompts.
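For illustration only, a standalone version of such a check might look like the following; this is a simplified stand-in, not the actual compare_two_settings helper, and the two server URLs (one deployment without PP, one with PP=2) are assumptions.

```python
from openai import OpenAI

PROMPTS = ["Solar wind is a", "Solar wind is a"]  # batched prompts on purpose
KWARGS = dict(
    model="meta-llama/Meta-Llama-3-70B",
    prompt=PROMPTS,
    max_tokens=16,
    temperature=1.0,
    seed=1234,
)


def sample(base_url: str):
    # Assumes a vLLM OpenAI-compatible server is listening at base_url.
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    completion = client.completions.create(**KWARGS)
    return [choice.text for choice in completion.choices]


# With the same seed, the PP deployment must produce the same completions
# as the reference (non-PP) deployment.
assert sample("http://localhost:8000/v1") == sample("http://localhost:8001/v1")
```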

njhill (Member, Author) commented on Jul 30, 2024

@tdoublep would you mind looking over the spec decoding related changes?

youkaichao (Member) left a review:

I gave you approval to unblock you. Please feel free to merge when you think it is ready @njhill @andoorve

tdoublep (Member) left a review:

LGTM

@simon-mo simon-mo merged commit 5cf9254 into vllm-project:main Jul 30, 2024
14 of 16 checks passed
@njhill njhill deleted the fix-pp-seed branch July 30, 2024 17:40
njhill (Member, Author) commented on Jul 30, 2024

Thanks @tdoublep @andoorve!

tjohnson31415 added a commit to tjohnson31415/vllm that referenced this pull request Jul 30, 2024
* upstream/main:
  [Build] Temporarily Disable Kernels and LoRA tests (vllm-project#6961)
  [core][misc] improve free_finished_seq_groups (vllm-project#6865)
  [Kernel] Remove scaled_fp8_quant kernel padding footgun (vllm-project#6842)
  [Bugfix] Fix tensorizer memory profiling bug during testing (vllm-project#6881)
  [OpenVINO] Updated OpenVINO requirements and build docs (vllm-project#6948)
  [Kernel] Squash a few more warnings (vllm-project#6914)
  [BugFix] Fix use of per-request seed with pipeline parallel (vllm-project#6698)
  [Doc] Super tiny fix doc typo (vllm-project#6949)
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues: [Bug]: Seed issue with Pipeline Parallel

5 participants