
[Bugfix] Fix broadcasting logic for multi_modal_kwargs #6836

Merged: 15 commits merged into vllm-project:main from distributed-llava-next on Jul 31, 2024

Conversation

@DarkLight1337 (Member) commented on Jul 26, 2024

Currently, multi_modal_kwargs are placed on the GPU before the distributed broadcast. Since the broadcast operation only allows flat dictionaries of tensors, custom logic is required to handle nested tensors. At the moment, this logic only handles singly-nested dictionaries and fails to consider nested lists of tensors. Extending it to arbitrarily nested structures, such as interleaved nested dictionaries and lists, would be complicated and even inefficient.

This PR addresses the root issue by keeping the tensors on the CPU before the broadcast operation and only moving them to the GPU right before they are passed to the model. This lets us broadcast them as opaque Python objects, removing the need to handle nested structures.
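
As an illustration, here is a minimal sketch of the idea (hypothetical helper names, not the actual vLLM implementation): nested multi_modal_kwargs stay on the CPU, are broadcast as a single pickled Python object, and are moved to the GPU only right before the forward pass.

```python
import torch
import torch.distributed as dist


def broadcast_multi_modal_kwargs(mm_kwargs, src=0):
    """Broadcast arbitrarily nested CPU data as one opaque Python object."""
    container = [mm_kwargs if dist.get_rank() == src else None]
    # broadcast_object_list pickles the object, so nested dicts/lists of
    # CPU tensors need no flattening logic at all.
    dist.broadcast_object_list(container, src=src)
    return container[0]


def move_to_device(obj, device):
    """Recursively move tensors to the target device right before model input."""
    if isinstance(obj, torch.Tensor):
        return obj.to(device, non_blocking=True)
    if isinstance(obj, dict):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(v, device) for v in obj)
    return obj
```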

FIX #5905 (comment)
FIX #6946

cc @xwjiang2010 @youkaichao


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 marked this pull request as draft on July 26, 2024 16:11
@DarkLight1337 changed the title from "[CI/Build] Run distributed test on LLaVA-NeXT" to "[Bugfix] Fix broadcasting logic for LLaVA-NeXT" on Jul 27, 2024
@DarkLight1337 (Member, Author) commented on Jul 30, 2024

Closing as superseded by #6925.

Reopening to address the comment from @youkaichao.

@DarkLight1337 reopened this on Jul 30, 2024
@DarkLight1337 marked this pull request as ready for review on July 30, 2024 06:21
@DarkLight1337 added and then removed the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Jul 30, 2024
@youkaichao (Member) left a comment:


Thanks for the change! Will take a look tomorrow.

@DarkLight1337 changed the title from "[Bugfix] Fix broadcasting logic for LLaVA-NeXT" to "[Bugfix] Fix broadcasting logic for multi_modal_kwargs" on Jul 30, 2024
@xwjiang2010 (Contributor) commented:

Hmm, does this have a performance impact? Given a CPU tensor, how do the following compare?

  • CPU-side broadcasting followed by multiple CPU-to-GPU transfers
  • Single CPU-to-GPU transfer followed by NCCL GPU-to-GPU broadcast

I feel like on a single node with high-bandwidth NCCL, option 2 may be faster?
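
For reference, a rough sketch of the two options being compared (hypothetical helpers, not code from this PR), assuming every rank already knows the tensor's shape and dtype:

```python
import torch
import torch.distributed as dist


def option_1_cpu_broadcast(cpu_tensor, device, src=0):
    # CPU-side (pickle-based) broadcast; every rank then does its own
    # host-to-device copy.
    obj = [cpu_tensor if dist.get_rank() == src else None]
    dist.broadcast_object_list(obj, src=src)
    return obj[0].to(device)


def option_2_nccl_broadcast(cpu_tensor, device, src=0):
    # One host-to-device copy on the source rank, then an NCCL GPU-to-GPU
    # broadcast into a preallocated buffer on every other rank.
    if dist.get_rank() == src:
        gpu_tensor = cpu_tensor.to(device)
    else:
        gpu_tensor = torch.empty_like(cpu_tensor, device=device)
    dist.broadcast(gpu_tensor, src=src)
    return gpu_tensor
```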

@youkaichao (Member) left a comment:


LGTM, thanks for the refactor to make our lives easier!

@youkaichao added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Jul 30, 2024
@youkaichao (Member) commented:

@xwjiang2010 the tensor to broadcast here is small, and NCCL does not have a win here; it only adds code complexity.

@xwjiang2010 (Contributor) commented on Jul 30, 2024

> @xwjiang2010 the tensor to broadcast here is small, and NCCL does not have a win here; it only adds code complexity.

Yeah, while I am happy that I don't have to think of a way of making nested lists of nested dicts of nested lists work, I just want to make sure that we are all on the same page that these tensors are small.

For example, are they small when there are dozens of queries in a batch? Or multiple images per query? Or when we are sending in embeddings (in the future)? Or when factors like these compound together?

@youkaichao (Member) commented:

In the future we will only broadcast CPU data, without sending any GPU data via broadcast. GPU data should live in the process that produces it.

@xwjiang2010 (Contributor) commented:

And this also applies to things like input_ids, right?

@youkaichao (Member) commented:

Yes, in the future input_ids will live in the worker process, and we will just need to append newly generated tokens, without re-constructing all the input_ids at every iteration.
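
A hypothetical sketch of that future design (made-up names, not part of this PR): each worker keeps its own input_ids and only receives the newly sampled token IDs each step.

```python
import torch


class WorkerSequenceState:
    """Per-worker sequence state: input_ids live on the worker itself."""

    def __init__(self, prompt_token_ids, device):
        # The full prompt is materialized once on the worker's device.
        self.input_ids = torch.tensor(prompt_token_ids, device=device)

    def append_new_tokens(self, new_token_ids):
        # Each decode step appends only the freshly generated tokens,
        # instead of re-broadcasting and rebuilding the whole sequence.
        new_ids = torch.tensor(new_token_ids, device=self.input_ids.device)
        self.input_ids = torch.cat([self.input_ids, new_ids])
```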

@DarkLight1337 merged commit f230cc2 into vllm-project:main on Jul 31, 2024
72 checks passed
@DarkLight1337 deleted the distributed-llava-next branch on July 31, 2024 02:38
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request on Aug 17, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
[Bug]: MiniCPM-Llama3-V-2_5 error when tensor_parallel_size>1
3 participants