
[Bugfix] Fix broadcasting logic for multi_modal_kwargs #6836

Merged: 15 commits merged into vllm-project:main from distributed-llava-next on Jul 31, 2024

Conversation

@DarkLight1337 (Member) commented on Jul 26, 2024

Currently, multi_modal_kwargs are placed on the GPU before the distributed broadcast. Since the broadcast operation only allows flat dictionaries of tensors, custom logic is required to handle nested tensors. At the moment, this logic only handles singly-nested dictionaries and fails to consider nested lists of tensors. Extending it to arbitrarily nested structures, such as interleaved nested dictionaries and lists, would be complicated and even inefficient.

This PR addresses the root issue by keeping the tensors on the CPU before the broadcast operation and only moving them to the GPU right before they are passed to the model. This lets us broadcast them as opaque Python objects, removing the need to handle nested structures.
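
As an illustration, here is a minimal sketch of the idea (hypothetical helper names, not the actual vLLM implementation): nested multi_modal_kwargs stay on the CPU, are broadcast as a single pickled Python object, and are moved to the GPU only right before the forward pass.

```python
import torch
import torch.distributed as dist


def broadcast_multi_modal_kwargs(mm_kwargs, src=0):
    """Broadcast arbitrarily nested CPU data as one opaque Python object."""
    container = [mm_kwargs if dist.get_rank() == src else None]
    # broadcast_object_list pickles the object, so nested dicts/lists of
    # CPU tensors need no flattening logic at all.
    dist.broadcast_object_list(container, src=src)
    return container[0]


def move_to_device(obj, device):
    """Recursively move tensors to the target device right before model input."""
    if isinstance(obj, torch.Tensor):
        return obj.to(device, non_blocking=True)
    if isinstance(obj, dict):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(v, device) for v in obj)
    return obj
```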

FIX #5905 (comment)
FIX #6946

cc @xwjiang2010 @youkaichao


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 marked this pull request as draft on July 26, 2024 16:11
@DarkLight1337 changed the title from "[CI/Build] Run distributed test on LLaVA-NeXT" to "[Bugfix] Fix broadcasting logic for LLaVA-NeXT" on Jul 27, 2024
@DarkLight1337 (Member, Author) commented on Jul 30, 2024

Closing as superseded by #6925.

Reopening to address the comment from @youkaichao.

@DarkLight1337 reopened this on Jul 30, 2024
@DarkLight1337 marked this pull request as ready for review on July 30, 2024 06:21
@DarkLight1337 added and then removed the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Jul 30, 2024
@youkaichao (Member) left a comment:


Thanks for the change! Will take a look tomorrow.

@DarkLight1337 changed the title from "[Bugfix] Fix broadcasting logic for LLaVA-NeXT" to "[Bugfix] Fix broadcasting logic for multi_modal_kwargs" on Jul 30, 2024
@xwjiang2010 (Contributor) commented:

Hmm, does this have a performance impact? Given a CPU tensor, how do the following compare?

  • CPU-side broadcasting followed by multiple CPU-to-GPU transfers
  • Single CPU-to-GPU transfer followed by NCCL GPU-to-GPU broadcast

I feel like on a single node with high-bandwidth NCCL, option 2 may be faster?
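
For reference, a rough sketch of the two options being compared (hypothetical helpers, not code from this PR), assuming every rank already knows the tensor's shape and dtype:

```python
import torch
import torch.distributed as dist


def option_1_cpu_broadcast(cpu_tensor, device, src=0):
    # CPU-side (pickle-based) broadcast; every rank then does its own
    # host-to-device copy.
    obj = [cpu_tensor if dist.get_rank() == src else None]
    dist.broadcast_object_list(obj, src=src)
    return obj[0].to(device)


def option_2_nccl_broadcast(cpu_tensor, device, src=0):
    # One host-to-device copy on the source rank, then an NCCL GPU-to-GPU
    # broadcast into a preallocated buffer on every other rank.
    if dist.get_rank() == src:
        gpu_tensor = cpu_tensor.to(device)
    else:
        gpu_tensor = torch.empty_like(cpu_tensor, device=device)
    dist.broadcast(gpu_tensor, src=src)
    return gpu_tensor
```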

@youkaichao (Member) left a comment:


LGTM, thanks for the refactor to make our lives easier!

@youkaichao added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Jul 30, 2024
@youkaichao (Member) commented:

@xwjiang2010 the tensor to broadcast here is small, and NCCL does not have a win here; it only adds code complexity.

@xwjiang2010 (Contributor) commented on Jul 30, 2024

> @xwjiang2010 the tensor to broadcast here is small, and NCCL does not have a win here; it only adds code complexity.

Yeah, while I am happy that I don't have to think of a way of making nested lists of nested dicts of nested lists work, I just want to make sure that we are all on the same page that these tensors are small.

For example, are they small when there are dozens of queries in a batch? Or multiple images per query? Or when we are sending in embeddings (in the future)? Or when factors like these compound together?

@youkaichao (Member) commented:

In the future we will only broadcast CPU data, without sending any GPU data via broadcast. GPU data should live in the process that produces it.

@xwjiang2010 (Contributor) commented:

And this also applies to things like input_ids, right?

@youkaichao (Member) commented:

Yes, in the future input_ids will live in the worker process, and we will just need to append newly generated tokens, without re-constructing all the input_ids at every iteration.
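
A hypothetical sketch of that future design (made-up names, not part of this PR): each worker keeps its own input_ids and only receives the newly sampled token IDs each step.

```python
import torch


class WorkerSequenceState:
    """Per-worker sequence state: input_ids live on the worker itself."""

    def __init__(self, prompt_token_ids, device):
        # The full prompt is materialized once on the worker's device.
        self.input_ids = torch.tensor(prompt_token_ids, device=device)

    def append_new_tokens(self, new_token_ids):
        # Each decode step appends only the freshly generated tokens,
        # instead of re-broadcasting and rebuilding the whole sequence.
        new_ids = torch.tensor(new_token_ids, device=self.input_ids.device)
        self.input_ids = torch.cat([self.input_ids, new_ids])
```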

@DarkLight1337 merged commit f230cc2 into vllm-project:main on Jul 31, 2024
72 checks passed
@DarkLight1337 deleted the distributed-llava-next branch on July 31, 2024 02:38
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request on Aug 17, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
[Bug]: MiniCPM-Llama3-V-2_5 error when tensor_parallel_size>1
3 participants