[Core][VLM] Add precise multi-modal placeholder tracking #8346
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Some initial comments.
Added some new comments.
Let's also test whether chunked prefill works with multimodal data for online serving, since only prompt_token_ids are passed in that case.
    multi_modal_placeholders: NotRequired[
        Optional["MultiModalPlaceholderDict"]]
    """
    Placeholder ranges for the multi-modal data.
    """
Since this field always goes with multi_modal_data, I suggest adding a level of nesting to this data structure like so:
from typing import List, Optional
from typing_extensions import NotRequired, TypedDict

class MultiModalInputs(TypedDict):
    data: "MultiModalDataDict"
    placeholders: "MultiModalPlaceholderDict"

class LLMInputs(TypedDict):
    prompt_token_ids: List[int]
    prompt: NotRequired[Optional[str]]
    multi_modal_inputs: NotRequired[Optional[MultiModalInputs]]
This should avoid the need to perform an extra if check inside each model's input processor.
I'm happy to do that, but I think the signature ends up being

class MultiModalInputs(TypedDict):
    data: "MultiModalDataDict"
    placeholders: Optional["MultiModalPlaceholderDict"]

so the check for placeholders still seems necessary.
That being said, I was previously under the mistaken impression that LLMInputs was part of the external API, and so consumers could explicitly specify the placeholder locations to skip the processor. I see now that that's not really the case, at least not without changing TextPrompt/TokensPrompt. Which raises a few questions:
- Should the placeholder dict be added to the external API as well?
- If so, is it worth making a breaking change to TextPrompt/TokensPrompt?
- If not, should LLMInputs be consistent with the external APIs?

(Additionally, if there's no practical way to explicitly specify the placeholders anyway, I can simply remove the check and have the processors ignore that possibility.)
> so consumers could explicitly specify the placeholder locations to skip the processor
Is there a particular use case for this? I think it'll just end up complicating the code.
Not concretely -- it was more just that, because the processor is LLMInputs -> LLMInputs, it seemed like "don't mess with placeholder annotations if they already exist" would be the most sensible semantics if they were already provided. But I'm happy to either assert that they don't already exist or blindly replace them instead -- let me know.
I see. We can work on this in another PR then, since I also have another PR that refactors the input structure. (#8688)
As a preliminary test, I ran the VLM CI against this PR. Please take a look at the CI failures and fix them.
PTAL at the failing workers test as well.
Please merge from main.
It looks like the format of …
        intersection = range(max(positions.start, placeholder.start),
                             min(positions.stop, placeholder.stop))

        if not intersection:
Is there a real use case where intersection is an empty set?
Yes, a couple of scenarios:
- If prefix caching is enabled (following integration with [Core][VLM] Add support for placeholder token content hashes #8348), we can skip multi-modal handling altogether for any multi-modal items whose corresponding blocks are cached.
- In chunked prefill, if a multi-modal item is in a section of the prompt that isn't currently being prefilled, it can be ignored for that inference (see the sketch below).
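To make the chunked-prefill case concrete, here is a minimal sketch; the numbers and variable names are illustrative, not vLLM's actual API:

# A 100-token prompt whose multi-modal placeholder occupies positions
# 10..41, prefilled in chunks of 32 tokens.
placeholder = range(10, 42)

for chunk_start in range(0, 100, 32):
    positions = range(chunk_start, min(chunk_start + 32, 100))
    intersection = range(max(positions.start, placeholder.start),
                         min(positions.stop, placeholder.stop))
    # An empty range is falsy, so `if not intersection:` skips the
    # multi-modal item for chunks covering no placeholder tokens.
    print(f"chunk {positions}: overlaps placeholder = {bool(intersection)}")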
@@ -466,6 +484,7 @@ def build(self, seq_lens: List[int], query_lens: List[int],
             num_prefill_tokens=self.num_prefill_tokens,
             num_decode_tokens=num_decode_tokens,
             seq_lens=seq_lens,
+            multi_modal_placeholder_maps=placeholder_maps,
Which part of the code uses this added parameter? Ditto for all the other attention changes.
Any model that uses merge_multimodal_embeddings_from_map -- this change only adds it for Ultravox, but that's really only to limit scope. In time I'd expect any multi-modal model that merges multi-modal and text embeddings to use merge_multimodal_embeddings_from_map instead of merge_multimodal_embeddings (barring any tradeoffs I might be neglecting).
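For context, a rough sketch of the difference between the two approaches; merge_by_token_id and merge_from_map are illustrative stand-ins with simplified signatures, not the exact vLLM implementations:

import torch

def merge_by_token_id(input_ids: torch.Tensor, inputs_embeds: torch.Tensor,
                      mm_embeds: torch.Tensor, placeholder_token_id: int) -> None:
    # Status quo (merge_multimodal_embeddings): match on a placeholder
    # *token ID* and assume the i-th placeholder position lines up with
    # the i-th multi-modal embedding row.
    mask = input_ids == placeholder_token_id
    inputs_embeds[mask] = mm_embeds.to(inputs_embeds.dtype)

def merge_from_map(inputs_embeds: torch.Tensor, mm_embeds: torch.Tensor,
                   dest_indices: torch.Tensor, src_indices: torch.Tensor) -> None:
    # With precise tracking (merge_multimodal_embeddings_from_map): the
    # placeholder map carried in the attention metadata supplies explicit
    # destination/source indices, so only the placeholder positions covered
    # by the current chunk are written and token IDs are never consulted.
    inputs_embeds[dest_indices] = mm_embeds[src_indices].to(inputs_embeds.dtype)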
-        inputs_embeds = merge_multimodal_embeddings(
-            input_ids, inputs_embeds, audio_embeddings,
-            _AUDIO_PLACEHOLDER_TOKEN)
+        merge_multimodal_embeddings_from_map(
It looks like "merge_multimodal_embeddings_from_map" is used only for ultravox model, where this kind of merging happens for the other models?
See the other comment -- in time I'd expect usage of merge_multimodal_embeddings to be replaced with merge_multimodal_embeddings_from_map.
@petersalas did you have a chance to do a performance comparison of non-chunked vs. chunked prompt execution with your changes?
Not rigorously -- I did some ad-hoc runs. Edit: here's the ad-hoc test I ran, comparing chunked prefill on vs. chunked prefill off: …
Thanks for the PR. Left a few nits but overall LGTM.
@@ -105,6 +107,11 @@ class AttentionMetadata:
     # in block 0, and 1st slot in block 1, respectively.
     slot_mapping: torch.Tensor

+    # The index maps that relate multi-modal embeddings to the corresponding
+    # placeholders.
+    multi_modal_placeholder_maps: Optional[Dict[
Would it make sense to give multi_modal_placeholder_maps a default value, so that in non-multi-modal scenarios it need not be specified?
I wish I could -- unfortunately, because this is the base type for the other AttentionMetadata types, doing so would require that all fields in all derived types also have default values.
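A minimal illustration of that constraint (standard Python dataclass behavior, independent of vLLM):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Base:
    slot_mapping: int
    placeholder_maps: Optional[dict] = None  # defaulted field in the base

@dataclass
class Derived(Base):  # raises TypeError at class-definition time:
    seq_lens: list    # non-default argument 'seq_lens' follows default argument

Because inherited fields come first in the generated __init__, any non-default field added by a subclass would follow the defaulted base field, which dataclasses reject.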
Currently, multi-modal prompt placeholders are related to the multi-modal embeddings exclusively by index, i.e. the first placeholder token in the prompt must correspond to the first MM embedding vector, etc. This adds a mechanism for tracking multi-modal placeholder ranges precisely which allows multi-modal models to be used with chunked prefill and is a prerequisite for allowing multi-modal models to be used with prefix caching enabled (see #8348).
For a model to use precise tracking, it:
- returns the placeholder ranges from its input processor, and
- uses the placeholder index maps in AttentionMetadata to merge the embeddings instead of matching on token ID (see the sketch below).
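Putting the two requirements together, a hypothetical sketch of the integration points; the names follow the snippets quoted above, and the exact types in the PR may differ:

from typing import TypedDict

class PlaceholderRange(TypedDict):
    offset: int  # index of the first placeholder token in the prompt
    length: int  # number of placeholder tokens for this multi-modal item

def input_processor(llm_inputs: dict) -> dict:
    # 1. The input processor annotates the processed inputs with the exact
    #    ranges occupied by placeholder tokens, keyed by modality.
    llm_inputs["multi_modal_placeholders"] = {
        "audio": [PlaceholderRange(offset=5, length=128)],
    }
    return llm_inputs

# 2. At forward time, the model merges embeddings via the placeholder maps
#    carried in AttentionMetadata (e.g. merge_multimodal_embeddings_from_map)
#    instead of scanning input_ids for a placeholder token ID.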