[Core][VLM] Add precise multi-modal placeholder tracking #8346
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Some initial comments.
Added some new comments.
Let's also test whether chunked prefill works with multimodal data for online serving, since only prompt_token_ids are passed in that case.
    multi_modal_placeholders: NotRequired[
        Optional["MultiModalPlaceholderDict"]]
    """
    Placeholder ranges for the multi-modal data.
    """
Since this field always goes with multi_modal_data, I suggest adding a level of nesting to this data structure like so:
from typing import List, Optional
from typing_extensions import NotRequired, TypedDict

class MultiModalInputs(TypedDict):
    data: "MultiModalDataDict"
    placeholders: "MultiModalPlaceholderDict"

class LLMInputs(TypedDict):
    prompt_token_ids: List[int]
    prompt: NotRequired[Optional[str]]
    multi_modal_inputs: NotRequired[Optional[MultiModalInputs]]
This should avoid the need to perform an extra if check inside each model's input processor.
I'm happy to do that, but I think the signature ends up being

class MultiModalInputs(TypedDict):
    data: "MultiModalDataDict"
    placeholders: Optional["MultiModalPlaceholderDict"]

so the check for placeholders still seems necessary.
That being said, I was previously under the mistaken impression that LLMInputs was part of the external API, and so consumers could explicitly specify the placeholder locations to skip the processor. I see now that that's not really the case, at least not without changing TextPrompt/TokensPrompt. Which raises a few questions:
- Should the placeholder dict be added to the external API as well?
- If so, is it worth making a breaking change to TextPrompt/TokensPrompt?
- If not, should LLMInputs be consistent with the external APIs?

(Additionally, if there's no practical way to explicitly specify the placeholders anyway, I can simply remove the check and have the processors ignore that possibility.)
> so consumers could explicitly specify the placeholder locations to skip the processor
Is there a particular use case for this? I think it'll just end up complicating the code.
Not concretely -- it was more just that, because the processor is LLMInputs -> LLMInputs, it seemed like "don't mess with placeholder annotations if they already exist" would be the most sensible semantics if they were already provided. But I'm happy to either assert that they don't already exist or blindly replace them instead -- let me know.
I see. We can work on this in another PR then, since I also have another PR that refactors the input structure. (#8688)
As a preliminary test, I ran the VLM CI against this PR. Please take a look at the CI failures and fix them.
PTAL at the failing workers test as well.
Please merge from main.
It looks like the format of …
        intersection = range(max(positions.start, placeholder.start),
                             min(positions.stop, placeholder.stop))

        if not intersection:
Is there a real use case where intersection is an empty set?
Yes, a couple of scenarios:
- If prefix caching is enabled (following integration with [Core][VLM] Add support for placeholder token content hashes #8348), we can skip multi-modal handling altogether for any multi-modal items whose corresponding blocks are cached.
- In chunked prefill, if a multi-modal item is in a section of the prompt that isn't currently being prefilled, it can be ignored for that inference (see the sketch below).
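To make the chunked-prefill case concrete, here is a minimal sketch; the numbers and variable names are illustrative, not vLLM's actual API:

# A 100-token prompt whose multi-modal placeholder occupies positions
# 10..41, prefilled in chunks of 32 tokens.
placeholder = range(10, 42)

for chunk_start in range(0, 100, 32):
    positions = range(chunk_start, min(chunk_start + 32, 100))
    intersection = range(max(positions.start, placeholder.start),
                         min(positions.stop, placeholder.stop))
    # An empty range is falsy, so `if not intersection:` skips the
    # multi-modal item for chunks covering no placeholder tokens.
    print(f"chunk {positions}: overlaps placeholder = {bool(intersection)}")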
@@ -466,6 +484,7 @@ def build(self, seq_lens: List[int], query_lens: List[int],
             num_prefill_tokens=self.num_prefill_tokens,
             num_decode_tokens=num_decode_tokens,
             seq_lens=seq_lens,
+            multi_modal_placeholder_maps=placeholder_maps,
Which part of the code uses this added parameter? Ditto for all the other attention changes.
Any model that uses merge_multimodal_embeddings_from_map -- this change only adds it for Ultravox, but that's really only to limit scope. In time I'd expect any multi-modal model that merges multi-modal and text embeddings to use merge_multimodal_embeddings_from_map instead of merge_multimodal_embeddings (barring any tradeoffs I might be neglecting).
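For context, a rough sketch of the difference between the two approaches; merge_by_token_id and merge_from_map are illustrative stand-ins with simplified signatures, not the exact vLLM implementations:

import torch

def merge_by_token_id(input_ids: torch.Tensor, inputs_embeds: torch.Tensor,
                      mm_embeds: torch.Tensor, placeholder_token_id: int) -> None:
    # Status quo (merge_multimodal_embeddings): match on a placeholder
    # *token ID* and assume the i-th placeholder position lines up with
    # the i-th multi-modal embedding row.
    mask = input_ids == placeholder_token_id
    inputs_embeds[mask] = mm_embeds.to(inputs_embeds.dtype)

def merge_from_map(inputs_embeds: torch.Tensor, mm_embeds: torch.Tensor,
                   dest_indices: torch.Tensor, src_indices: torch.Tensor) -> None:
    # With precise tracking (merge_multimodal_embeddings_from_map): the
    # placeholder map carried in the attention metadata supplies explicit
    # destination/source indices, so only the placeholder positions covered
    # by the current chunk are written and token IDs are never consulted.
    inputs_embeds[dest_indices] = mm_embeds[src_indices].to(inputs_embeds.dtype)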
-        inputs_embeds = merge_multimodal_embeddings(
-            input_ids, inputs_embeds, audio_embeddings,
-            _AUDIO_PLACEHOLDER_TOKEN)
+        merge_multimodal_embeddings_from_map(
It looks like "merge_multimodal_embeddings_from_map" is used only for ultravox model, where this kind of merging happens for the other models?
See the other comment -- in time I'd expect usage of merge_multimodal_embeddings to be replaced with merge_multimodal_embeddings_from_map.
@petersalas did you have a chance to do a performance comparison of non-chunked vs. chunked prompt execution with your changes?
Not rigorously -- I did some ad-hoc runs. Edit: here's the ad-hoc test I ran, comparing chunked prefill on vs. chunked prefill off: …
Thanks for the PR. Left a few nits but overall LGTM.
@@ -105,6 +107,11 @@ class AttentionMetadata:
     # in block 0, and 1st slot in block 1, respectively.
     slot_mapping: torch.Tensor

+    # The index maps that relate multi-modal embeddings to the corresponding
+    # placeholders.
+    multi_modal_placeholder_maps: Optional[Dict[
Would it make sense to give multi_modal_placeholder_maps a default value, so that in non-multi-modal scenarios it need not be specified?
I wish I could -- unfortunately, because this is the base type for the other AttentionMetadata types, doing so would require that all fields in all derived types also have default values.
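A minimal illustration of that constraint (standard Python dataclass behavior, independent of vLLM):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Base:
    slot_mapping: int
    placeholder_maps: Optional[dict] = None  # defaulted field in the base

@dataclass
class Derived(Base):  # raises TypeError at class-definition time:
    seq_lens: list    # non-default argument 'seq_lens' follows default argument

Because inherited fields come first in the generated __init__, any non-default field added by a subclass would follow the defaulted base field, which dataclasses reject.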
Currently, multi-modal prompt placeholders are related to the multi-modal embeddings exclusively by index, i.e. the first placeholder token in the prompt must correspond to the first MM embedding vector, etc. This adds a mechanism for tracking multi-modal placeholder ranges precisely which allows multi-modal models to be used with chunked prefill and is a prerequisite for allowing multi-modal models to be used with prefix caching enabled (see #8348).
For a model to use precise tracking, it:
- returns the placeholder ranges from its input processor, and
- uses the placeholder index maps in AttentionMetadata to merge the embeddings instead of matching on token ID (see the sketch below).
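Putting the two requirements together, a hypothetical sketch of the integration points; the names follow the snippets quoted above, and the exact types in the PR may differ:

from typing import TypedDict

class PlaceholderRange(TypedDict):
    offset: int  # index of the first placeholder token in the prompt
    length: int  # number of placeholder tokens for this multi-modal item

def input_processor(llm_inputs: dict) -> dict:
    # 1. The input processor annotates the processed inputs with the exact
    #    ranges occupied by placeholder tokens, keyed by modality.
    llm_inputs["multi_modal_placeholders"] = {
        "audio": [PlaceholderRange(offset=5, length=128)],
    }
    return llm_inputs

# 2. At forward time, the model merges embeddings via the placeholder maps
#    carried in AttentionMetadata (e.g. merge_multimodal_embeddings_from_map)
#    instead of scanning input_ids for a placeholder token ID.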