[Core][VLM] Add precise multi-modal placeholder tracking #8346

Open · wants to merge 27 commits into main

Conversation

@petersalas (Contributor) commented Sep 10, 2024

Currently, multi-modal prompt placeholders are related to the multi-modal embeddings exclusively by index, i.e. the first placeholder token in the prompt must correspond to the first MM embedding vector, etc. This adds a mechanism for tracking multi-modal placeholder ranges precisely which allows multi-modal models to be used with chunked prefill and is a prerequisite for allowing multi-modal models to be used with prefix caching enabled (see #8348).

For a model to use precise tracking, it:

  • Sets the placeholder ranges on the LLM inputs (these should align 1-1 with the multi-modal items for the matching key)
  • Uses the computed placeholder/embedding index tensors from the AttentionMetadata to merge the embeddings instead of matching on token ID (a rough sketch follows this list).
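For illustration, here is a minimal sketch of what the placeholder ranges and a derived index tensor could look like; the names (PlaceholderRange, MultiModalPlaceholderDict, ranges_to_index_tensor) are illustrative assumptions rather than the exact vLLM API:

from typing import Dict, List, TypedDict

import torch


class PlaceholderRange(TypedDict):
    offset: int  # position of the first placeholder token in the prompt
    length: int  # number of placeholder tokens for this multi-modal item


# One list of ranges per modality key, aligned 1-1 with the multi-modal items.
MultiModalPlaceholderDict = Dict[str, List[PlaceholderRange]]

example_placeholders: MultiModalPlaceholderDict = {
    "audio": [{"offset": 4, "length": 160}],
}


def ranges_to_index_tensor(ranges: List[PlaceholderRange]) -> torch.Tensor:
    # Flatten the ranges into the token positions they cover; an index tensor
    # like this lets a model scatter multi-modal embedding rows directly into
    # the input embeddings without searching for a placeholder token ID.
    positions = [
        pos
        for r in ranges
        for pos in range(r["offset"], r["offset"] + r["length"])
    ]
    return torch.tensor(positions, dtype=torch.long)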


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 (Member) left a comment:

Some initial comments.

Resolved review threads (outdated): tests/multimodal/test_utils.py, vllm/model_executor/models/clip.py, vllm/model_executor/models/ultravox.py, vllm/multimodal/image.py
@DarkLight1337 (Member) left a comment:

Added some new comments.

Let's also test whether chunked prefill works with multimodal data for online serving since only prompt_token_ids are passed in that case.

Comment on lines +125 to +129:

    multi_modal_placeholders: NotRequired[
        Optional["MultiModalPlaceholderDict"]]
    """
    Placeholder ranges for the multi-modal data.
    """
Member:

Since this field always goes with multi_modal_data, I suggest adding a level of nesting to this data structure like so:

class MultiModalInputs(TypedDict):
    data: "MultiModalDataDict"
    placeholders: "MultiModalPlaceholderDict"

class LLMInputs(TypedDict):
    prompt_token_ids: List[int]
    prompt: NotRequired[Optional[str]]
    multi_modal_inputs: NotRequired[Optional[MultiModalInputs]]

This should avoid the need to perform an extra if check inside each model's input processor.

Contributor Author:

I'm happy to do that but I think the signature ends up being

class MultiModalInputs(TypedDict):
    data: "MultiModalDataDict"
    placeholders: Optional["MultiModalPlaceholderDict"]

so the check for placeholders still seems necessary.

That being said, I was previously under the mistaken impression that LLMInputs was part of the external API, so consumers could explicitly specify the placeholder locations to skip the processor. I see now that that's not really the case, at least not without changing TextPrompt/TokensPrompt, which raises a few questions:

  • Should the placeholder dict be added to the external API as well?
  • If so, is it worth making a breaking change to TextPrompt/TokensPrompt?
  • If not, should LLMInputs be consistent with the external APIs?

(Additionally, if there's no practical way to explicitly specify the placeholders anyway I can simply remove the check and have the processors ignore that possibility.)

Member:

> so consumers could explicitly specify the placeholder locations to skip the processor

Is there a particular use case for this? I think it'll just end up complicating the code.

Contributor Author:

Not concretely -- it was more that, because the processor is LLMInputs -> LLMInputs, "don't touch placeholder annotations if they already exist" seemed like the most sensible semantics.

But I'm happy to either assert that they don't already exist or blindly replace them instead -- let me know.

Member:

I see. We can work on this in another PR then, since I also have another PR that refactors the input structure. (#8688)

Resolved review threads (outdated): vllm/inputs/registry.py, vllm/sequence.py
@DarkLight1337 (Member) commented Sep 24, 2024

As a preliminary test, I ran the VLM CI against this PR. Please take a look at the CI failures and fix them.

@DarkLight1337 (Member):
PTAL at the failing workers test as well.

@DarkLight1337 (Member):
Please merge from main again to resolve the test failures.

@DarkLight1337 (Member) commented Sep 29, 2024

It looks like this change makes the multi_modal_data passed to the input mapper always use the multi-input form (i.e. a list of single-input instances), which breaks some of the models that don't allow multi-input.
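For clarity, a small illustration of the format difference being described (the value is a stand-in, not real vLLM data):

image = object()                       # stand-in for a loaded image
single_input_form = {"image": image}   # what single-input models expect
multi_input_form = {"image": [image]}  # always-a-list form the input mapper now receives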

Resolved review threads: vllm/multimodal/base.py (two threads)
intersection = range(max(positions.start, placeholder.start),
                     min(positions.stop, placeholder.stop))

if not intersection:
Collaborator:

Is there a real use case where intersection is an empty set?

Contributor Author:

Yes, a couple of scenarios (a small sketch of the empty-intersection case follows this list):

  • If prefix caching is enabled (following integration with [Core][VLM] Add support for placeholder token content hashes #8348), we can skip multi-modal handling altogether for any multi-modal items whose corresponding blocks are cached.
  • In chunked prefill, if a multi-modal item is in a section of the prompt that isn't currently being prefilled, it can be ignored for that inference.
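As a hedged illustration of the chunked-prefill case (the numbers are made up), the intersection computed in the snippet above is empty whenever the placeholder lies entirely outside the chunk currently being prefilled:

placeholder = range(40, 56)  # a 16-token multi-modal placeholder in the prompt
positions = range(0, 32)     # the slice of the prompt being prefilled this step

intersection = range(max(positions.start, placeholder.start),
                     min(positions.stop, placeholder.stop))

if not intersection:
    # range(40, 32) is empty, so this item can be skipped for this step
    print("placeholder not covered by this chunk")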

@@ -466,6 +484,7 @@ def build(self, seq_lens: List[int], query_lens: List[int],
             num_prefill_tokens=self.num_prefill_tokens,
             num_decode_tokens=num_decode_tokens,
             seq_lens=seq_lens,
+            multi_modal_placeholder_maps=placeholder_maps,
Collaborator:

Which part of the code uses this added parameter? Ditto for all the other attention changes.

Contributor Author:

Any model that uses merge_multimodal_embeddings_from_map -- this change only adds it for Ultravox but that's really only to limit scope. In time I'd expect any multi-modal model that merges multi-modal and text embeddings to use merge_multimodal_embeddings_from_map instead of merge_multimodal_embeddings (barring any tradeoffs I might be neglecting).
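To make the distinction concrete, here is a rough sketch of the two merge strategies with simplified signatures; these are assumptions about the shape of the logic, not the exact vLLM functions.

import torch


def merge_by_token_id(input_ids: torch.Tensor, inputs_embeds: torch.Tensor,
                      mm_embeds: torch.Tensor,
                      placeholder_id: int) -> torch.Tensor:
    # Locate placeholders by token ID at forward time; this assumes every
    # placeholder token for the item is present in the current forward pass.
    mask = input_ids == placeholder_id
    inputs_embeds[mask] = mm_embeds.reshape(-1, mm_embeds.shape[-1])
    return inputs_embeds


def merge_from_index_map(inputs_embeds: torch.Tensor, mm_embeds: torch.Tensor,
                         dest_indices: torch.Tensor,
                         src_indices: torch.Tensor) -> torch.Tensor:
    # Scatter embedding rows using the precomputed index tensors, so only the
    # rows that fall inside the current chunk need to be merged.
    flat = mm_embeds.reshape(-1, mm_embeds.shape[-1])
    inputs_embeds[dest_indices] = flat[src_indices]
    return inputs_embeds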

-        inputs_embeds = merge_multimodal_embeddings(
-            input_ids, inputs_embeds, audio_embeddings,
-            _AUDIO_PLACEHOLDER_TOKEN)
+        merge_multimodal_embeddings_from_map(
Collaborator:

It looks like "merge_multimodal_embeddings_from_map" is used only for ultravox model, where this kind of merging happens for the other models?

Contributor Author:

See the other comment -- in time I'd expect usage of merge_multimodal_embeddings to be replaced with merge_multimodal_embeddings_from_map.

@alexm-neuralmagic (Collaborator) left a comment:

@petersalas did you have a chance to do a performance comparison of non-chunked vs chunked prompt execution with your changes?

@petersalas (Contributor Author) commented Oct 10, 2024

> @petersalas did you have a chance to do a performance comparison of non-chunked vs chunked prompt execution with your changes?

Not rigorously -- I did some ad-hoc runs of offline_inference_audio_language.py at various batch sizes as a smoke test (and came to the conclusion that throughput was slightly worse with chunked prefill than without). Is there a particular setup I should benchmark?

edit: here's the ad-hoc test I ran:

chunked prefill on

python examples/offline_inference_audio_language.py --num-prompts 100
Processed prompts: 100%|█| 100/100 [00:06<00:00, 14.87it/s, est. speed input: 2156.88 toks/s, output: 758.77 toks/s]

chunked prefill off

python examples/offline_inference_audio_language.py --num-prompts 100
Processed prompts: 100%|█| 100/100 [00:06<00:00, 15.38it/s, est. speed input: 2230.79 toks/s, output: 813.09 toks/s]

@afeldman-nm (Contributor) left a comment:

Thanks for the PR. Left a few nits but overall LGTM.

Resolved review thread (outdated): vllm/multimodal/base.py
@@ -105,6 +107,11 @@ class AttentionMetadata:
     # in block 0, and 1st slot in block 1, respectively.
     slot_mapping: torch.Tensor

+    # The index maps that relate multi-modal embeddings to the corresponding
+    # placeholders.
+    multi_modal_placeholder_maps: Optional[Dict[
Contributor:

Would it make sense to give multi_modal_placeholder_maps a default value, so that it need not be specified in non-multi-modal scenarios?

Contributor Author:

I wish I could -- unfortunately because this is the base type for the other AttentionMetadata types, doing so would require that all fields in all derived types also have default values.
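For anyone unfamiliar with that constraint, here is a quick generic illustration (plain dataclasses, not the actual vLLM classes) of why a default on a base-class field forces defaults onto every derived field:

from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseMetadata:
    slot_mapping: list
    placeholder_maps: Optional[dict] = None  # default on a base-class field...


try:
    @dataclass
    class DerivedMetadata(BaseMetadata):
        seq_lens: list  # ...so this non-default derived field is rejected
except TypeError as err:
    print(err)  # non-default argument 'seq_lens' follows default argument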

Resolved review thread (outdated): vllm/multimodal/base.py