[Core] Dynamic image size support for VLMs #5276

Merged: 242 commits on Jul 3, 2024
Commits
34bfa79
Introduce a higher level `INPUT_REGISTRY`
DarkLight1337 Jun 3, 2024
df2aa19
Move dummy data generation to input registry
DarkLight1337 Jun 3, 2024
c72d2b3
Update docs
DarkLight1337 Jun 3, 2024
d8c6488
Rename `process_input` to `map_input`
DarkLight1337 Jun 3, 2024
f18de48
Reorder arguments
DarkLight1337 Jun 3, 2024
653537d
Apply input processor
DarkLight1337 Jun 3, 2024
a2f5a3c
Remove `VisionLanguageConfig` from input mapper
DarkLight1337 Jun 3, 2024
378ad80
Fix bad use of `functools.partial`
DarkLight1337 Jun 3, 2024
7aa3778
Use default input processor
DarkLight1337 Jun 3, 2024
c774168
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 4, 2024
532f863
Fix wrong arguments
DarkLight1337 Jun 4, 2024
080d40c
Use pillow image instead of tensor to avoid bypassing the processor b…
DarkLight1337 Jun 5, 2024
662693a
Update interface of dummy data factory and input processor
DarkLight1337 Jun 5, 2024
9bc5fcc
Use `InputContext` to handle checked type cast of config types
DarkLight1337 Jun 5, 2024
911cac7
Add input processor for injecting image tokens; fix docs
DarkLight1337 Jun 5, 2024
a38b347
Add new documentation pages
DarkLight1337 Jun 5, 2024
29c3bb3
Fix LLaVA-NeXT input processor and cleanup code
DarkLight1337 Jun 5, 2024
9cfbcce
Fix LLaVA-NeXT input processor and cleanup code
DarkLight1337 Jun 5, 2024
7bb6cbf
Add sanity check
DarkLight1337 Jun 6, 2024
ccf49c4
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 6, 2024
3482d32
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 6, 2024
8ea8468
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 8, 2024
be3d64f
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 8, 2024
2ff5be6
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 10, 2024
8e2ff86
Update LLaVA-NeXT
DarkLight1337 Jun 11, 2024
553f684
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 11, 2024
b134dfc
Update name
DarkLight1337 Jun 11, 2024
1efa480
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 11, 2024
1a08444
Update LLaVA-NeXT
DarkLight1337 Jun 11, 2024
7e33706
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 11, 2024
cfc31fd
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 11, 2024
3fb622c
Remove `MULTIMODAL` convenience property as it was causing some (impo…
DarkLight1337 Jun 11, 2024
da85ab2
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 11, 2024
383bea1
Update docs
DarkLight1337 Jun 11, 2024
80a09f2
Remove double processing of image tokens
DarkLight1337 Jun 12, 2024
6a70e4f
Add docs
DarkLight1337 Jun 12, 2024
8322ecb
Add docs
DarkLight1337 Jun 12, 2024
52a0116
Add docs
DarkLight1337 Jun 12, 2024
c1733dd
Add docs
DarkLight1337 Jun 12, 2024
b7a8683
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 12, 2024
9fb5e72
Remove more instances of double processing; update docs
DarkLight1337 Jun 13, 2024
25f9949
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 13, 2024
03c7e65
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 13, 2024
3932b3f
Remove xfail
DarkLight1337 Jun 13, 2024
7fa877a
Fix missing image token in OpenAI API serving
DarkLight1337 Jun 13, 2024
092e550
Fix LLaVA-NeXT test
DarkLight1337 Jun 14, 2024
7a19862
Remove duplicate processing in async engine
DarkLight1337 Jun 14, 2024
fd7d954
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 15, 2024
49dac3e
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 15, 2024
b2c6832
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 15, 2024
0104218
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 18, 2024
18cc7e0
Set up dummy data factory for phi3v
DarkLight1337 Jun 18, 2024
2291617
Move dummy data factories to model files
DarkLight1337 Jun 18, 2024
adf5503
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 18, 2024
e5a94e4
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 18, 2024
9b0386d
Move input processors to model files
DarkLight1337 Jun 18, 2024
4e656e7
Set up input processor for phi3v
DarkLight1337 Jun 18, 2024
fecf1f0
Fix wrong feature size
DarkLight1337 Jun 18, 2024
086e0fe
Fix wrong feature size
DarkLight1337 Jun 18, 2024
8c26a18
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 19, 2024
81522fe
Fix wrong feature size
DarkLight1337 Jun 19, 2024
c036b86
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 24, 2024
f75e1ab
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 24, 2024
b24e8d9
Update validation
DarkLight1337 Jun 24, 2024
8569d35
Fix image feature calculation for phi3v
DarkLight1337 Jun 24, 2024
bfa5aa9
Remove redundant code
DarkLight1337 Jun 24, 2024
dc34121
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 24, 2024
07e695d
Apply isort
DarkLight1337 Jun 24, 2024
8a43a77
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 24, 2024
825401d
Apply yapf
DarkLight1337 Jun 24, 2024
4a0d4d1
Reduce `max_tokens` so that test still passes
DarkLight1337 Jun 25, 2024
8d22fe0
Fix vllm to hf output (+ rename)
DarkLight1337 Jun 25, 2024
2e1ee2f
Fix wrong arguments
DarkLight1337 Jun 25, 2024
7229b07
Move `DummyImageDataFactories` into CLIP model file
DarkLight1337 Jun 25, 2024
17800fd
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 25, 2024
50f994b
Move `input_processor_for_clip` into CLIP
DarkLight1337 Jun 25, 2024
838aa9b
Remove some magic numbers
DarkLight1337 Jun 25, 2024
e7a5564
Test multiscale inputs for LLaVA-NeXT
DarkLight1337 Jun 25, 2024
36e8001
Handle multiscale inputs (different number of patches per batch) in L…
DarkLight1337 Jun 25, 2024
39e6d42
Fix wrong feature size
DarkLight1337 Jun 26, 2024
0d7f18f
Apply formatter
DarkLight1337 Jun 26, 2024
8e5dc7c
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
d9a4150
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 26, 2024
6849236
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
6d02491
Revert max_tokens
DarkLight1337 Jun 26, 2024
76ddea4
Add more tests for input mapper
DarkLight1337 Jun 26, 2024
4b20e66
Sanity check: Also test multiscale inputs for LLaVA-1.5
DarkLight1337 Jun 26, 2024
784af1a
Do not auto-convert image dtype to model's dtype
DarkLight1337 Jun 26, 2024
8e5fb12
Update prompts
DarkLight1337 Jun 26, 2024
4b947ad
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 26, 2024
e7397ee
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
865be7a
Fix mapper tests w.r.t. dtype change
DarkLight1337 Jun 26, 2024
9e82a26
Clarify docs and add todo
DarkLight1337 Jun 26, 2024
46391de
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
a4733f9
Remove TODO since vision config will be removed soon
DarkLight1337 Jun 26, 2024
6b19e6c
Expand docs
DarkLight1337 Jun 26, 2024
be326f2
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
f451668
Add ref
DarkLight1337 Jun 26, 2024
5c0c8cf
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
3d7b795
Update docs
DarkLight1337 Jun 26, 2024
1abb8a7
Add docs
DarkLight1337 Jun 26, 2024
428d420
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
698830f
Fix name
DarkLight1337 Jun 26, 2024
ac9ea9a
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
334b1a9
Add `MultiModalInputs` to docs
DarkLight1337 Jun 26, 2024
36ab12d
Fix and add links
DarkLight1337 Jun 26, 2024
af01e97
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
c303421
Fix `is_multiscale` not provided anymore
DarkLight1337 Jun 26, 2024
0a0c0e3
Also test multiscale input for phi3v
DarkLight1337 Jun 26, 2024
60517a7
Revert max_tokens for phi3v as numerical error still persists
DarkLight1337 Jun 26, 2024
57df434
Improve error message
DarkLight1337 Jun 26, 2024
ffe0675
Log the full output for easier reference
DarkLight1337 Jun 26, 2024
4f7b210
[VLM] Remove support for pixel_values and image_features.
xwjiang2010 Jun 25, 2024
c7a2a66
Update xfail to be more efficient
DarkLight1337 Jun 26, 2024
598e0e3
Also xfail llava test
DarkLight1337 Jun 26, 2024
174ca90
address comments
xwjiang2010 Jun 26, 2024
5b3e9aa
remove image_input_type altogether.
xwjiang2010 Jun 26, 2024
b7acf3a
types
xwjiang2010 Jun 26, 2024
f22b219
format
xwjiang2010 Jun 26, 2024
f84d87a
Update comment
DarkLight1337 Jun 27, 2024
5dfb6fc
Update docs
DarkLight1337 Jun 27, 2024
bbeff03
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
bf3281c
modify llava_next
ywang96 Jun 27, 2024
56e2d3b
Update comment
DarkLight1337 Jun 27, 2024
d2f8c6d
Update docs
DarkLight1337 Jun 27, 2024
7c197d2
Use dynamic image feature size calculation
DarkLight1337 Jun 27, 2024
f5ffd3e
Fix phi3v not handling `image_sizes` correctly
DarkLight1337 Jun 27, 2024
66aad21
Apply formatter
DarkLight1337 Jun 27, 2024
d1c68c0
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
5f32d53
Add see also
DarkLight1337 Jun 27, 2024
15df4ef
Update examples prompt format
DarkLight1337 Jun 27, 2024
f2e4633
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 27, 2024
095e008
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
a6e3162
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 27, 2024
28922af
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
ce06541
Fix config
DarkLight1337 Jun 27, 2024
cdcc2d4
Fix config
DarkLight1337 Jun 27, 2024
4212abf
Update docs
DarkLight1337 Jun 27, 2024
07c08e3
Update docs
DarkLight1337 Jun 27, 2024
f3f5854
Fix `MultiModalInputs` not working in Python 3.8
DarkLight1337 Jun 27, 2024
bebf9e7
Fix `_ImageAssets` not working in Python 3.8
DarkLight1337 Jun 27, 2024
7e80ecc
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 28, 2024
487d742
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 28, 2024
36f72b6
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 28, 2024
43350b8
update example
ywang96 Jun 28, 2024
57791de
update doc
ywang96 Jun 28, 2024
b2b1e11
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 28, 2024
fbc5f70
Update docs
DarkLight1337 Jun 28, 2024
4292ccb
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 28, 2024
5d23a96
Apply formatter
DarkLight1337 Jun 28, 2024
78064e0
Fix OpenAI server not working for phi3v
DarkLight1337 Jun 28, 2024
4cb809c
Preemptively handle upcoming models
DarkLight1337 Jun 28, 2024
754e238
Add more models
DarkLight1337 Jun 28, 2024
9edb53c
Update feature size for dummy data
DarkLight1337 Jun 28, 2024
91d6c1e
Merge branch 'main' of https://github.com/vllm-project/vllm into remo…
xwjiang2010 Jun 28, 2024
f84b793
format
xwjiang2010 Jun 28, 2024
a934663
ExternalMultiModalDataDict
xwjiang2010 Jun 28, 2024
2144d3a
mention schema
xwjiang2010 Jun 28, 2024
2795b16
Use a less strict check
DarkLight1337 Jun 29, 2024
86ffd60
Fix phi3v test
DarkLight1337 Jun 29, 2024
f339dd1
Update default length as the dummy image feature size is increased
DarkLight1337 Jun 29, 2024
59a7a4c
Raise full error if output is completely different
DarkLight1337 Jun 29, 2024
62952e1
Fix phi3v not using input processor
DarkLight1337 Jun 29, 2024
0ce3ecb
Move size factors outside
DarkLight1337 Jun 29, 2024
b43e8c3
Apply formatter
DarkLight1337 Jun 29, 2024
9023794
Fix some outputs not being checked
DarkLight1337 Jun 29, 2024
fc5549c
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 30, 2024
f6c8061
Also test no image
DarkLight1337 Jun 30, 2024
15cc847
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 30, 2024
235c8a9
Batch by size factors
DarkLight1337 Jun 30, 2024
b98d924
Factor out xfail code
DarkLight1337 Jun 30, 2024
2c2558b
Fix unused args
DarkLight1337 Jun 30, 2024
ec28eca
Check logprobs instead of xfailing
DarkLight1337 Jun 30, 2024
5a337f5
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 30, 2024
2eb3490
Fix different scales not being in the same batch
DarkLight1337 Jun 30, 2024
6301a52
Apply suggestions from code review
DarkLight1337 Jun 30, 2024
14f10fc
Add link
DarkLight1337 Jun 30, 2024
7c335c3
Use `self.multi_modal_projector` directly
DarkLight1337 Jun 30, 2024
33c860e
Allow users to send image token formatted prompt directly
DarkLight1337 Jun 30, 2024
e03bc57
Factor out the code for placeholder token IDs
DarkLight1337 Jun 30, 2024
b270ac3
Remove `-rx` flag
DarkLight1337 Jun 30, 2024
3161221
Fix distributed tests
DarkLight1337 Jun 30, 2024
85d108a
Fix string mismatch warning
DarkLight1337 Jun 30, 2024
d648e32
Relax phi3v test; add TODO for llava tests
DarkLight1337 Jun 30, 2024
fde5f26
Fix distributed tests
DarkLight1337 Jun 30, 2024
d432934
address comments
xwjiang2010 Jul 1, 2024
83cfada
Merge branch 'main' of https://github.com/vllm-project/vllm into remo…
xwjiang2010 Jul 1, 2024
ab347bc
format
xwjiang2010 Jul 1, 2024
404700f
rm ctx
xwjiang2010 Jul 1, 2024
6a4014e
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 1, 2024
95a1fc5
Fix distributed test
DarkLight1337 Jul 1, 2024
1e87823
Update docs about prompt formatting
DarkLight1337 Jul 1, 2024
55ab3e4
Remove unused parameter
DarkLight1337 Jul 1, 2024
21da5b8
Remove unused import
DarkLight1337 Jul 1, 2024
525fe8f
Fix distributed test
DarkLight1337 Jul 1, 2024
04ebb68
rm ImageData and MultiModalData
xwjiang2010 Jul 1, 2024
31b8b09
rm external
xwjiang2010 Jul 1, 2024
a4b5617
comments
xwjiang2010 Jul 1, 2024
045674d
fix dist gpu test.
xwjiang2010 Jul 1, 2024
c8fa150
address comments
xwjiang2010 Jul 2, 2024
58ab8e9
Further avoid cuda init
DarkLight1337 Jul 2, 2024
6975caa
Add warnings for repeated image tokens
DarkLight1337 Jul 2, 2024
b1f1813
docs
xwjiang2010 Jul 2, 2024
b8b636d
Update vllm/multimodal/base.py
xwjiang2010 Jul 2, 2024
2c1d291
format
xwjiang2010 Jul 2, 2024
b6401d3
Reword
DarkLight1337 Jul 2, 2024
0f6f64c
Merge branch 'remove_image_features_2' of https://github.com/xwjiang2…
DarkLight1337 Jul 2, 2024
89f1103
Remove useless test
DarkLight1337 Jul 2, 2024
47fbdba
Unify test API between HfRunner and VllmRunner
DarkLight1337 Jul 2, 2024
c1c5a4d
Fix import error
DarkLight1337 Jul 2, 2024
fde4b25
Fix attribute error
DarkLight1337 Jul 2, 2024
4278fed
fix import error
ywang96 Jul 2, 2024
d9a2908
update llava next example
ywang96 Jul 2, 2024
d61e8af
Merge branch 'remove_image_features_2' of https://github.com/xwjiang2…
DarkLight1337 Jul 2, 2024
abd56fc
Update comments
DarkLight1337 Jul 2, 2024
ce2516e
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 2, 2024
38042ab
Remove some unnecessary deferred imports
DarkLight1337 Jul 2, 2024
7a6d895
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 2, 2024
9a49d2c
Use more precise type annotation
DarkLight1337 Jul 2, 2024
ac6f4fa
Fix wrong feature size
DarkLight1337 Jul 2, 2024
3f95778
Fix wrong image
DarkLight1337 Jul 2, 2024
90e80c4
Remove unnecessary lazy import
DarkLight1337 Jul 2, 2024
ea622c7
Check for conflicting kwargs in `map_input`
DarkLight1337 Jul 2, 2024
18740c2
Avoid unnecessary processing
DarkLight1337 Jul 2, 2024
a0db2c7
Update doc
DarkLight1337 Jul 2, 2024
526a871
Avoid cuda init
DarkLight1337 Jul 2, 2024
a5174da
Remove unused logger
DarkLight1337 Jul 2, 2024
6cf34e4
Remove unnecessary deferred imports
DarkLight1337 Jul 2, 2024
feff395
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 2, 2024
aacb5d0
Fix typo
DarkLight1337 Jul 2, 2024
13f43bd
Address comments
DarkLight1337 Jul 2, 2024
00e9e39
Add comment
DarkLight1337 Jul 2, 2024
288bfb9
Merge branch 'main' into mm-image-tokenizer-2
ywang96 Jul 2, 2024
284fca8
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jul 3, 2024
a231eaf
Update XPU runner's multimodal logic
DarkLight1337 Jul 3, 2024
ec74121
Fix unused import
DarkLight1337 Jul 3, 2024
d16d3c8
Fix feature size calculation
DarkLight1337 Jul 3, 2024
aaa0f1f
Add extra image to test
DarkLight1337 Jul 3, 2024
cc540c3
Support multimodal data for neuron and tpu
DarkLight1337 Jul 3, 2024
48489ef
Fix broadcasting
DarkLight1337 Jul 3, 2024
2adc41f
Fix OpenVINO model runner for multimodal data
DarkLight1337 Jul 3, 2024
0e6845f
Cleanup
DarkLight1337 Jul 3, 2024
2 changes: 1 addition & 1 deletion docs/source/dev/input_processing/model_inputs_index.rst
@@ -8,7 +8,7 @@ Input Processing
vLLM provides a mechanism for defining input processors for each model so that the inputs are processed
in :class:`~vllm.LLMEngine` before they are passed to model executors.

Currently, this mechanism is only utilized in **multi-modal models** for preprocessing multi-modal input
Currently, this mechanism is only utilized in :ref:`multi-modal models <multi_modality>` for preprocessing multi-modal input
data in addition to input prompt, but it can be extended to text-only language models when needed.

Guides
124 changes: 124 additions & 0 deletions docs/source/dev/multimodal/adding_multimodal_model.rst
@@ -0,0 +1,124 @@
.. _adding_a_new_multimodal_model:

Adding a New Multimodal Model
=============================

This document provides a high-level guide on integrating a :ref:`multi-modal model <multi_modality>` into vLLM.

.. note::
The complexity of adding a new model depends heavily on the model's architecture.
The process is considerably easier if the model shares a similar architecture with an existing model in vLLM.
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.

.. tip::
If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ repository.
We will be happy to help you out!


1. Set up the base vLLM model
-----------------------------

As usual, follow :ref:`these steps <adding_a_new_model>` to implement the model in vLLM, but note the following:

- You should additionally implement the :class:`~vllm.model_executor.models.interfaces.SupportsVision` interface.

.. code-block:: diff

+ from vllm.model_executor.models.interfaces import SupportsVision

- class YourModelForImage2Seq(nn.Module):
+ class YourModelForImage2Seq(nn.Module, SupportsVision):

.. note::
The model class does not have to be named :code:`*ForCausalLM`.
Check out `the HuggingFace Transformers documentation <https://huggingface.co/docs/transformers/model_doc/auto#multimodal>`__ for some examples.

- While implementing the :meth:`~torch.nn.Module.forward` method, reserve a keyword parameter
for each input tensor that corresponds to a multi-modal input, as shown in the following example:

.. code-block:: diff

def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
kv_caches: List[torch.Tensor],
attn_metadata: AttentionMetadata,
+ pixel_values: torch.Tensor,
) -> SamplerOutput:
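
The snippet above only changes the signature. To see how such a keyword argument is typically consumed, here is a self-contained toy sketch (plain PyTorch, not a real vLLM model): the text tokens are embedded, and the embeddings at the image placeholder positions are overwritten with the projected vision features. In practice the multi-modal keyword arguments are only supplied when the corresponding data is present, so it is convenient to give them a ``None`` default (or accept ``**kwargs``) so that text-only requests still work.

.. code-block:: python

    from typing import Optional

    import torch
    import torch.nn as nn


    class ToyImage2Seq(nn.Module):
        """Toy illustration only; real models use their own vision tower."""

        def __init__(self, vocab_size: int = 128, hidden_size: int = 16,
                     image_token_id: int = 0):
            super().__init__()
            self.image_token_id = image_token_id
            self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
            # Stand-in for a real vision encoder + projector.
            self.vision_proj = nn.Linear(3 * 8 * 8, hidden_size)

        def forward(
            self,
            input_ids: torch.Tensor,                      # (num_tokens,)
            pixel_values: Optional[torch.Tensor] = None,  # (num_images, 3, 8, 8)
        ) -> torch.Tensor:
            inputs_embeds = self.embed_tokens(input_ids)
            if pixel_values is not None:
                image_embeds = self.vision_proj(pixel_values.flatten(1))
                # One placeholder token per image in this toy example.
                mask = input_ids == self.image_token_id
                inputs_embeds[mask] = image_embeds
            return inputs_embeds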


2. Register input mappers
-------------------------

For each modality type to support, decorate the model class with :meth:`MULTIMODAL_REGISTRY.register_input_mapper <vllm.multimodal.MultiModalRegistry.register_input_mapper>`.
This decorator accepts a function that maps multi-modal inputs to the keyword arguments you have previously defined in :meth:`~torch.nn.Module.forward`.

.. code-block:: diff

from vllm.model_executor.models.interfaces import SupportsVision
+ from vllm.multimodal import MULTIMODAL_REGISTRY

+ @MULTIMODAL_REGISTRY.register_image_feature_input_mapper()
+ @MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()
class YourModelForImage2Seq(nn.Module, SupportsVision):

A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.
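
If you do need custom behaviour, the mapper is just a function that turns the raw multi-modal object into the keyword arguments for :meth:`~torch.nn.Module.forward`, and it can be passed to the decorator shown above. Below is a minimal sketch; the ``(ctx, data)`` signature, the import paths, and the ad-hoc preprocessing are assumptions for illustration (a real mapper would normally call the model's HuggingFace image processor):

.. code-block:: python

    import numpy as np
    import torch
    from PIL import Image

    from vllm.inputs.registry import InputContext
    from vllm.multimodal import MultiModalInputs


    def my_pixel_input_mapper(ctx: InputContext,
                              data: object) -> MultiModalInputs:
        assert isinstance(data, Image.Image)
        # Ad-hoc conversion for illustration; use the HF image processor
        # (resizing, normalization, etc.) in a real implementation.
        array = np.asarray(data.convert("RGB"), dtype=np.float32) / 255.0
        pixel_values = torch.from_numpy(array).permute(2, 0, 1)  # (3, H, W)
        return MultiModalInputs({"pixel_values": pixel_values})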

.. seealso::
:ref:`input_processing_pipeline`


3. (Optional) Register dummy data
---------------------------------

During startup, dummy data is passed to the vLLM model to allocate memory. This only consists of text input by default, which may not be applicable to multi-modal models.
In such cases, you can define your own dummy data by registering a factory method via :meth:`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`.

.. code-block:: diff

from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsVision
from vllm.multimodal import MULTIMODAL_REGISTRY

@MULTIMODAL_REGISTRY.register_image_feature_input_mapper()
@MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()
+ @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
class YourModelForImage2Seq(nn.Module, SupportsVision):
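
As a rough illustration, a dummy data factory might look like the sketch below. The ``(ctx, seq_len)`` signature, the ``(SequenceData, multi-modal data)`` return pair, and the concrete numbers are assumptions here; refer to the LLaVA implementations linked below for the authoritative versions.

.. code-block:: python

    from PIL import Image

    from vllm.inputs.registry import InputContext
    from vllm.sequence import SequenceData

    # Assumed, model-specific values used purely for illustration.
    IMAGE_TOKEN_ID = 32000
    MAX_IMAGE_FEATURE_SIZE = 576


    def my_dummy_data_factory(ctx: InputContext, seq_len: int):
        # One placeholder token per image feature, padded up to the profiled
        # sequence length so that the worst case is covered.
        token_ids = [IMAGE_TOKEN_ID] * MAX_IMAGE_FEATURE_SIZE
        token_ids += [0] * max(0, seq_len - MAX_IMAGE_FEATURE_SIZE)
        seq_data = SequenceData(token_ids)

        # The largest image the model should ever receive.
        mm_data = {"image": Image.new("RGB", (336, 336), color=0)}
        return seq_data, mm_data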

Here are some examples:

- Image inputs (static feature size): `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Image inputs (dynamic feature size): `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__

.. seealso::
:ref:`input_processing_pipeline`


4. (Optional) Register input processor
--------------------------------------

Sometimes, there is a need to process inputs at the :class:`~vllm.LLMEngine` level before they are passed to the model executor.
This is often because, unlike the implementations in HuggingFace Transformers, the reshaping and/or expansion of multi-modal embeddings needs to take place outside the model's :meth:`~torch.nn.Module.forward` call.
You can register input processors via :meth:`INPUT_REGISTRY.register_input_processor <vllm.inputs.registry.InputRegistry.register_input_processor>`.

.. code-block:: diff

from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsVision
from vllm.multimodal import MULTIMODAL_REGISTRY

@MULTIMODAL_REGISTRY.register_image_feature_input_mapper()
@MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()
@INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
+ @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
class YourModelForImage2Seq(nn.Module, SupportsVision):

A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
Here are some examples:

- Insert static number of image tokens: `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Insert dynamic number of image tokens: `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__
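
For reference, a sketch of such a processor is shown below: it expands each image placeholder into the number of tokens the model expects, so that the attention mask lines up with the image features. The ``LLMInputs``-in / ``LLMInputs``-out signature and the constants are assumptions for illustration; the LLaVA implementations above are the authoritative examples.

.. code-block:: python

    from vllm.inputs import LLMInputs
    from vllm.inputs.registry import InputContext

    # Assumed, model-specific values used purely for illustration.
    IMAGE_TOKEN_ID = 32000
    IMAGE_FEATURE_SIZE = 576


    def my_input_processor(ctx: InputContext,
                           llm_inputs: LLMInputs) -> LLMInputs:
        multi_modal_data = llm_inputs.get("multi_modal_data")
        if multi_modal_data is None or "image" not in multi_modal_data:
            return llm_inputs  # text-only prompt: nothing to do

        new_token_ids = []
        for token_id in llm_inputs["prompt_token_ids"]:
            if token_id == IMAGE_TOKEN_ID:
                # Reserve one position for every image feature produced by
                # the vision encoder.
                new_token_ids.extend([IMAGE_TOKEN_ID] * IMAGE_FEATURE_SIZE)
            else:
                new_token_ids.append(token_id)

        return LLMInputs(prompt_token_ids=new_token_ids,
                         prompt=llm_inputs.get("prompt"),
                         multi_modal_data=multi_modal_data)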

.. seealso::
:ref:`input_processing_pipeline`
18 changes: 15 additions & 3 deletions docs/source/dev/multimodal/multimodal_index.rst
@@ -1,3 +1,5 @@
.. _multi_modality:

Multi-Modality
==============

@@ -8,12 +10,18 @@ vLLM provides experimental support for multi-modal models through the :mod:`vllm
:class:`vllm.inputs.PromptStrictInputs` accepts an additional attribute ``multi_modal_data``
which allows you to pass in multi-modal input alongside text and token prompts.

By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model,
you must decorate the model class with :meth:`InputRegistry.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`,
as well as :meth:`MULTIMODAL_REGISTRY.register_input_mapper <MultiModalRegistry.register_input_mapper>` for each modality type to support.
By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model, please follow :ref:`the guide for adding a new multimodal model <adding_a_new_multimodal_model>`.

# TODO: Add more instructions on how to do that once embeddings are in.

Guides
++++++

.. toctree::
:maxdepth: 1

adding_multimodal_model

Module Contents
+++++++++++++++

@@ -35,6 +43,10 @@ Base Classes
:members:
:show-inheritance:

.. autoclass:: vllm.multimodal.MultiModalInputs
:members:
:show-inheritance:

.. autoclass:: vllm.multimodal.MultiModalPlugin
:members:
:show-inheritance:
24 changes: 16 additions & 8 deletions docs/source/models/vlm.rst
Expand Up @@ -23,7 +23,6 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
Currently, the support for vision language models on vLLM has the following limitations:

* Only single image input is supported per text prompt.
* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means our LLaVA-NeXT output may not exactly match the huggingface implementation.

We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.

@@ -42,12 +41,17 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
)

.. important::
Currently, you have to specify ``image_feature_size`` to support memory profiling.
To avoid OOM during runtime, you should set this to the maximum value supported by the model.
The calculation of feature size is specific to the model. For more details, please refer to
the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.

We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.


To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:

* ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.

.. note::
@@ -57,8 +61,8 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS

.. code-block:: python

prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:")
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Load the image using PIL.Image
image = ...
@@ -74,8 +78,6 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
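
Putting the pieces together, a minimal end-to-end call might look like the sketch below. It assumes the ``multi_modal_data`` dictionary carries a single image under the ``"image"`` key (per :class:`vllm.multimodal.MultiModalDataDict`) and reuses the vision-specific arguments shown above:

.. code-block:: python

    from PIL import Image

    from vllm import LLM

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
    image = Image.open("images/stop_sign.jpg")

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},  # assumed schema
    })

    for o in outputs:
        print(o.outputs[0].text)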

.. important::
We will remove the need to format image tokens in a future release. Afterwards, the input text will follow the same format as that for the original HuggingFace model.

Online OpenAI Vision API Compatible Inference
----------------------------------------------
@@ -103,6 +105,11 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with
--chat-template template_llava.jinja

.. important::
Currently, you have to specify ``image_feature_size`` to support memory profiling.
To avoid OOM during runtime, you should set this to the maximum value supported by the model.
The calculation of feature size is specific to the model. For more details, please refer to
the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.

We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.

To consume the server, you can use the OpenAI client like in the example below:
@@ -121,6 +128,8 @@ To consume the server, you can use the OpenAI client like in the example below:
messages=[{
"role": "user",
"content": [
# NOTE: The prompt formatting with the image token `<image>` is not needed
# since the prompt will be processed automatically by the API server.
{"type": "text", "text": "What's in this image?"},
{
"type": "image_url",
@@ -144,5 +153,4 @@ A full code example can be found in `examples/openai_vision_api_client.py <https
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

.. note::
The prompt formatting with the image token ``<image>`` is not needed when serving VLMs with the API server since the prompt will be
processed automatically by the server.
There is no need to format the prompt in the API request since it will be handled by the server.
3 changes: 1 addition & 2 deletions examples/llava_example.py
@@ -17,8 +17,7 @@ def run_llava():
image_feature_size=576,
)

prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:")
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

image = Image.open("images/stop_sign.jpg")

11 changes: 3 additions & 8 deletions examples/llava_next_example.py
@@ -5,22 +5,17 @@

from vllm import LLM, SamplingParams

# Dynamic image input is currently not supported and therefore
# a fixed image input shape and its corresponding feature size is required.
# See https://github.com/vllm-project/vllm/pull/4199 for the complete
# configuration matrix.


def run_llava_next():
llm = LLM(
model="llava-hf/llava-v1.6-mistral-7b-hf",
image_token_id=32000,
image_input_shape="1,3,336,336",
image_feature_size=1176,
# Use the maximum possible value for memory profiling
image_feature_size=2928,
)

prompt = "[INST] " + "<image>" * 1176 + (
"\nWhat is shown in this image? [/INST]")
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))
sampling_params = SamplingParams(temperature=0.8,
8 changes: 5 additions & 3 deletions examples/phi3v_example.py
@@ -5,6 +5,9 @@

from vllm import LLM, SamplingParams

# The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`.
# You can use `.buildkite/download-images.sh` to download them


def run_phi3v():
model_path = "microsoft/Phi-3-vision-128k-instruct"
@@ -18,16 +21,15 @@ def run_phi3v():
trust_remote_code=True,
image_token_id=32044,
image_input_shape="1,3,1008,1344",
image_feature_size=1921,
# Use the maximum possible value for memory profiling
image_feature_size=2653,
max_num_seqs=5,
)

image = Image.open("images/cherry_blossom.jpg")

# single-image prompt
prompt = "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n" # noqa: E501
prompt = prompt.replace("<|image_1|>", "<|image|>" * 1921 + "<s>")

sampling_params = SamplingParams(temperature=0, max_tokens=64)

outputs = llm.generate(