[Core] Dynamic image size support for VLMs #5276

DarkLight1337 · 2024-06-05T11:24:07Z

This PR uses the input registry introduced by #5214 to implement an input processor that inserts image tokens automatically at the LLMEngine level, so that it also applies to LLM.generate.
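To illustrate the token insertion described above, here is a minimal sketch of the idea, not the actual vLLM code. The function name `expand_image_tokens`, the token ID `32000`, and the feature size are hypothetical and chosen only for illustration. It assumes the prompt has already been tokenized and contains one placeholder token per image.

```python
from typing import List


def expand_image_tokens(
    token_ids: List[int],
    image_token_id: int,
    feature_size: int,
) -> List[int]:
    """Replace each image placeholder with `feature_size` copies of itself.

    Hypothetical helper: the real processor would derive `feature_size`
    from the model config and the size of the input image.
    """
    expanded: List[int] = []
    for tok in token_ids:
        if tok == image_token_id:
            # One placeholder in the prompt becomes one token per image feature.
            expanded.extend([image_token_id] * feature_size)
        else:
            expanded.append(tok)
    return expanded


# Usage: a prompt with a single placeholder (ID 32000 is an assumption).
prompt_ids = [1, 32000, 5]
result = expand_image_tokens(prompt_ids, image_token_id=32000, feature_size=3)
print(result)
```

Because the expansion happens in one place before the model runs, every entry point that goes through the engine gets the same behaviour, and the number of inserted tokens can vary per image.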

Accordingly, I have updated LLaVA-NeXT and Phi-3-Vision to support dynamic image size. Along the way, I have expanded the VLM tests to consider text-only and multiscale-image input in addition to the current single-scale image input.

Based on this, I have written a detailed guide on how to implement multimodal vLLM models.

Please note that this introduces a breaking change for users. Instead of manually repeating image tokens in the prompt, users should now use the same prompt format as described in the corresponding HuggingFace repo, regardless of the model.
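As a rough illustration of the change in prompt format, the sketch below contrasts the two styles using plain strings. The placeholder `<image>`, the count of 576, and the chat template are assumptions for a LLaVA-style model; check the model's HuggingFace repo for the exact format.

```python
# Hypothetical values, for illustration only.
IMAGE_TOKEN = "<image>"
FEATURE_SIZE = 576  # assumed number of image features for a fixed-size input

question = "What is the content of this image?"

# Before: the caller repeated the image token once per image feature.
old_prompt = IMAGE_TOKEN * FEATURE_SIZE + f"\nUSER: {question}\nASSISTANT:"

# After: the caller writes a single placeholder, as in the HuggingFace format,
# and the engine expands it to the right length.
new_prompt = f"USER: {IMAGE_TOKEN}\n{question}\nASSISTANT:"

print(old_prompt.count(IMAGE_TOKEN), new_prompt.count(IMAGE_TOKEN))
```

With the new format the caller no longer needs to know the feature size, which matters once it depends on the dimensions of each image.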

Related contributions

Follow-up to #5214.

This PR conflicts with #5237, as that PR inserts image tokens at the OpenAIServing level. This PR removes that logic from the server to avoid double insertion.

ywang96

Overall LGTM - I'll just need to run some testing on my end before finally approving this!

ywang96

LGTM! Thank you for the work and glad we resolved all the issues!

Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: ywang96 <ywang@roblox.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

DarkLight1337 added 16 commits June 3, 2024 06:34

Introduce a higher level INPUT_REGISTRY

34bfa79

Move dummy data generation to input registry

df2aa19

Update docs

c72d2b3

Rename process_input to map_input

d8c6488

Reorder arguments

f18de48

Apply input processor

653537d

Remove VisionLanguageConfig from input mapper

a2f5a3c

Fix bad use of functools.partial

378ad80

Use default input processor

7aa3778

Merge branch 'upstream' into mm-image-tokenizer

c774168

Fix wrong arguments

532f863

Use pillow image instead of tensor to avoid bypassing the processor by default

080d40c

Update interface of dummy data factory and input processor

662693a

Use InputContext to handle checked type cast of config types

9bc5fcc

Add input processor for injecting image tokens; fix docs

911cac7

Add new documentation pages

a38b347

DarkLight1337 marked this pull request as draft June 5, 2024 11:24

DarkLight1337 changed the title ~~[Core][Docs] Use input processor to insert image tokens~~ [Core][Doc] Use input processor to insert image tokens Jun 5, 2024

DarkLight1337 mentioned this pull request Jun 5, 2024

[RFC]: Multi-modality Support Refactoring #4194

Open

55 tasks

DarkLight1337 force-pushed the mm-image-tokenizer-2 branch from bdac3c9 to 25b5bb1 Compare June 5, 2024 13:48

DarkLight1337 added 2 commits June 5, 2024 14:25

Fix LLaVA-NeXT input processor and cleanup code

29c3bb3

Fix LLaVA-NeXT input processor and cleanup code

9cfbcce

DarkLight1337 force-pushed the mm-image-tokenizer-2 branch from 25b5bb1 to 9cfbcce Compare June 5, 2024 14:25

This was referenced Jun 5, 2024

[Model] Dynamic image size support for LLaVA-NeXT #5279

Closed

[Core] Support image processor #4197

Merged

DarkLight1337 added 5 commits June 6, 2024 03:17

Add sanity check

7bb6cbf

Merge branch 'upstream' into mm-image-tokenizer

ccf49c4

Merge branch 'upstream' into mm-image-tokenizer

3482d32

Merge branch 'upstream' into mm-image-tokenizer

8ea8468

Merge branch 'upstream' into mm-image-tokenizer

be3d64f

DarkLight1337 added 5 commits July 2, 2024 09:25

Avoid cuda init

526a871

Remove unused logger

a5174da

Remove unnecessary deferred imports

6cf34e4

Merge branch 'upstream' into mm-image-tokenizer-2

feff395

Fix typo

aacb5d0

DarkLight1337 mentioned this pull request Jul 2, 2024

[ci][misc] fix more device count #6055

Closed

ywang96 reviewed Jul 2, 2024

examples/llava_next_example.py (outdated, resolved)

vllm/multimodal/base.py (outdated, resolved)

vllm/multimodal/image.py (resolved)

DarkLight1337 and others added 9 commits July 2, 2024 17:36

Address comments

13f43bd

Add comment

00e9e39

Merge branch 'main' into mm-image-tokenizer-2

288bfb9

Merge branch 'upstream' into mm-image-tokenizer-2

284fca8

Update XPU runner's multimodal logic

a231eaf

Fix unused import

ec74121

Fix feature size calculation

d16d3c8

Add extra image to test

aaa0f1f

Support multimodal data for neuron and tpu

cc540c3

DarkLight1337 force-pushed the mm-image-tokenizer-2 branch from 93ad7de to cc540c3 Compare July 3, 2024 01:49

DarkLight1337 added 3 commits July 3, 2024 01:49

Fix broadcasting

48489ef

Fix OpenVINO model runner for multimodal data

2adc41f

Cleanup

0e6845f

ywang96 approved these changes Jul 3, 2024

youkaichao merged commit 9831aec into vllm-project:main Jul 3, 2024
68 of 70 checks passed

DarkLight1337 deleted the mm-image-tokenizer-2 branch July 3, 2024 04:13

This was referenced Jul 3, 2024

[Model] Adding support for MiniCPM-V #4087

Merged

[Model] Add GLM-4v support #5358

Closed

M0gician mentioned this pull request Jul 7, 2024

Make sglang compat with vllm 0.5.1 sgl-project/sglang#598

Merged
