[Model] Initialize support for InternVL2 series models #6514
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Full CI run is still required to merge this PR so once the PR is ready to go, please make sure to run it. If you need all test signals in between PR commits, you can trigger full CI as well. To run full CI, you can do one of these:
@lrain-CN You can give it a try. Based on my testing, it seems that only short prompts cause the Phi-3 special token to appear; with a longer prompt the output should match HuggingFace.
May I ask whether this PR supports AWQ quantization?
I used vLLM v0.5.5 to start OpenGVLab/InternVL2-Llama3-76B-AWQ on A100*2, and it fails as follows:
command to start vLLM:
@DarkLight1337 @Isotr0py
Not yet. I plan to work on this feature later this week if no one else is working on it.
@Isotr0py
@DarkLight1337, any comment or suggestion about this?
The weight loading fails on the LM backbone, so it seems that AWQ loading isn't supported for Llama3. cc @Isotr0py
@DarkLight1337, Thanks for checking! Any workaround here? :-)
@tonyaw Can you try adding
@Isotr0py, Thanks! New error:
command in use:
@tonyaw You can add
@Isotr0py, Thanks!
I think this won't affect model generation quality very much, because fp16 has higher precision compared to bf16.
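For reference, a minimal sketch of loading an InternVL2 AWQ checkpoint in fp16 with vLLM, mirroring the working configuration that appears later in this thread (the model name, tensor_parallel_size, and memory settings are illustrative, not prescriptive):

# Sketch only: load an InternVL2 AWQ checkpoint with vLLM, forcing fp16.
# tensor_parallel_size / gpu_memory_utilization below are example values.
from vllm import LLM

llm = LLM(
    model="OpenGVLab/InternVL2-Llama3-76B-AWQ",
    trust_remote_code=True,      # InternVL2 ships custom modeling code on the Hub
    quantization="awq",          # load the AWQ weights
    dtype="float16",             # fp16 instead of bf16, per the workaround above
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)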
Try increasing
Can you show which image you used and also the text prompt?
It appears that you didn't apply any template to the prompt. Make sure it is formatted as shown in these examples. Notice that there should be
You're right, '<image>\n介绍一下这幅图片' ("Describe this image") works, thank you!
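For anyone hitting the same formatting issue, a minimal single-image sketch (the model name and image path are placeholders; the essential part is the leading <image> placeholder in the prompt):

# Sketch only: single-image inference with the <image> placeholder in the prompt.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="OpenGVLab/InternVL2-2B", trust_remote_code=True)

image = Image.open("example.jpg")          # placeholder image file
prompt = "<image>\nDescribe this image."   # note the <image> token before the question

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)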
Looking forward to this feature being updated~
@Isotr0py Looking forward to this update!
@PancakeAwesome @hkunzhe InternVL2 now supports multi-image inputs; see examples/offline_inference_vision_language_multi_image.py
@Isotr0py Hi, I've successfully played with InternVL2-40B-AWQ + vLLM using the example code. However, I found that when the number of images increases to 8 (typical in a video-chat setting), the input tokens are too long for
lmdeploy can set
A minimal code to reproduce:

"""
This example shows how to use vLLM for running offline inference with
multi-image input on vision language models, using the chat template defined
by the model.
"""
from argparse import Namespace
from typing import List, NamedTuple, Optional

# from PIL.Image import Image
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from vllm import LLM, SamplingParams
from vllm.multimodal.utils import fetch_image
from vllm.utils import FlexibleArgumentParser

QUESTION = "What is the content of each image?"
IMAGE_URLS = [
    "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg",
] * 4


class ModelRequestData(NamedTuple):
    llm: LLM
    prompt: str
    stop_token_ids: Optional[List[int]]
    image_data: List[Image.Image]
    chat_template: Optional[str]


def load_internvl_video(question: str, image_urls: List[str]) -> ModelRequestData:
    model_name = "OpenGVLab/InternVL2-40B-AWQ"
    llm = LLM(
        model=model_name,
        trust_remote_code=True,
        max_num_seqs=5,
        max_model_len=8096,
        max_num_batched_tokens=8096,
        limit_mm_per_prompt={"image": len(image_urls)},
        gpu_memory_utilization=0.8,
        tensor_parallel_size=2,
        quantization="awq",
        dtype="float16",
    )

    placeholders = "\n".join(f"Image-{i}: <image>\n"
                             for i, _ in enumerate(image_urls, start=1))
    messages = [{'role': 'user', 'content': f"{placeholders}\n{question}"}]

    tokenizer = AutoTokenizer.from_pretrained(model_name,
                                              trust_remote_code=True)
    prompt = tokenizer.apply_chat_template(messages,
                                           tokenize=False,
                                           add_generation_prompt=True)

    # Stop tokens for InternVL.
    # Model variants may have different stop tokens;
    # please refer to the model card for the correct "stop words":
    # https://huggingface.co/OpenGVLab/InternVL2-2B#service
    stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
    stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

    return ModelRequestData(
        llm=llm,
        prompt=prompt,
        stop_token_ids=stop_token_ids,
        image_data=[fetch_image(url) for url in image_urls],
        chat_template=None,
    )


model_example_map = {
    "internvl_chat_video": load_internvl_video,
}


def run_generate(model, question: str, image_urls: List[str]):
    req_data = model_example_map[model](question, image_urls)

    sampling_params = SamplingParams(temperature=0.0,
                                     max_tokens=128,
                                     stop_token_ids=req_data.stop_token_ids)

    outputs = req_data.llm.generate(
        {
            "prompt": req_data.prompt,
            "multi_modal_data": {
                "image": req_data.image_data
            },
        },
        sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


def run_chat(model: str, question: str, image_urls: List[str]):
    req_data = model_example_map[model](question, image_urls)

    sampling_params = SamplingParams(temperature=0.0,
                                     max_tokens=128,
                                     stop_token_ids=req_data.stop_token_ids)
    outputs = req_data.llm.chat(
        [{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": question,
                },
                *({
                    "type": "image_url",
                    "image_url": {
                        "url": image_url
                    },
                } for image_url in image_urls),
            ],
        }],
        sampling_params=sampling_params,
        chat_template=req_data.chat_template,
    )

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


def main(args: Namespace):
    model = args.model_type
    method = args.method

    if method == "generate":
        run_generate(model, QUESTION, IMAGE_URLS)
    elif method == "chat":
        run_chat(model, QUESTION, IMAGE_URLS)
    else:
        raise ValueError(f"Invalid method: {method}")


if __name__ == "__main__":
    parser = FlexibleArgumentParser(
        description='Demo on using vLLM for offline inference with '
        'vision language models that support multi-image input')
    parser.add_argument('--model-type',
                        '-m',
                        type=str,
                        default="internvl_chat_video",
                        choices=model_example_map.keys(),
                        help='Huggingface "model_type".')
    parser.add_argument("--method",
                        type=str,
                        default="generate",
                        choices=["generate", "chat"],
                        help="The method to run in `vllm.LLM`.")

    args = parser.parse_args()

    main(args)
@Isotr0py Exposing max_dynamic_patch via mm_processor_kwargs as follows does not seem to work:

model_name = "OpenGVLab/InternVL2-40B-AWQ"
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_seqs=5,
    max_model_len=8096,
    max_num_batched_tokens=8096,
    limit_mm_per_prompt={"image": len(image_urls)},
    gpu_memory_utilization=0.8,
    tensor_parallel_size=2,
    quantization="awq",
    dtype="float16",
    mm_processor_kwargs={"max_dynamic_patch": 1},
)
Set
@Isotr0py It's OK now. Thanks for your quick fix!
Thanks for this great PR! I have deployed the model as an OpenAI-compatible server but have no idea how to call it. Could you please provide some examples or other references? Single-image and multi-image examples are both expected.
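A rough sketch of calling the OpenAI-compatible server with an image input, assuming the server is already running at http://localhost:8000 and serving the model (the model name, server URL, and image URL are placeholders):

# Sketch only: query a vLLM OpenAI-compatible server with an image via the
# standard chat-completions "image_url" content part.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",   # must match the model the server loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/some_image.jpg"}},
        ],
    }],
    max_tokens=128,
    temperature=0.0,
)
print(response.choices[0].message.content)

For multiple images, you can pass several image_url parts in the same message, but the server may need to be started with a suitable --limit-mm-per-prompt value; check the vLLM docs for the version you are running.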
FIX #4393
FIX #6321
This PR aims to add support for InternVL2 series models:
NOTE: This model was added after the release of 0.5.3.post1, so it'll only be included in the next release (e.g. 0.5.4). If you want to use it now, please install vLLM from source (i.e. main branch).
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:

- Please use format.sh to format your code.
- Add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

- The reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!