
[Model] Initialize support for InternVL2 series models #6514

Merged
merged 42 commits into from
Jul 29, 2024

Conversation

Isotr0py
Collaborator

@Isotr0py Isotr0py commented Jul 17, 2024


FIX #4393
FIX #6321 (link existing issues this PR will resolve)

This PR aims to add support for InternVL2 series models:

  • Port InternViT model.
  • Add InternVL2 models implementation (TODO: Check 26B/40B models).
  • Add and pass the InternVL2 model tests.

NOTE: This model was added after the release of 0.5.3.post1, so it'll only be included in the next release (e.g. 0.5.4). If you want to use it now, please install vLLM from source (i.e. main branch).



PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain code quality and improves the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented so that future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust, including both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behavior of vLLM. This helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and might not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient, and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only trigger fastcheck CI, which runs a small and essential subset of tests to quickly catch errors, with the flexibility to run extra individual tests on top (you can do this by unblocking test steps in the Buildkite run).

A full CI run is still required to merge this PR, so once the PR is ready to go, please make sure to run it. If you need all test signals in between PR commits, you can trigger full CI as well.

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@lrain-CN

[image]

I tried the 4B model: the output contains multiple [...], generation never stops, and it keeps looping until max_tokens.

@Isotr0py
Collaborator Author

Isotr0py commented Jul 20, 2024

@lrain-CN You can try adding stop="<|end|>" to SamplingParams. This seems to be a bug specific to the 4B model: its language model is Phi-3, but "<|end|>" was not added to the 4B model's special tokens.

In my testing, only short prompts seem to trigger the Phi-3 special token; with a slightly longer prompt you should get the same output as HuggingFace.
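
For reference, a minimal sketch of that workaround using the offline LLM API (the model path, image file, and prompt below are illustrative):

from PIL import Image
from vllm import LLM, SamplingParams

# Illustrative local image; replace with your own file.
image = Image.open("example.jpg")

# Example model name; the issue above concerns the 4B variant.
llm = LLM(model="OpenGVLab/InternVL2-4B", trust_remote_code=True)

# Pass "<|end|>" as an extra stop string, since the 4B variant's Phi-3
# tokenizer does not register it as a special token.
sampling_params = SamplingParams(temperature=0.0,
                                 max_tokens=128,
                                 stop=["<|end|>"])

outputs = llm.generate(
    {
        "prompt": "<image>\nDescribe this image.",
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)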

@wciq1208

May I ask whether this PR supports AWQ quantization?

@Isotr0py
Collaborator Author

@wciq1208 Quantized VLMs are not supported yet; support for quantized VLMs is on the roadmap at P2 priority.

@tonyaw

tonyaw commented Sep 4, 2024

I used vllm v0.5.5 to start OpenGVLab/InternVL2-Llama3-76B-AWQ on A100*2, and it fails as follows:

Loading pt checkpoint shards: 100% Completed | 28/28 [00:39<00:00,  1.41s/it]

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 476, in <module>
[rank0]:     asyncio.run(run_server(args))
[rank0]:   File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
[rank0]:     return loop.run_until_complete(main)
[rank0]:   File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank0]:     return future.result()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 443, in run_server
[rank0]:     async with build_async_engine_client(args) as async_engine_client:
[rank0]:   File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 120, in build_async_engine_client
[rank0]:     async_engine_client = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 270, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 46, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 138, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 182, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 881, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 344, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/internvl.py", line 508, in load_weights
[rank0]:     self.language_model.load_weights(llm_weights)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 506, in load_weights
[rank0]:     param = params_dict[name]
[rank0]: KeyError: 'model.layers.4.mlp.gate_up_proj.qweight'
INFO 09-03 23:54:32 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W903 23:54:36.058888282 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

command to start vllm:

python3 -m vllm.entrypoints.openai.api_server --model OpenGVLab/InternVL2-Llama3-76B-AWQ  --host 0.0.0.0 --port 8081  --seed 42 --trust-remote-code --disable-frontend-multiprocessing --enable-chunked-prefill  --tensor-parallel-size 2 2>&1 | tee -a /vllm-workspace/internvl2-llama3-76b-awq.log

@AmazDeng

AmazDeng commented Sep 4, 2024

Multi-image input isn't supported yet but is on our roadmap. Please refer to #4194 for the latest progress.

@DarkLight1337 @Isotr0py
May I ask whether the InternVL2 model in vLLM now supports multi-image input?

@Isotr0py
Collaborator Author

Isotr0py commented Sep 4, 2024

Not yet. I plan to work on this feature later this week if no one else is working on it.

@AmazDeng

AmazDeng commented Sep 5, 2024

Not yet. I plan to work on this feature later this week if no one else is working on it.

@Isotr0py
May I ask if there is a timeline? Approximately when can it be used?

@tonyaw

tonyaw commented Sep 5, 2024

I used vllm v0.5.5 to start OpenGVLab/InternVL2-Llama3-76B-AWQ on A100*2, and it fails as follows:

[same traceback and launch command as in my earlier comment, ending in KeyError: 'model.layers.4.mlp.gate_up_proj.qweight']

@DarkLight1337, any comment or suggestion about this?
I can use 0.5.5 to bring up "OpenGVLab/InternVL2-Llama3-76B" successfully, but it uses too much memory.

@DarkLight1337
Member

DarkLight1337 commented Sep 5, 2024

OpenGVLab/InternVL2-Llama3-76B-AWQ

The weight loading fails on the LM backbone, so it seems that AWQ loading isn't supported for Llama3. cc @Isotr0py

@tonyaw

tonyaw commented Sep 5, 2024

@DarkLight1337, Thanks for checking! Any workaround here? :-)

@Isotr0py
Collaborator Author

Isotr0py commented Sep 5, 2024

@tonyaw Can you try adding --quantization=awq? I remember that the quantization can't be inferred automatically for InternVL because their quant_config has some issues.

@tonyaw

tonyaw commented Sep 5, 2024

@Isotr0py, Thanks! New error:

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 270, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 46, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 138, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 182, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 881, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 341, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 174, in _initialize_model
[rank0]:     quant_config=_get_quantization_config(model_config, load_config),
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 108, in _get_quantization_config
[rank0]:     raise ValueError(
[rank0]: ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

command in use:

python3 -m vllm.entrypoints.openai.api_server --model OpenGVLab/InternVL2-Llama3-76B-AWQ --quantization=awq \
--host 0.0.0.0 --port 8080 --trust-remote-code --disable-frontend-multiprocessing \
--tensor-parallel-size 2 --max-model-len 4096  --gpu-memory-utilization=0.95 2>&1 | tee -a /vllm-workspace/internvl2-llama3-76b-awq.log   

@Isotr0py
Collaborator Author

Isotr0py commented Sep 5, 2024

@tonyaw You can add --dtype=half

@tonyaw

tonyaw commented Sep 5, 2024

@Isotr0py, Thanks!
It works with "--dtype float16", which is the same as "--dtype=half".
Will it impact model generation quality, since the model is in bf16 format?

@Isotr0py
Collaborator Author

Isotr0py commented Sep 5, 2024

I think this won't affect model generation quality very much, because fp16 has higher precision (more mantissa bits) than bf16.

@DarkLight1337
Member

Try increasing max_model_len while decreasing max_num_seqs.
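
Roughly, that trade-off looks like the sketch below (engine arguments with illustrative values; the model name is just an example):

from vllm import LLM

# Illustrative sketch: allow longer prompts per request (max_model_len)
# while lowering concurrency (max_num_seqs) so the KV cache still fits in memory.
llm = LLM(
    model="OpenGVLab/InternVL2-8B",  # example model
    trust_remote_code=True,
    max_model_len=8192,
    max_num_seqs=2,
)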

@DarkLight1337
Member

Can you show which image you used and also the text prompt?

@DarkLight1337
Member

DarkLight1337 commented Sep 9, 2024

Can you show which image you used and also the text prompt?

image = Image.open("COCO_test2014_000000001227.jpg")
prompt = "介绍一下这幅图片"

It appears that you didn't apply any template to the prompt. Make sure it is formatted as shown in these examples. Notice that there should be <image> tokens in the prompt.
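
For example, a minimal single-image request with the placeholder in place might look like this (the model name and sampling settings are illustrative; the image file is the one from your snippet):

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="OpenGVLab/InternVL2-8B", trust_remote_code=True)  # illustrative model

# The prompt must contain an <image> placeholder for the image input.
prompt = "<image>\n介绍一下这幅图片"
image = Image.open("COCO_test2014_000000001227.jpg")

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)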

@youw3

youw3 commented Sep 9, 2024

Can you show which image you used and also the text prompt?

image = Image.open("COCO_test2014_000000001227.jpg")
prompt = "介绍一下这幅图片"

It appears that you didn't apply any template to the prompt. Make sure it is formatted as shown in these examples. Notice that there should be <image> tokens in the prompt.

You're right, '<image>\n介绍一下这幅图片' works. Thank you!

@PancakeAwesome

Not yet. I plan to work on this feature later this week if no one else is working on it.

Looking forward to this feature!

@hkunzhe

hkunzhe commented Sep 26, 2024

Not yet. I plan to work on this feature later this week if no one else is working on it.

@Isotr0py looking forward to this update!

@Isotr0py
Collaborator Author

@PancakeAwesome @hkunzhe InternVL2 now supports multi-image inputs; see examples/offline_inference_vision_language_multi_image.py

@hkunzhe

hkunzhe commented Sep 27, 2024

@Isotr0py Hi, I've successfully run InternVL2-40B-AWQ with vLLM using the example code. However, I found that when the number of input images increases to 8 (typical in a video-chat setting), the input tokens become too long for max_model_len.

ValueError: The prompt (total length 14424) is too long to fit into the model (context length 8096). Make sure that `max_model_len` is no smaller than the number of text tokens plus multimodal tokens. For image inputs, the number of image tokens depends on the number of images, and possibly their aspect ratios as well.

lmdeploy can set max_dynamic_patch=1 to reduce the input tokens, either by editing config.json or by using a dynamic config in the request payload (InternLM/lmdeploy#2263 (comment)). In vLLM, I encountered the following error when setting max_dynamic_patch=1 by editing config.json:

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20240927-082809.pkl): Attempted to assign 8 x 256 = 2048 multimodal tokens to 4096 placeholders

Minimal code to reproduce:

"""
This example shows how to use vLLM for running offline inference with
multi-image input on vision language models, using the chat template defined
by the model.
"""
from argparse import Namespace
from typing import List, NamedTuple, Optional

# from PIL.Image import Image
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from vllm import LLM, SamplingParams
from vllm.multimodal.utils import fetch_image
from vllm.utils import FlexibleArgumentParser


QUESTION = "What is the content of each image?"
IMAGE_URLS = [
    "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg",
  "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg",
] * 4


class ModelRequestData(NamedTuple):
    llm: LLM
    prompt: str
    stop_token_ids: Optional[List[int]]
    image_data: List[Image.Image]
    chat_template: Optional[str]


def load_internvl_video(question: str, image_urls: List[str]) -> ModelRequestData:
    model_name = "OpenGVLab/InternVL2-40B-AWQ"

    llm = LLM(
        model=model_name,
        trust_remote_code=True,
        max_num_seqs=5,
        max_model_len=8096,
        max_num_batched_tokens=8096,
        limit_mm_per_prompt={"image": len(image_urls)},
        gpu_memory_utilization=0.8,
        tensor_parallel_size=2,
        quantization="awq",
        dtype="float16"
    )

    placeholders = "\n".join(f"Image-{i}: <image>\n"
                             for i, _ in enumerate(image_urls, start=1))
    messages = [{'role': 'user', 'content': f"{placeholders}\n{question}"}]

    tokenizer = AutoTokenizer.from_pretrained(model_name,
                                              trust_remote_code=True)
    prompt = tokenizer.apply_chat_template(messages,
                                           tokenize=False,
                                           add_generation_prompt=True)

    # Stop tokens for InternVL
    # models variants may have different stop tokens
    # please refer to the model card for the correct "stop words":
    # https://huggingface.co/OpenGVLab/InternVL2-2B#service
    stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
    stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

    return ModelRequestData(
        llm=llm,
        prompt=prompt,
        stop_token_ids=stop_token_ids,
        image_data=[fetch_image(url) for url in image_urls],
        chat_template=None,
    )


model_example_map = {
    "internvl_chat_video": load_internvl_video
}


def run_generate(model, question: str, image_urls: List[str]):
    req_data = model_example_map[model](question, image_urls)

    sampling_params = SamplingParams(temperature=0.0,
                                     max_tokens=128,
                                     stop_token_ids=req_data.stop_token_ids)

    outputs = req_data.llm.generate(
        {
            "prompt": req_data.prompt,
            "multi_modal_data": {
                "image": req_data.image_data
            },
        },
        sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


def run_chat(model: str, question: str, image_urls: List[str]):
    req_data = model_example_map[model](question, image_urls)

    sampling_params = SamplingParams(temperature=0.0,
                                     max_tokens=128,
                                     stop_token_ids=req_data.stop_token_ids)
    outputs = req_data.llm.chat(
        [{
            "role":
            "user",
            "content": [
                {
                    "type": "text",
                    "text": question,
                },
                *({
                    "type": "image_url",
                    "image_url": {
                        "url": image_url
                    },
                } for image_url in image_urls),
            ],
        }],
        sampling_params=sampling_params,
        chat_template=req_data.chat_template,
    )

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


def main(args: Namespace):
    model = args.model_type
    method = args.method

    if method == "generate":
        run_generate(model, QUESTION, IMAGE_URLS)
    elif method == "chat":
        run_chat(model, QUESTION, IMAGE_URLS)
    else:
        raise ValueError(f"Invalid method: {method}")


if __name__ == "__main__":
    parser = FlexibleArgumentParser(
        description='Demo on using vLLM for offline inference with '
        'vision language models that support multi-image input')
    parser.add_argument('--model-type',
                        '-m',
                        type=str,
                        default="internvl_chat_video",
                        choices=model_example_map.keys(),
                        help='Huggingface "model_type".')
    parser.add_argument("--method",
                        type=str,
                        default="generate",
                        choices=["generate", "chat"],
                        help="The method to run in `vllm.LLM`.")

    args = parser.parse_args()
    main(args)

@Isotr0py
Collaborator Author

@hkunzhe Thanks for reporting this! We can expose the max_dynamic_patch to mm_processor_kwargs just like #8658. I will take a look this weekend.

The bug seems to be that use_thumbnail wasn't disabled when max_dynamic_patch=1 for some reason; I will take a look as well.

@hkunzhe

hkunzhe commented Sep 29, 2024

@Isotr0py Exposing max_dynamic_patch via mm_processor_kwargs like this:

model_name = "OpenGVLab/InternVL2-40B-AWQ"

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_seqs=5,
    max_model_len=8096,
    max_num_batched_tokens=8096,
    limit_mm_per_prompt={"image": len(image_urls)},
    gpu_memory_utilization=0.8,
    tensor_parallel_size=2,
    quantization="awq",
    dtype="float16",
    mm_processor_kwargs={"max_dynamic_patch": 1}
)

does not seem to work:

 The following intended overrides are not keyword-only args and and will be dropped: {'max_dynamic_patch'}

Setting max_dynamic_patch=2 does not trigger the previous bug (#6514 (comment)) and can be used as a temporary workaround.

@Isotr0py
Collaborator Author

@hkunzhe I have created #8946 to fix these. Can you take a look if it works on 40B-AWQ as well? Thanks!

@hkunzhe

hkunzhe commented Sep 30, 2024

@Isotr0py It's OK now. Thanks for your quick fix!

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
@zhenfenxiao

Thanks for this great PR! I have deployed the model as an OpenAI-compatible server but have no idea how to call it. Could you please provide some examples or other references? Single-image and multi-image examples would both be appreciated.
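
For what it's worth, here is a minimal sketch of querying the server through the OpenAI-compatible Chat Completions API (the endpoint, model name, and image URL are illustrative; for multi-image input, add more image_url entries, subject to the server's multimodal limits):

from openai import OpenAI

# Illustrative endpoint; point this at the host/port the vLLM server listens on.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",  # must match the --model passed to the server
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the content of this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
                },
            },
            # Append more image_url entries here for multi-image requests.
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)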

Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Model]: Support for InternVL2
[Model]: Support for InternVL-Chat-V1-5