[Model] Initialize support for InternVL2 series models #6514
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Full CI run is still required to merge this PR so once the PR is ready to go, please make sure to run it. If you need all test signals in between PR commits, you can trigger full CI as well. To run full CI, you can do one of these:
@lrain-CN You can give it a try. Based on my testing, it seems that only short prompts cause the Phi-3 special token to appear; with a longer prompt the output should match HuggingFace.
May I ask whether this PR supports AWQ quantization?
I used vLLM v0.5.5 to start OpenGVLab/InternVL2-Llama3-76B-AWQ on A100*2, and it fails as follows:
command to start vLLM:
@DarkLight1337 @Isotr0py
Not yet. I plan to work on this feature later this week if no one else is working on it.
@Isotr0py
@DarkLight1337, any comment or suggestion about this?
The weight loading fails on the LM backbone, so it seems that AWQ loading isn't supported for Llama3. cc @Isotr0py
@DarkLight1337, Thanks for checking! Any workaround here? :-)
@tonyaw Can you try adding
@Isotr0py, Thanks! New error:
command in use:
@tonyaw You can add
@Isotr0py, Thanks!
I think this won't affect model generation quality very much, because fp16 has higher precision compared to bf16.
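For reference, a minimal sketch of loading an InternVL2 AWQ checkpoint in fp16 with vLLM, mirroring the working configuration that appears later in this thread (the model name, tensor_parallel_size, and memory settings are illustrative, not prescriptive):

# Sketch only: load an InternVL2 AWQ checkpoint with vLLM, forcing fp16.
# tensor_parallel_size / gpu_memory_utilization below are example values.
from vllm import LLM

llm = LLM(
    model="OpenGVLab/InternVL2-Llama3-76B-AWQ",
    trust_remote_code=True,      # InternVL2 ships custom modeling code on the Hub
    quantization="awq",          # load the AWQ weights
    dtype="float16",             # fp16 instead of bf16, per the workaround above
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)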
Try increasing
Can you show which image you used and also the text prompt?
It appears that you didn't apply any template to the prompt. Make sure it is formatted as shown in these examples. Notice that there should be
You're right, '<image>\n介绍一下这幅图片' ("Describe this image") works, thank you!
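For anyone hitting the same formatting issue, a minimal single-image sketch (the model name and image path are placeholders; the essential part is the leading <image> placeholder in the prompt):

# Sketch only: single-image inference with the <image> placeholder in the prompt.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="OpenGVLab/InternVL2-2B", trust_remote_code=True)

image = Image.open("example.jpg")          # placeholder image file
prompt = "<image>\nDescribe this image."   # note the <image> token before the question

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)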
Looking forward to this feature being updated~
@Isotr0py Looking forward to this update!
@PancakeAwesome @hkunzhe InternVL2 now supports multi-image inputs; see examples/offline_inference_vision_language_multi_image.py
@Isotr0py Hi, I've successfully played with InternVL2-40B-AWQ + vLLM using the example code. However, I found that when the number of images increases to 8 (typical in a video-chat setting), the input tokens are too long for
lmdeploy can set
A minimal code to reproduce:

"""
This example shows how to use vLLM for running offline inference with
multi-image input on vision language models, using the chat template defined
by the model.
"""
from argparse import Namespace
from typing import List, NamedTuple, Optional

# from PIL.Image import Image
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from vllm import LLM, SamplingParams
from vllm.multimodal.utils import fetch_image
from vllm.utils import FlexibleArgumentParser

QUESTION = "What is the content of each image?"
IMAGE_URLS = [
    "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg",
] * 4


class ModelRequestData(NamedTuple):
    llm: LLM
    prompt: str
    stop_token_ids: Optional[List[int]]
    image_data: List[Image.Image]
    chat_template: Optional[str]


def load_internvl_video(question: str, image_urls: List[str]) -> ModelRequestData:
    model_name = "OpenGVLab/InternVL2-40B-AWQ"
    llm = LLM(
        model=model_name,
        trust_remote_code=True,
        max_num_seqs=5,
        max_model_len=8096,
        max_num_batched_tokens=8096,
        limit_mm_per_prompt={"image": len(image_urls)},
        gpu_memory_utilization=0.8,
        tensor_parallel_size=2,
        quantization="awq",
        dtype="float16",
    )

    placeholders = "\n".join(f"Image-{i}: <image>\n"
                             for i, _ in enumerate(image_urls, start=1))
    messages = [{'role': 'user', 'content': f"{placeholders}\n{question}"}]

    tokenizer = AutoTokenizer.from_pretrained(model_name,
                                              trust_remote_code=True)
    prompt = tokenizer.apply_chat_template(messages,
                                           tokenize=False,
                                           add_generation_prompt=True)

    # Stop tokens for InternVL.
    # Model variants may have different stop tokens;
    # please refer to the model card for the correct "stop words":
    # https://huggingface.co/OpenGVLab/InternVL2-2B#service
    stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
    stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

    return ModelRequestData(
        llm=llm,
        prompt=prompt,
        stop_token_ids=stop_token_ids,
        image_data=[fetch_image(url) for url in image_urls],
        chat_template=None,
    )


model_example_map = {
    "internvl_chat_video": load_internvl_video,
}


def run_generate(model, question: str, image_urls: List[str]):
    req_data = model_example_map[model](question, image_urls)

    sampling_params = SamplingParams(temperature=0.0,
                                     max_tokens=128,
                                     stop_token_ids=req_data.stop_token_ids)

    outputs = req_data.llm.generate(
        {
            "prompt": req_data.prompt,
            "multi_modal_data": {
                "image": req_data.image_data
            },
        },
        sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


def run_chat(model: str, question: str, image_urls: List[str]):
    req_data = model_example_map[model](question, image_urls)

    sampling_params = SamplingParams(temperature=0.0,
                                     max_tokens=128,
                                     stop_token_ids=req_data.stop_token_ids)
    outputs = req_data.llm.chat(
        [{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": question,
                },
                *({
                    "type": "image_url",
                    "image_url": {
                        "url": image_url
                    },
                } for image_url in image_urls),
            ],
        }],
        sampling_params=sampling_params,
        chat_template=req_data.chat_template,
    )

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


def main(args: Namespace):
    model = args.model_type
    method = args.method

    if method == "generate":
        run_generate(model, QUESTION, IMAGE_URLS)
    elif method == "chat":
        run_chat(model, QUESTION, IMAGE_URLS)
    else:
        raise ValueError(f"Invalid method: {method}")


if __name__ == "__main__":
    parser = FlexibleArgumentParser(
        description='Demo on using vLLM for offline inference with '
        'vision language models that support multi-image input')
    parser.add_argument('--model-type',
                        '-m',
                        type=str,
                        default="internvl_chat_video",
                        choices=model_example_map.keys(),
                        help='Huggingface "model_type".')
    parser.add_argument("--method",
                        type=str,
                        default="generate",
                        choices=["generate", "chat"],
                        help="The method to run in `vllm.LLM`.")

    args = parser.parse_args()

    main(args)
@Isotr0py Exposing max_dynamic_patch via mm_processor_kwargs as follows does not seem to work:

model_name = "OpenGVLab/InternVL2-40B-AWQ"
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_seqs=5,
    max_model_len=8096,
    max_num_batched_tokens=8096,
    limit_mm_per_prompt={"image": len(image_urls)},
    gpu_memory_utilization=0.8,
    tensor_parallel_size=2,
    quantization="awq",
    dtype="float16",
    mm_processor_kwargs={"max_dynamic_patch": 1},
)
Set
@Isotr0py It's OK now. Thanks for your quick fix!
Thanks for this great PR! I have deployed the model as an OpenAI-compatible server but have no idea how to call it. Could you please provide some examples or other references? Single-image and multi-image examples are both expected.
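A rough sketch of calling the OpenAI-compatible server with an image input, assuming the server is already running at http://localhost:8000 and serving the model (the model name, server URL, and image URL are placeholders):

# Sketch only: query a vLLM OpenAI-compatible server with an image via the
# standard chat-completions "image_url" content part.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",   # must match the model the server loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/some_image.jpg"}},
        ],
    }],
    max_tokens=128,
    temperature=0.0,
)
print(response.choices[0].message.content)

For multiple images, you can pass several image_url parts in the same message, but the server may need to be started with a suitable --limit-mm-per-prompt value; check the vLLM docs for the version you are running.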
FIX #4393
FIX #6321
This PR aims to add support for InternVL2 series models:
NOTE: This model was added after the release of 0.5.3.post1, so it'll only be included in the next release (e.g. 0.5.4). If you want to use it now, please install vLLM from source (i.e. main branch).
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:

- Please use format.sh to format your code.
- Add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

- The reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!