[Frontend] Support GPT-4V Chat Completions API #4200

Closed
wants to merge 25 commits

Conversation

@DarkLight1337 (Member) commented Apr 19, 2024

I have refactored the OpenAI-compatible server to support image input in the OpenAI Chat Completions API, as specified for GPT-4 with Vision (GPT-4V). This enables easy access to VLMs in vLLM via the OpenAI client.

FIX #2058
FIX #3873
FIX #4205
FIX #4826

Features

  1. Support for image input in the OpenAI Chat Completions API
    • Accepts both data URLs and web URLs (see the sketch after this list).
    • Accepts all image formats specified by OpenAI (PNG, JPEG, WEBP, GIF); additionally supports BMP and TIFF.
    • The detail parameter is not applicable to most models, so it is ignored.
    • Limitation: Only accepts a single image per API call.
    • The documentation for using VLMs has been updated accordingly.
    • A chat template (examples/template_llava.jinja) has been added to the repository so that you do not have to write your own to run llava-hf/llava-1.5-7b-hf and similar models (those models do not provide the template in config.json).
    • TODO: Verify whether it perfectly matches the one required for llava-hf/llava-v1.6-34b-hf. The model seems to work but I haven't checked the input prompt specifically.
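Here is the sketch referenced above: a minimal example of the data-URL path, assuming the standard openai Python client pointed at a local vLLM server (the base_url, API key, and image file name are placeholders, not values from this PR).

import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local image and embed it as a data URL instead of a web URL.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
            },
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)
print(chat_response.choices[0].message.content)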

Related Contributions

This PR directly resolves issues #2058 and #3873. It also supersedes PR #3467.

This PR completes part of #3978.

This PR depends on #4197; it is set to Draft status until it is merged. Afterwards, you should be able to see the real diffs made exclusively by this PR.

To avoid unnecessary resource usage, this branch is frozen (except for critical fixes) until its dependencies have all been merged.

Future Work

Develop a more extensible framework for representing images from the OpenAI API in the text prompt.

  • Currently, LlavaForConditionalGeneration models in vLLM require that the <image> token be repeated image_feature_size times (a sketch of this expansion follows this list). This differs from the documentation on HuggingFace, which specifies that only one <image> token needs to be provided. I think the implementation detail of duplicating <image> tokens in vLLM should be hidden from the user.
  • The framework should be applicable to multi-image input. MMICL may be a suitable model to showcase this functionality.
  • In my opinion, we can tackle this in one of two ways:
    1. If there are only a small number of standard methods, we can provide a config option to choose which method to apply. I have added the image_openai attribute to VisionLanguageConfig to facilitate this.
    2. A more flexible option would be to pass the image(s) to the chat template (e.g. by setting the images attribute alongside role and content). This transfers the implementation burden to the maintainers of the model on HuggingFace; since most existing templates do not handle images, it makes it more likely that vLLM users will have to implement their own template. I have created a ConversationMessage class to represent the dictionary for each message.
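Here is the sketch mentioned above: a rough illustration (not vLLM internals) of the expansion that users currently have to perform themselves. The placeholder string and helper name are assumptions of this sketch, and 576 is the llava-1.5 feature size used elsewhere in this thread.

IMAGE_TOKEN = "<image>"      # placeholder string assumed by this sketch
IMAGE_FEATURE_SIZE = 576     # llava-1.5 at 336x336 resolution

def expand_image_tokens(prompt: str) -> str:
    # Replace each single <image> marker with the repeated form that
    # LlavaForConditionalGeneration in vLLM currently expects.
    return prompt.replace(IMAGE_TOKEN, IMAGE_TOKEN * IMAGE_FEATURE_SIZE)

prompt = "USER: <image>\nWhat's in this image?\nASSISTANT:"
expanded = expand_image_tokens(prompt)
assert expanded.count(IMAGE_TOKEN) == IMAGE_FEATURE_SIZE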

@jamt9000 commented Apr 29, 2024

Hello, nice work adding multimodal support! I'm trying out your PRs but running into some issues. I'm starting the OpenAI server (using the Dockerfile) with:

python -m vllm.entrypoints.openai.api_server --model llava-hf/llava-1.5-7b-hf --image-input-type pixel_values --image-token-id 32000 --image-input-shape 1,3,336,336 --image-feature-size 576 --chat-template examples/template_llava.jinja

and this example from the docs in this PR:

chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            },
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)

But I'm getting these errors:

With #3978:

  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava.py", line 274, in forward
    inputs_embeds = _merge_vision_embeddings(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava.py", line 59, in _merge_vision_embeddings
    raise ValueError(f"image_feature_size should be {image_feature_size}, "
ValueError: image_feature_size should be 576, but found: 0

With #4200:

  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava.py", line 260, in forward
    inputs_embeds = _merge_vision_embeddings(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava.py", line 55, in _merge_vision_embeddings
    inputs_embeds[mask] = vision_embeddings.view(-1,
RuntimeError: shape mismatch: value tensor of shape [576, 4096] cannot be broadcast to indexing result of shape [0, 4096]

So it seems like it isn't adding the image tokens correctly. Am I doing something wrong? Would appreciate it if you could point me in the right direction!
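(For context, both tracebacks come down to a check along the lines of the following simplified sketch, which is not vLLM's actual code: the positions of the image token in input_ids are overwritten with the vision encoder outputs, so a prompt that ends up containing no image tokens yields an empty mask and fails in exactly these two ways.)

import torch

def merge_vision_embeddings_sketch(
    input_ids: torch.Tensor,
    inputs_embeds: torch.Tensor,
    vision_embeddings: torch.Tensor,
    image_token_id: int = 32000,
    image_feature_size: int = 576,
) -> torch.Tensor:
    # Positions in the prompt that hold the image token.
    mask = input_ids == image_token_id
    if mask.sum() != image_feature_size:
        # Corresponds to the ValueError seen with #3978.
        raise ValueError(f"image_feature_size should be {image_feature_size}, "
                         f"but found: {mask.sum()}")
    # Corresponds to the shape-mismatch RuntimeError seen with #4200 when the
    # mask selects zero rows.
    inputs_embeds[mask] = vision_embeddings.view(-1, inputs_embeds.shape[-1])
    return inputs_embeds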

Edit: OK, it seems to be working when run outside of Docker, so it seems the problem is that --chat-template examples/template_llava.jinja isn't in the working directory of the Dockerfile, and that instead of failing when it can't find the file, it treats the filename as a literal string template. I see now this was fixed in #4292 to catch such a scenario!
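(For anyone hitting the same problem, the guard amounts to something like the sketch below; this is only the idea, not the actual code from #4292, and the helper name is hypothetical.)

from pathlib import Path

def load_chat_template(chat_template: str) -> str:
    # Hypothetical helper: prefer reading the argument as a file path.
    path = Path(chat_template)
    if path.is_file():
        return path.read_text()
    # Fail loudly when the argument looks like a template filename that does
    # not exist, instead of silently rendering the filename itself as a
    # literal Jinja template.
    if chat_template.endswith(".jinja"):
        raise FileNotFoundError(f"Chat template file not found: {chat_template}")
    # Otherwise, assume the argument is a literal Jinja template string.
    return chat_template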

rkooo567 pushed a commit that referenced this pull request May 13, 2024
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)

Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported relative to the test packages.
@happywu commented May 16, 2024

hi @DarkLight1337,

I found that the LLaVA model with GPT-4V-style chat completion always outputs a whitespace at the beginning, while pure text models do not have this issue. Do you have any idea why this might happen?

ChatCompletion(id='cmpl-564bda62019242aa9ce6cf3d3e955ae7', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image captures a lush green field under a blue cloudy sky. A boardwalk or wooden trail is positioned in the middle of the field, leading towards a wooded area. The trail appears to be a pathway designed for walking or recreational purposes.\n\nIn the surrounding landscape, various trees can be seen in the distance, adding to the serene atmosphere of the scene. The combination of the vibrant green grass, the blue sky, and the trail creates a picturesque outdoor setting.', role='assistant', function_call=None, tool_calls=None), stop_reason=None)], created=1715818283, model='llava-hf/llava-1.5-7b-hf', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=107, prompt_tokens=595, total_tokens=702))

@DarkLight1337 (Member, Author)

I found that the LLaVA model with GPT-4V-style chat completion always outputs a whitespace at the beginning, while pure text models do not have this issue. Do you have any idea why this might happen?

The inputs to AsyncLLMEngine in the OpenAI-compatible server seem correct. Perhaps it is an issue with the updated implementation of LLaVA (#4197). I'll take a deeper look later today.

@DarkLight1337 (Member, Author)

I have narrowed down the issue to this line. For some reason, new_decoded_token_text is pre-padded with extra whitespace characters.

(new_tokens, new_decoded_token_text, prefix_offset,

@DarkLight1337 (Member, Author) commented May 17, 2024

After some digging, it seems that this is expected behaviour: because there is no whitespace after "ASSISTANT:" in the prompt, the language model fills it in automatically, resulting in the output "ASSISTANT: <text>" (instead of "ASSISTANT:<text>").

vLLM code

>>> from vllm import LLM
>>> llm = LLM(
...     model="llava-hf/llava-1.5-7b-hf",
...     image_input_type="pixel_values",
...     image_token_id=32000,
...     image_input_shape="1,3,336,336",
...     image_feature_size=576,
... )
>>> [o.outputs[0].text for o in llm.generate("USER: What is 1+1?\nASSISTANT:")]
[' 1+1 is equal to 2.']
>>> [o.outputs[0].text for o in llm.generate("USER: What is 1+1?\nASSISTANT: ")]
['1+1 is equal to 2.']

HuggingFace code

>>> from transformers import AutoProcessor, LlavaForConditionalGeneration
>>> llm = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
>>> processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
>>> processor.batch_decode(llm.generate(**processor(text="USER: What is 1+1?\nASSISTANT:"), max_new_tokens=15))
['<s> USER: What is 1+1?\nASSISTANT: 1+1 is equal to 2.</s>']
>>> processor.batch_decode(llm.generate(**processor(text="USER: What is 1+1?\nASSISTANT: "), max_new_tokens=15))
['<s> USER: What is 1+1?\nASSISTANT: 1+1 is equal to 2.</s>']

I have tried adding a space after "ASSISTANT:" in the chat template but it harms the performance of the model, particularly when an image is included in the input.

robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 19, 2024

Review thread on examples/template_llava.jinja:

{%- for message in messages -%}
{{ message['role'].upper() + ': ' + message['content'] }}
{%- if (loop.last and add_generation_prompt) or not loop.last -%}
{{- '\n' -}}
@jamt9000 commented May 20, 2024

I think there should actually be no '\n'

Going by the vicuna_v1 conversation style used for llava-v1.5-13b, e.g. here

Which has sep2="</s>"

https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/conversation.py#L242-L252

The initial prompt will look like this:

In [3]: from llava import conversation

In [4]: conv = conversation.conv_vicuna_v1.copy()

In [5]: conv.append_message(conv.roles[0], "Hi")

In [6]: conv.append_message(conv.roles[1], None)

In [7]: conv.get_prompt()
Out[7]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT:"

And then continues like this with a </s> after each assistant response.

In [9]: conv.messages[-1][-1] = " Hello I am LLaVA".strip()  # The model's reply will start with a space, which looks like it gets stripped, e.g. here
In [10]: conv.get_prompt()
Out[10]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>"
In [11]: conv.append_message(conv.roles[0], "What is the capital of France?")
In [12]: conv.append_message(conv.roles[1], None)
In [13]: conv.get_prompt()
Out[13]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>USER: What is the capital of France? ASSISTANT:"

@DarkLight1337 (Member, Author)

I modeled the chat template according to their HF repo. Their example used a newline right before ASSISTANT.

@jamt9000

Interesting, llama.cpp also has the same newline (before ASSISTANT and before the first USER after the system prompt). I guess it is mostly compatible with the original LLaVA style if the user messages could end with a newline during training, although the Jinja template also isn't handling the system prompt (which seems to be added with no SYSTEM: prefix in llama.cpp and in the LLaVA repo's conv_vicuna_v1).

Will (and should?) </s> also get added after the ASSISTANT answer? I guess it will have been output by the model since it's the EOS token, but I'm not sure whether it gets stripped at some point before making it to Jinja.

@DarkLight1337 (Member, Author)

In the HuggingFace code above, you can see that the EOS token is included in the output. However, it is removed in vLLM, presumably in favor of returning the generated text in a more user-friendly form.
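If you do want to see the EOS token in the returned text, one option (a small sketch reusing the engine arguments from the transcript above) is to disable special-token stripping via SamplingParams:

from vllm import LLM, SamplingParams

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type="pixel_values",
    image_token_id=32000,
    image_input_shape="1,3,336,336",
    image_feature_size=576,
)
# skip_special_tokens=False keeps special tokens (including EOS) in the
# decoded output text.
params = SamplingParams(skip_special_tokens=False)
outputs = llm.generate("USER: What is 1+1?\nASSISTANT:", params)
print(outputs[0].outputs[0].text)  # the EOS token should now appear at the end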

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024
@DarkLight1337 (Member, Author)

After some offline discussion, I have handed over this feature to @ywang96, who has opened #5237. This PR is thus no longer a candidate for merging and exists only as a reference.

Still, I have updated this PR to no longer depend on #4028 and resolved conflicts with the current main branch.

@ywang96 closed this in #5237 on Jun 7, 2024
@DarkLight1337 deleted the openai-gpt4v branch on June 8, 2024 01:12
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024