[Frontend] Support GPT-4V Chat Completions API #4200

Closed
wants to merge 25 commits

Conversation

@DarkLight1337 (Member) commented Apr 19, 2024

I have refactored the OpenAI-compatible server to support image input in the OpenAI Chat Completions API, as specified for GPT-4 with Vision (GPT-4V). This enables easy access to VLMs in vLLM via the OpenAI client.

FIX #2058
FIX #3873
FIX #4205
FIX #4826

Features

  1. Support for image input in the OpenAI Chat Completions API
    • Accepts both data URLs and web URLs (see the sketch after this list).
    • Accepts all image formats specified by OpenAI (PNG, JPEG, WEBP, GIF); additionally supports BMP and TIFF.
    • The detail parameter is not applicable to most models, so it is ignored.
    • Limitation: Only accepts a single image per API call.
    • The documentation for using VLMs has been updated accordingly.
    • A chat template (examples/template_llava.jinja) has been added to the repository so that you do not have to write your own to run llava-hf/llava-1.5-7b-hf and similar models (those models do not provide the template in config.json).
    • TODO: Verify whether it perfectly matches the one required for llava-hf/llava-v1.6-34b-hf. The model seems to work but I haven't checked the input prompt specifically.
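Here is the sketch referenced above: a minimal example of the data-URL path, assuming the standard openai Python client pointed at a local vLLM server (the base_url, API key, and image file name are placeholders, not values from this PR).

import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local image and embed it as a data URL instead of a web URL.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
            },
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)
print(chat_response.choices[0].message.content)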

Related Contributions

This PR directly resolves issues #2058 and #3873. It also supersedes PR #3467.

This PR completes part of #3978.

This PR depends on #4197; it is set to Draft status until it is merged. Afterwards, you should be able to see the real diffs made exclusively by this PR.

To avoid unnecessary resource usage, this branch is frozen (except for critical fixes) until its dependencies have all been merged.

Future Work

Develop a more extensible framework for representing images from the OpenAI API in the text prompt.

  • Currently, LlavaForConditionalGeneration models in vLLM require that the <image> token be repeated image_feature_size times (a sketch of this expansion follows this list). This differs from the documentation on HuggingFace, which specifies that only one <image> token needs to be provided. I think the implementation detail of duplicating <image> tokens in vLLM should be hidden from the user.
  • The framework should be applicable to multi-image input. MMICL may be a suitable model to showcase this functionality.
  • In my opinion, we can tackle this in one of two ways:
    1. If there are only a small number of standard methods, we can provide a config option to choose which method to apply. I have added the image_openai attribute to VisionLanguageConfig to facilitate this.
    2. A more flexible option would be to pass the image(s) to the chat template (e.g. by setting the images attribute alongside role and content). This transfers the implementation burden to the maintainers of the model on HuggingFace; since most existing templates do not handle images, it makes it more likely that vLLM users will have to implement their own template. I have created a ConversationMessage class to represent the dictionary for each message.
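Here is the sketch mentioned above: a rough illustration (not vLLM internals) of the expansion that users currently have to perform themselves. The placeholder string and helper name are assumptions of this sketch, and 576 is the llava-1.5 feature size used elsewhere in this thread.

IMAGE_TOKEN = "<image>"      # placeholder string assumed by this sketch
IMAGE_FEATURE_SIZE = 576     # llava-1.5 at 336x336 resolution

def expand_image_tokens(prompt: str) -> str:
    # Replace each single <image> marker with the repeated form that
    # LlavaForConditionalGeneration in vLLM currently expects.
    return prompt.replace(IMAGE_TOKEN, IMAGE_TOKEN * IMAGE_FEATURE_SIZE)

prompt = "USER: <image>\nWhat's in this image?\nASSISTANT:"
expanded = expand_image_tokens(prompt)
assert expanded.count(IMAGE_TOKEN) == IMAGE_FEATURE_SIZE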

@jamt9000 commented Apr 29, 2024

Hello, nice work adding multimodal support! I'm trying out your PRs but running into some issues. I'm starting the OpenAI server (using the Dockerfile) with:

python -m vllm.entrypoints.openai.api_server --model llava-hf/llava-1.5-7b-hf --image-input-type pixel_values --image-token-id 32000 --image-input-shape 1,3,336,336 --image-feature-size 576 --chat-template examples/template_llava.jinja

and this example from the docs in this PR:

chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            },
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)

But I'm getting these errors:

With #3978:

  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava.py", line 274, in forward
    inputs_embeds = _merge_vision_embeddings(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava.py", line 59, in _merge_vision_embeddings
    raise ValueError(f"image_feature_size should be {image_feature_size}, "
ValueError: image_feature_size should be 576, but found: 0

With #4200:

  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava.py", line 260, in forward
    inputs_embeds = _merge_vision_embeddings(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llava.py", line 55, in _merge_vision_embeddings
    inputs_embeds[mask] = vision_embeddings.view(-1,
RuntimeError: shape mismatch: value tensor of shape [576, 4096] cannot be broadcast to indexing result of shape [0, 4096]

So it seems like it isn't adding the image tokens correctly. Am I doing something wrong? Would appreciate it if you could point me in the right direction!
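(For context, both tracebacks come down to a check along the lines of the following simplified sketch, which is not vLLM's actual code: the positions of the image token in input_ids are overwritten with the vision encoder outputs, so a prompt that ends up containing no image tokens yields an empty mask and fails in exactly these two ways.)

import torch

def merge_vision_embeddings_sketch(
    input_ids: torch.Tensor,
    inputs_embeds: torch.Tensor,
    vision_embeddings: torch.Tensor,
    image_token_id: int = 32000,
    image_feature_size: int = 576,
) -> torch.Tensor:
    # Positions in the prompt that hold the image token.
    mask = input_ids == image_token_id
    if mask.sum() != image_feature_size:
        # Corresponds to the ValueError seen with #3978.
        raise ValueError(f"image_feature_size should be {image_feature_size}, "
                         f"but found: {mask.sum()}")
    # Corresponds to the shape-mismatch RuntimeError seen with #4200 when the
    # mask selects zero rows.
    inputs_embeds[mask] = vision_embeddings.view(-1, inputs_embeds.shape[-1])
    return inputs_embeds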

Edit: OK, it seems to be working when run outside of Docker, so it seems the problem is that --chat-template examples/template_llava.jinja isn't in the working directory of the Dockerfile, and that instead of failing when it can't find the file, it treats the filename as a literal string template. I see now this was fixed in #4292 to catch such a scenario!
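(For anyone hitting the same problem, the guard amounts to something like the sketch below; this is only the idea, not the actual code from #4292, and the helper name is hypothetical.)

from pathlib import Path

def load_chat_template(chat_template: str) -> str:
    # Hypothetical helper: prefer reading the argument as a file path.
    path = Path(chat_template)
    if path.is_file():
        return path.read_text()
    # Fail loudly when the argument looks like a template filename that does
    # not exist, instead of silently rendering the filename itself as a
    # literal Jinja template.
    if chat_template.endswith(".jinja"):
        raise FileNotFoundError(f"Chat template file not found: {chat_template}")
    # Otherwise, assume the argument is a literal Jinja template string.
    return chat_template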

rkooo567 pushed a commit that referenced this pull request May 13, 2024
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)

Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported relative to the test packages.
@happywu commented May 16, 2024

hi @DarkLight1337,

I found that the LLaVA model with GPT-4V-style chat completion always outputs a whitespace at the beginning, while pure text models do not have this issue. Do you have any idea why this might happen?

ChatCompletion(id='cmpl-564bda62019242aa9ce6cf3d3e955ae7', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image captures a lush green field under a blue cloudy sky. A boardwalk or wooden trail is positioned in the middle of the field, leading towards a wooded area. The trail appears to be a pathway designed for walking or recreational purposes.\n\nIn the surrounding landscape, various trees can be seen in the distance, adding to the serene atmosphere of the scene. The combination of the vibrant green grass, the blue sky, and the trail creates a picturesque outdoor setting.', role='assistant', function_call=None, tool_calls=None), stop_reason=None)], created=1715818283, model='llava-hf/llava-1.5-7b-hf', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=107, prompt_tokens=595, total_tokens=702))

@DarkLight1337 (Member, Author)

I found that the LLaVA model with GPT-4V-style chat completion always outputs a whitespace at the beginning, while pure text models do not have this issue. Do you have any idea why this might happen?

The inputs to AsyncLLMEngine in the OpenAI-compatible server seem correct. Perhaps it is an issue with the updated implementation of LLaVA (#4197). I'll take a deeper look later today.

@DarkLight1337 (Member, Author)

I have narrowed down the issue to this line. For some reason, new_decoded_token_text is pre-padded with extra whitespace characters.

(new_tokens, new_decoded_token_text, prefix_offset,

@DarkLight1337 (Member, Author) commented May 17, 2024

After some digging, it seems that this is expected behaviour: because there is no whitespace after "ASSISTANT:" in the prompt, the language model fills it in automatically, resulting in the output "ASSISTANT: <text>" (instead of "ASSISTANT:<text>").

vLLM code

>>> from vllm import LLM
>>> llm = LLM(
...     model="llava-hf/llava-1.5-7b-hf",
...     image_input_type="pixel_values",
...     image_token_id=32000,
...     image_input_shape="1,3,336,336",
...     image_feature_size=576,
... )
>>> [o.outputs[0].text for o in llm.generate("USER: What is 1+1?\nASSISTANT:")]
[' 1+1 is equal to 2.']
>>> [o.outputs[0].text for o in llm.generate("USER: What is 1+1?\nASSISTANT: ")]
['1+1 is equal to 2.']

HuggingFace code

>>> from transformers import AutoProcessor, LlavaForConditionalGeneration
>>> llm = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
>>> processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
>>> processor.batch_decode(llm.generate(**processor(text="USER: What is 1+1?\nASSISTANT:"), max_new_tokens=15))
['<s> USER: What is 1+1?\nASSISTANT: 1+1 is equal to 2.</s>']
>>> processor.batch_decode(llm.generate(**processor(text="USER: What is 1+1?\nASSISTANT: "), max_new_tokens=15))
['<s> USER: What is 1+1?\nASSISTANT: 1+1 is equal to 2.</s>']

I have tried adding a space after "ASSISTANT:" in the chat template but it harms the performance of the model, particularly when an image is included in the input.

robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 19, 2024

Review thread on examples/template_llava.jinja:

{%- for message in messages -%}
{{ message['role'].upper() + ': ' + message['content'] }}
{%- if (loop.last and add_generation_prompt) or not loop.last -%}
{{- '\n' -}}
@jamt9000 commented May 20, 2024

I think there should actually be no '\n'

Going by the vicuna_v1 conversation style used for llava-v1.5-13b, e.g. here

Which has sep2="</s>"

https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/conversation.py#L242-L252

The initial prompt will look like this:

In [3]: from llava import conversation

In [4]: conv = conversation.conv_vicuna_v1.copy()

In [5]: conv.append_message(conv.roles[0], "Hi")

In [6]: conv.append_message(conv.roles[1], None)

In [7]: conv.get_prompt()
Out[7]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT:"

And then continues like this with a </s> after each assistant response.

In [9]: conv.messages[-1][-1] = " Hello I am LLaVA".strip()  # The model's reply will start with a space, which looks like it gets stripped, e.g. here
In [10]: conv.get_prompt()
Out[10]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>"
In [11]: conv.append_message(conv.roles[0], "What is the capital of France?")
In [12]: conv.append_message(conv.roles[1], None)
In [13]: conv.get_prompt()
Out[13]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>USER: What is the capital of France? ASSISTANT:"

@DarkLight1337 (Member, Author)

I modeled the chat template according to their HF repo. Their example used a newline right before ASSISTANT.

@jamt9000

Interesting, llama.cpp also has the same newline (before ASSISTANT and before the first USER after the system prompt). I guess it is mostly compatible with the original LLaVA style if the user messages could end with a newline during training, although the Jinja template also isn't handling the system prompt (which seems to be added with no SYSTEM: prefix in llama.cpp and in the LLaVA repo's conv_vicuna_v1).

Will (and should?) </s> also get added after the ASSISTANT answer? I guess it will have been output by the model since it's the EOS token, but I'm not sure whether it gets stripped at some point before making it to Jinja.

@DarkLight1337 (Member, Author)

In the HuggingFace code above, you can see that the EOS token is included in the output. However, it is removed in vLLM, presumably in favor of returning the generated text in a more user-friendly form.
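If you do want to see the EOS token in the returned text, one option (a small sketch reusing the engine arguments from the transcript above) is to disable special-token stripping via SamplingParams:

from vllm import LLM, SamplingParams

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type="pixel_values",
    image_token_id=32000,
    image_input_shape="1,3,336,336",
    image_feature_size=576,
)
# skip_special_tokens=False keeps special tokens (including EOS) in the
# decoded output text.
params = SamplingParams(skip_special_tokens=False)
outputs = llm.generate("USER: What is 1+1?\nASSISTANT:", params)
print(outputs[0].outputs[0].text)  # the EOS token should now appear at the end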

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024
@DarkLight1337 (Member, Author)

After some offline discussion, I have handed over this feature to @ywang96, who has opened #5237. This PR is thus no longer a candidate for merging and exists only as a reference.

Still, I have updated this PR to no longer depend on #4028 and resolved conflicts with the current main branch.

@ywang96 closed this in #5237 on Jun 7, 2024
@DarkLight1337 deleted the openai-gpt4v branch on June 8, 2024 01:12
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024