[Frontend] Support GPT-4V Chat Completions API #4200

Closed
wants to merge 25 commits
25 commits
ce770f4
Use discriminated union in prompt parsing
DarkLight1337 Apr 12, 2024
6b016bc
Fix some type errors along the way
DarkLight1337 Apr 12, 2024
7620354
Some more fixes
DarkLight1337 Apr 12, 2024
7c3e6d9
Apply formatter
DarkLight1337 Apr 12, 2024
7bdc84e
Refactor prompt parsing so that it can be shared between Chat Complet…
DarkLight1337 Apr 12, 2024
a7d1098
Make code more readable
DarkLight1337 Apr 12, 2024
8b9d636
Move assertion to a more appropriate place
DarkLight1337 Apr 12, 2024
c48c13a
Add code documentation
DarkLight1337 Apr 12, 2024
3530362
Decompose `_validate_prompt_and_tokenize`
DarkLight1337 Apr 12, 2024
b8feec9
Fix missing import due to renaming
DarkLight1337 Apr 12, 2024
89d9086
Merge branch 'upstream' into openai-typing
DarkLight1337 Apr 13, 2024
cc1a5b3
Fix bug when parsing array of tokens
DarkLight1337 Apr 13, 2024
f9c1135
Add token array to batch completions testing
DarkLight1337 Apr 13, 2024
f2e8180
Replace legacy `conint` with `Annotated` field
DarkLight1337 Apr 14, 2024
797326b
Merge branch 'upstream' into openai-typing
DarkLight1337 Apr 19, 2024
a26badd
Support image processor
DarkLight1337 Apr 19, 2024
8f991a3
Merge branch 'mm-data-processor' into openai-gpt4v
DarkLight1337 Apr 19, 2024
32aa3c7
Support GPT-4V Chat Completions API - Update VLM docs accordingly
DarkLight1337 Apr 19, 2024
5e099be
Chat template usage is already documented so no need to mention it again
DarkLight1337 May 8, 2024
6883061
Merge branch 'upstream' into openai-gpt4v
DarkLight1337 Jun 4, 2024
3d376bf
Fix some merge issues
DarkLight1337 Jun 4, 2024
81676b4
Update doc
DarkLight1337 Jun 4, 2024
a8d4875
Code cleanup and fix wrong inputs
DarkLight1337 Jun 4, 2024
57d65eb
Fix tests w.r.t. #5026
DarkLight1337 Jun 4, 2024
ddf3f06
Fix wrong number of expected tokens
DarkLight1337 Jun 4, 2024
4 changes: 2 additions & 2 deletions docs/source/models/supported_models.rst
@@ -89,8 +89,8 @@ Alongside each architecture, we include some popular models that use it.
- ✅︎
* - :code:`LlavaForConditionalGeneration`
- LLaVA-1.5
- :code:`llava-hf/llava-1.5-7b-hf`\*, :code:`llava-hf/llava-1.5-13b-hf`\*, etc.
-
- :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
-
* - :code:`MiniCPMForCausalLM`
- MiniCPM
- :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, etc.
41 changes: 41 additions & 0 deletions docs/source/models/vlm.rst
@@ -54,3 +54,44 @@ For now, we only support a single image per text prompt. To pass an image to the
print(generated_text)

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.

OpenAI-Compatible Server
------------------------

We support image inputs to the OpenAI Chat API, as described in `GPT-4 with Vision <https://platform.openai.com/docs/guides/vision>`_.

Here is a simple example using the :code:`openai` package:

.. code-block:: python

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)

# Note that this model expects the image to come before the main text
chat_response = client.chat.completions.create(
model="llava-hf/llava-1.5-7b-hf",
messages=[{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
},
},
{"type": "text", "text": "What's in this image?"},
],
}],
)
print("Chat response:", chat_response)

.. note::

For now, we only support a single image per API call. Also, the ``detail`` parameter is ignored since it may not be applicable to other models.
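
The same request can also be sent as plain JSON over HTTP, without the :code:`openai` package. Below is a minimal sketch using the :code:`requests` library; the server URL, model name and ``max_tokens`` value are assumptions that mirror the example above, not required settings.

.. code-block:: python

    import requests

    # Assumption: a vLLM OpenAI-compatible server is already running locally.
    api_url = "http://localhost:8000/v1/chat/completions"

    payload = {
        "model": "llava-hf/llava-1.5-7b-hf",
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
                {"type": "text", "text": "What's in this image?"},
            ],
        }],
        "max_tokens": 64,
    }

    response = requests.post(api_url, json=payload)
    response.raise_for_status()
    print("Chat response:", response.json()["choices"][0]["message"]["content"])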
11 changes: 11 additions & 0 deletions examples/template_llava.jinja
@@ -0,0 +1,11 @@
{%- for message in messages -%}
{{ message['role'].upper() + ': ' + message['content'] }}
{%- if (loop.last and add_generation_prompt) or not loop.last -%}
{{- '\n' -}}

jamt9000 commented (May 20, 2024):

I think there should actually be no '\n'.

Going by the vicuna_v1 conversation style used for llava-v1.5-13b (e.g. here), which has sep2="</s>":

https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/conversation.py#L242-L252

The initial prompt will look like this:

In [3]: from llava import conversation

In [4]: conv = conversation.conv_vicuna_v1.copy()

In [5]: conv.append_message(conv.roles[0], "Hi")

In [6]: conv.append_message(conv.roles[1], None)

In [7]: conv.get_prompt()
Out[7]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT:"

And then continues like this with a </s> after each assistant response.

In [9]: conv.messages[-1][-1] = " Hello I am LLaVA".strip()  # the model's reply starts with a leading space, which looks like it gets stripped, e.g. here
In [10]: conv.get_prompt()
Out[10]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>"
In [11]: conv.append_message(conv.roles[0], "What is the capital of France?")
In [12]: conv.append_message(conv.roles[1], None)
In [13]: conv.get_prompt()
Out[13]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>USER: What is the capital of France? ASSISTANT:"

DarkLight1337 (Member, Author) replied:

I modeled the chat template according to their HF repo. Their example used a newline right before ASSISTANT.

jamt9000 replied:

Interesting, llama.cpp also does the same (a newline before ASSISTANT and before the first USER after the system prompt), so I guess it is mostly compatible with the original LLaVA style if the user messages could end with a newline during training. However, the Jinja template also isn't handling the system prompt (which seems to be added with no SYSTEM: prefix in both llama.cpp and the LLaVA repo's conv_vicuna_v1).

Will (and should?) </s> also get added after the ASSISTANT answer? I guess it will have been output by the model since it's the EOS token, but I'm not sure whether it gets stripped at some point before making it to the Jinja template.

DarkLight1337 (Member, Author) replied:

In the HuggingFace code (above), you can see that the EOS token is included in the output. However, this is removed in vLLM, presumably in favor of returning the generated text in a more user-friendly manner.

{%- endif -%}
{%- endfor -%}


{%- if add_generation_prompt and messages[-1]['role'] != 'assistant' -%}
{{- 'ASSISTANT:' -}}
{% endif %}
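
For reference, here is a minimal sketch of how this template renders, using the jinja2 package directly. It assumes the script is run from the vLLM repo root so the template file is present, and the expected output assumes default Jinja whitespace settings, so server-side rendering may differ slightly.

from pathlib import Path

from jinja2 import Template

# Assumption: executed from the vLLM repository root.
template_text = Path("examples/template_llava.jinja").read_text()

messages = [{"role": "user", "content": "What's in this image?"}]
prompt = Template(template_text).render(messages=messages,
                                        add_generation_prompt=True)
print(repr(prompt))
# With default Jinja settings this should print something like:
# "USER: What's in this image?\nASSISTANT:"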
92 changes: 47 additions & 45 deletions tests/entrypoints/test_openai_server.py
@@ -558,50 +558,52 @@ async def test_chat_streaming(server, client: openai.AsyncOpenAI,
)
async def test_batch_completions(server, client: openai.AsyncOpenAI,
model_name: str):
# test simple list
batch = await client.completions.create(
model=model_name,
prompt=["Hello, my name is", "Hello, my name is"],
max_tokens=5,
temperature=0.0,
)
assert len(batch.choices) == 2
assert batch.choices[0].text == batch.choices[1].text

# test n = 2
batch = await client.completions.create(
model=model_name,
prompt=["Hello, my name is", "Hello, my name is"],
n=2,
max_tokens=5,
temperature=0.0,
extra_body=dict(
# NOTE: this has to be true for n > 1 in vLLM, but not necessary
# for official client.
use_beam_search=True),
)
assert len(batch.choices) == 4
assert batch.choices[0].text != batch.choices[
1].text, "beam search should be different"
assert batch.choices[0].text == batch.choices[
2].text, "two copies of the same prompt should be the same"
assert batch.choices[1].text == batch.choices[
3].text, "two copies of the same prompt should be the same"

# test streaming
batch = await client.completions.create(
model=model_name,
prompt=["Hello, my name is", "Hello, my name is"],
max_tokens=5,
temperature=0.0,
stream=True,
)
texts = [""] * 2
async for chunk in batch:
assert len(chunk.choices) == 1
choice = chunk.choices[0]
texts[choice.index] += choice.text
assert texts[0] == texts[1]
# test using text and token IDs
for prompts in (["Hello, my name is"] * 2, [[0, 0, 0, 0, 0]] * 2):
# test simple list
batch = await client.completions.create(
model=model_name,
prompt=prompts,
max_tokens=5,
temperature=0.0,
)
assert len(batch.choices) == 2
assert batch.choices[0].text == batch.choices[1].text

# test n = 2
batch = await client.completions.create(
model=model_name,
prompt=prompts,
n=2,
max_tokens=5,
temperature=0.0,
extra_body=dict(
# NOTE: this has to be true for n > 1 in vLLM, but not necessary
# for official client.
use_beam_search=True),
)
assert len(batch.choices) == 4
assert batch.choices[0].text != batch.choices[
1].text, "beam search should be different"
assert batch.choices[0].text == batch.choices[
2].text, "two copies of the same prompt should be the same"
assert batch.choices[1].text == batch.choices[
3].text, "two copies of the same prompt should be the same"

# test streaming
batch = await client.completions.create(
model=model_name,
prompt=prompts,
max_tokens=5,
temperature=0.0,
stream=True,
)
texts = [""] * 2
async for chunk in batch:
assert len(chunk.choices) == 1
choice = chunk.choices[0]
texts[choice.index] += choice.text
assert texts[0] == texts[1]


@pytest.mark.asyncio
@@ -1047,7 +1049,7 @@ async def test_echo_logprob_completion(server, client: openai.AsyncOpenAI,
prompt_text = tokenizer.decode(prompt) if isinstance(prompt,
list) else prompt
assert (completion.choices[0].text is not None
and re.search(r"^" + prompt_text, completion.choices[0].text))
and completion.choices[0].text.startswith(prompt_text))
logprobs = completion.choices[0].logprobs
assert logprobs is not None
assert len(logprobs.text_offset) > 5
176 changes: 176 additions & 0 deletions tests/entrypoints/test_openai_server_vision.py
@@ -0,0 +1,176 @@
from pathlib import Path

import openai # use the official client for correctness check
import pytest
# using Ray for overall ease of process management, parallel requests,
# and debugging.
import ray

from ..utils import ServerRunner

MODEL_NAME = "llava-hf/llava-1.5-7b-hf"
CHAT_TEMPLATE = (Path(__file__).parent.parent.parent /
"examples/template_llava.jinja")
assert CHAT_TEMPLATE.exists()

# Test different image extensions (JPG/PNG) and formats (gray/RGB/RGBA)
TEST_IMAGE_URLS = [
"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
"https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png",
"https://upload.wikimedia.org/wikipedia/commons/thumb/9/91/Venn_diagram_rgb.svg/1280px-Venn_diagram_rgb.svg.png",
"https://upload.wikimedia.org/wikipedia/commons/0/0b/RGBA_comp.png",
]

pytestmark = pytest.mark.openai


@pytest.fixture(scope="module")
def server():
ray.init()
server_runner = ServerRunner.remote([
"--model",
MODEL_NAME,
# use half precision for speed and memory savings in CI environment
"--dtype",
"bfloat16",
"--max-model-len",
"4096",
"--enforce-eager",
# vision language config below
"--image-input-type",
"pixel_values",
"--image-token-id",
"32000",
"--image-input-shape",
"1,3,336,336",
"--image-feature-size",
"576",
# chat template required for LLaVA
"--chat-template",
str(CHAT_TEMPLATE),
])
ray.get(server_runner.ready.remote())
yield server_runner
ray.shutdown()


@pytest.fixture(scope="session")
def client():
client = openai.AsyncOpenAI(
base_url="http://localhost:8000/v1",
api_key="token-abc123",
)
yield client


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
@pytest.mark.parametrize("image_url", TEST_IMAGE_URLS)
async def test_single_chat_session_image(server, client: openai.AsyncOpenAI,
model_name: str, image_url: str):
messages = [{
"role":
"user",
"content": [
{
"type": "image_url",
"image_url": {
"url": image_url
}
},
{
"type": "text",
"text": "What's in this image?"
},
],
}]

# test single completion
chat_completion = await client.chat.completions.create(model=model_name,
messages=messages,
max_tokens=10,
logprobs=True,
top_logprobs=5)
assert chat_completion.id is not None
assert len(chat_completion.choices) == 1

choice = chat_completion.choices[0]
assert choice.finish_reason == "length"
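    # Assumption: the 594 prompt tokens presumably cover the 576 image feature
    # tokens (--image-feature-size above) plus the tokenized text prompt.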
assert chat_completion.usage == openai.types.CompletionUsage(
completion_tokens=10, prompt_tokens=594, total_tokens=604)

message = choice.message
assert message.content is not None and len(message.content) >= 10
assert message.role == "assistant"
messages.append({"role": "assistant", "content": message.content})

# test multi-turn dialogue
messages.append({"role": "user", "content": "express your result in json"})
chat_completion = await client.chat.completions.create(
model=model_name,
messages=messages,
max_tokens=10,
)
message = chat_completion.choices[0].message
assert message.content is not None and len(message.content) >= 0


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
@pytest.mark.parametrize("image_url", TEST_IMAGE_URLS)
async def test_chat_streaming_image(server, client: openai.AsyncOpenAI,
model_name: str, image_url: str):
messages = [{
"role":
"user",
"content": [
{
"type": "image_url",
"image_url": {
"url": image_url
}
},
{
"type": "text",
"text": "What's in this image?"
},
],
}]

# test single completion
chat_completion = await client.chat.completions.create(
model=model_name,
messages=messages,
max_tokens=10,
temperature=0.0,
)
output = chat_completion.choices[0].message.content
stop_reason = chat_completion.choices[0].finish_reason

# test streaming
stream = await client.chat.completions.create(
model=model_name,
messages=messages,
max_tokens=10,
temperature=0.0,
stream=True,
)
chunks = []
finish_reason_count = 0
async for chunk in stream:
delta = chunk.choices[0].delta
if delta.role:
assert delta.role == "assistant"
if delta.content:
chunks.append(delta.content)
if chunk.choices[0].finish_reason is not None:
finish_reason_count += 1
# finish reason should only return in last block
assert finish_reason_count == 1
assert chunk.choices[0].finish_reason == stop_reason
assert delta.content
assert "".join(chunks) == output


if __name__ == "__main__":
pytest.main([__file__])