
[Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API #3978

Closed

Commits (60), showing changes from all commits
874a581
Add basic support for OpenAI image input API
DarkLight1337 Apr 8, 2024
607434e
Update documentation
DarkLight1337 Apr 9, 2024
aaa6bfe
Add tests for OpenAI image input API and image loader
DarkLight1337 Apr 9, 2024
26e7b2a
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 11, 2024
44829b5
Apply formatter
DarkLight1337 Apr 11, 2024
bccb367
Place image before text for `llava-hf` model
DarkLight1337 Apr 11, 2024
b9302e8
Internally enable customization of merging image with text prompt
DarkLight1337 Apr 11, 2024
a44d7d1
Fix errors in CI/CD
DarkLight1337 Apr 11, 2024
561ad49
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 12, 2024
4479605
Fix some type errors along the way
DarkLight1337 Apr 12, 2024
20852d9
Improve async behaviour of loading images
DarkLight1337 Apr 12, 2024
ce770f4
Use discriminated union in prompt parsing
DarkLight1337 Apr 12, 2024
6b016bc
Fix some type errors along the way
DarkLight1337 Apr 12, 2024
7620354
Some more fixes
DarkLight1337 Apr 12, 2024
7c3e6d9
Apply formatter
DarkLight1337 Apr 12, 2024
e74b0a7
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 12, 2024
9925dcb
Move `openai` to common requirements
DarkLight1337 Apr 12, 2024
ceb4e35
Fix typo in `_parse_chat_message_image_input`
DarkLight1337 Apr 12, 2024
7bdc84e
Refactor prompt parsing so that it can be shared between Chat Complet…
DarkLight1337 Apr 12, 2024
a7d1098
Make code more readable
DarkLight1337 Apr 12, 2024
8b9d636
Move assertion to a more appropriate place
DarkLight1337 Apr 12, 2024
9754142
Merge branch 'openai-typing' into openai-vision-api
DarkLight1337 Apr 12, 2024
c48c13a
Add code documentation
DarkLight1337 Apr 12, 2024
3530362
Decompose `_validate_prompt_and_tokenize`
DarkLight1337 Apr 12, 2024
b8feec9
Fix missing import due to renaming
DarkLight1337 Apr 12, 2024
9cae113
Merge branch 'openai-typing' into openai-vision-api
DarkLight1337 Apr 12, 2024
89d9086
Merge branch 'upstream' into openai-typing
DarkLight1337 Apr 13, 2024
cc1a5b3
Fix bug when parsing array of tokens
DarkLight1337 Apr 13, 2024
f9c1135
Add token array to batch completions testing
DarkLight1337 Apr 13, 2024
ecc2d50
Merge branch 'openai-typing' into openai-vision-api
DarkLight1337 Apr 14, 2024
f2e8180
Replace legacy `conint` with `Annotated` field
DarkLight1337 Apr 14, 2024
ce04842
Merge branch 'openai-typing' into openai-vision-api
DarkLight1337 Apr 14, 2024
cdbf08a
Load image processor from HuggingFace
DarkLight1337 Apr 14, 2024
9a336ec
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 14, 2024
5722dd8
Allow disabling image processor
DarkLight1337 Apr 14, 2024
6e1fa67
Fix errors when running the example and tests
DarkLight1337 Apr 15, 2024
7ce44da
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 15, 2024
9804604
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 16, 2024
21434df
Add test for loading image processor by revision
DarkLight1337 Apr 16, 2024
a5907b0
Temporary patch for llava-1.5-13b to facilitate testing
DarkLight1337 Apr 16, 2024
f08ff10
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 17, 2024
c126646
Fix issue with pickling config when serving LLaVA with multiple GPUs
DarkLight1337 Apr 17, 2024
49ba216
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 18, 2024
11e9921
Add TODO to test
DarkLight1337 Apr 18, 2024
7ae80a2
Try to avoid OOM by using `--enforce-eager`
DarkLight1337 Apr 18, 2024
2610bea
Reduce number of models to test to avoid OOM
DarkLight1337 Apr 18, 2024
5ad2b67
Try testing 13b model only
DarkLight1337 Apr 18, 2024
696357b
Refactor image processing, `MultiModalData` and LLaVA model
DarkLight1337 Apr 18, 2024
483b190
Fix image processing not working directly, due to tensor being passed
DarkLight1337 Apr 18, 2024
3e22017
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 18, 2024
0b6af35
Revert to using 7b model in testing
DarkLight1337 Apr 18, 2024
e4c3502
Get LLaVA-Next to work with fixed-size images
DarkLight1337 Apr 18, 2024
21aaf3d
Apply formatter and fix typo
DarkLight1337 Apr 18, 2024
ac95b79
Fix input shape not being based on config value
DarkLight1337 Apr 18, 2024
9a9a4e7
Allow config to specify other image size for LLaVA-NeXT
DarkLight1337 Apr 18, 2024
176ad2c
Improve error message to show the expected `image_feature_size`
DarkLight1337 Apr 18, 2024
91ea044
Fix dtype mismatch in `multi_modal_kwargs`
DarkLight1337 Apr 19, 2024
cb19743
Fix LLaVA example and test w.r.t. image processing refactor
DarkLight1337 Apr 19, 2024
019f473
Merge branch 'upstream' into openai-vision-api
DarkLight1337 Apr 19, 2024
f882d99
Fix circular import and set return type
DarkLight1337 Apr 19, 2024
3 changes: 3 additions & 0 deletions .buildkite/test-pipeline.yaml
@@ -82,6 +82,9 @@ steps:
- label: LogitsProcessor Test
  command: pytest -v -s test_logits_processor.py

- label: Utils Test
  command: pytest -v -s test_utils.py

- label: Worker Test
  command: pytest -v -s worker

1 change: 1 addition & 0 deletions README.md
@@ -70,6 +70,7 @@ vLLM seamlessly supports many Hugging Face models, including the following architectures:
- InternLM2 (`internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc.)
- Jais (`core42/jais-13b`, `core42/jais-13b-chat`, `core42/jais-30b-v3`, `core42/jais-30b-chat-v3`, etc.)
- LLaMA, Llama 2, and Meta Llama 3 (`meta-llama/Meta-Llama-3-8B-Instruct`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`, `llava-hf/llava-1.5-13b-hf`, etc.)
- MiniCPM (`openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, etc.)
- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
- Mixtral (`mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc.)
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -85,6 +85,7 @@ Documentation
   models/adding_model
   models/engine_args
   models/lora
   models/vlm

.. toctree::
   :maxdepth: 1
18 changes: 18 additions & 0 deletions docs/source/models/supported_models.rst
@@ -83,6 +83,24 @@ Alongside each architecture, we include some popular models that use it.
    - LLaMA, Llama 2, Meta Llama 3, Vicuna, Alpaca, Yi
    - :code:`meta-llama/Meta-Llama-3-8B-Instruct`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-13b-hf`, :code:`meta-llama/Llama-2-70b-hf`, :code:`openlm-research/open_llama_13b`, :code:`lmsys/vicuna-13b-v1.3`, :code:`01-ai/Yi-6B`, :code:`01-ai/Yi-34B`, etc.
    - ✅︎
  * - :code:`LlavaForConditionalGeneration`
    - LLaVA-1.5
    - :code:`llava-hf/llava-1.5-7b-hf`\*, :code:`llava-hf/llava-1.5-13b-hf`\*, etc.
    -

.. note::

   Models marked with an asterisk (\*) are missing a :code:`chat_template` in their HuggingFace model repository. A predefined template can be found in our repo at :code:`examples/template_llava.jinja`. To host the OpenAI-compatible server, provide this chat template via the :code:`--chat-template` argument. You also need to pass the :code:`VisionLanguageConfig` arguments to initialize the model. See the following example:

   .. code-block:: shell

      $ python -m vllm.entrypoints.openai.api_server \
          --model llava-hf/llava-1.5-7b-hf \
          --chat-template examples/template_llava.jinja \
          --image-input-type pixel_values \
          --image-token-id 32000 \
          --image-input-shape 1,3,336,336 \
          --image-feature-size 576

  * - :code:`MiniCPMForCausalLM`
    - MiniCPM
    - :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, etc.
118 changes: 118 additions & 0 deletions docs/source/models/vlm.rst
@@ -0,0 +1,118 @@
.. _vlm:

Using VLMs
==========

This document shows you how to run and serve Vision Language Models (VLMs) using vLLM.

Additional Engine Arguments
---------------------------

In addition to the :ref:`basic engine arguments <engine_args>`, VLMs require the following engine arguments in vLLM.

.. option:: --image-input-type {pixel_values,image_features}

   The image input type passed into vLLM. Should be one of "pixel_values" or "image_features".

.. option:: --image-token-id <id>

   Input ID of the image token.

.. option:: --image-input-shape <tuple>

   The largest image input shape (worst case for the memory footprint) given an input type. This is only used for vLLM's :code:`profile_run`.

   For example, if the image tensor has shape :code:`(1, 3, 336, 336)`, then you should pass :code:`--image-input-shape 1,3,336,336`.

.. option:: --image-feature-size <size>

   The image feature size along the context dimension.

.. option:: --image-processor <name or path>

   Name or path of the HuggingFace image processor to use.

.. option:: --image-processor-revision <revision>

   The specific image processor version to use. It can be a branch name, a tag name, or a commit id. If unspecified, the default version is used.

.. option:: --no-image-processor

   Disables the use of the image processor, even if one is defined for the model on HuggingFace.
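
As a sanity check on the values used throughout this page, the LLaVA-1.5 numbers can be derived from its vision encoder (CLIP ViT-L/14 at a 336x336 input resolution): a 336x336 image split into 14x14 patches gives a 24x24 grid, hence an image feature size of 576. The sketch below is purely illustrative and not part of vLLM:

.. code-block:: python

    # Illustrative only: derive --image-feature-size for LLaVA-1.5
    # from the 336x336 input resolution and 14x14 patch size assumed above.
    image_size = 336
    patch_size = 14

    patches_per_side = image_size // patch_size   # 24
    image_feature_size = patches_per_side ** 2    # 576

    # These match the flags used on this page:
    #   --image-input-shape 1,3,336,336
    #   --image-feature-size 576
    print(image_feature_size)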

Offline Batched Inference
-------------------------

To initialize a VLM, pass the aforementioned arguments to the ``LLM`` class when instantiating the engine.

.. code-block:: python

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_input_type="pixel_values",
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

For now, we only support a single image per text prompt when calling ``llm.generate``. To pass an image to the model, note the following parameters:

* ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
* ``multi_modal_datas``: This should be an instance of ``ImagePixelData``.

.. code-block:: python

    from vllm.sequence import ImagePixelData

    prompt = "<image>" * 576 + (
        "\nUSER: What is the content of this image?\nASSISTANT:")

    # Load the image using PIL.Image
    image = ...

    outputs = llm.generate(prompt, multi_modal_datas=ImagePixelData(image))

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
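
The ``image = ...`` placeholder above can be filled with any ``PIL.Image``. As a minimal sketch (plain ``PIL``/``requests`` usage, nothing vLLM-specific; the URL is just an example), you could download one from the web:

.. code-block:: python

    import requests
    from PIL import Image

    # Fetch an example image and decode it into an RGB PIL image.
    image_url = ("https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/"
                 "Gfp-wisconsin-madison-the-nature-boardwalk.jpg/"
                 "2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg")
    image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")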

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.

OpenAI-Compatible Server
------------------------

We support image inputs to the OpenAI Chat API, as described in `GPT-4 with Vision <https://platform.openai.com/docs/guides/vision>`_.

Here is a simple example using the :code:`openai` package:

.. code-block:: python

    from openai import OpenAI

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Note that this model expects the image to come before the main text
    chat_response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
                {"type": "text", "text": "What's in this image?"},
            ],
        }],
    )
    print("Chat response:", chat_response)

.. note::

   For now, we only support a single image per API call. Also, the ``detail`` parameter is ignored since it may not be applicable to other models.
15 changes: 6 additions & 9 deletions examples/llava_example.py
@@ -3,9 +3,10 @@
 import subprocess
 
 import torch
+from PIL import Image
 
 from vllm import LLM
-from vllm.sequence import MultiModalData
+from vllm.sequence import ImageFeatureData, ImagePixelData
 
 # The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`.
 
@@ -23,11 +24,9 @@ def run_llava_pixel_values():
         "\nUSER: What is the content of this image?\nASSISTANT:")
 
     # This should be provided by another online or offline component.
-    images = torch.load("images/stop_sign_pixel_values.pt")
+    image = Image.open("images/stop_sign.jpg")
 
-    outputs = llm.generate(prompt,
-                           multi_modal_data=MultiModalData(
-                               type=MultiModalData.Type.IMAGE, data=images))
+    outputs = llm.generate(prompt, multi_modal_datas=ImagePixelData(image))
     for o in outputs:
         generated_text = o.outputs[0].text
         print(generated_text)
@@ -46,11 +45,9 @@ def run_llava_image_features():
         "\nUSER: What is the content of this image?\nASSISTANT:")
 
     # This should be provided by another online or offline component.
-    images = torch.load("images/stop_sign_image_features.pt")
+    image: torch.Tensor = torch.load("images/stop_sign_image_features.pt")
 
-    outputs = llm.generate(prompt,
-                           multi_modal_data=MultiModalData(
-                               type=MultiModalData.Type.IMAGE, data=images))
+    outputs = llm.generate(prompt, multi_modal_datas=ImageFeatureData(image))
     for o in outputs:
         generated_text = o.outputs[0].text
         print(generated_text)
11 changes: 11 additions & 0 deletions examples/template_llava.jinja
@@ -0,0 +1,11 @@
{%- for message in messages -%}
{{ message['role'].upper() + ': ' + message['content'] }}
{%- if (loop.last and add_generation_prompt) or not loop.last -%}
{{- '\n' -}}
{%- endif -%}
{%- endfor -%}


{%- if add_generation_prompt and messages[-1]['role'] != 'assistant' -%}
{{- 'ASSISTANT:' -}}
{% endif %}
6 changes: 5 additions & 1 deletion requirements-common.txt
@@ -9,10 +9,14 @@ transformers >= 4.40.0 # Required for StarCoder2 & Llava, Llama 3.
 tokenizers >= 0.19.1 # Required for Llama 3.
 fastapi
 uvicorn[standard]
-pydantic >= 2.0 # Required for OpenAI server.
 prometheus_client >= 0.18.0
 tiktoken == 0.6.0 # Required for DBRX tokenizer
 lm-format-enforcer == 0.9.3
 outlines == 0.0.34 # Requires torch >= 2.1.0
 typing_extensions
 filelock >= 3.10.4 # filelock starts to support `mode` argument from 3.10.4
+
+# OpenAI server
+openai
+pydantic >= 2.0
+pillow
4 changes: 0 additions & 4 deletions requirements-dev.txt
@@ -21,7 +21,6 @@ pytest-rerunfailures
 pytest-shard
 httpx
 einops # required for MPT
-openai
 requests
 ray
 peft
@@ -30,6 +29,3 @@ ai2-olmo # required for OLMo
 
 # Benchmarking
 aiohttp
-
-# Multimodal
-pillow