[Core] Support image processor (vllm-project#4197)
DarkLight1337 authored and joerunde committed Jun 4, 2024
1 parent a53e398 commit b1deaf3
Showing 30 changed files with 1,043 additions and 257 deletions.
1 change: 1 addition & 0 deletions .github/workflows/mypy.yaml
@@ -37,6 +37,7 @@ jobs:
mypy vllm/distributed --config-file pyproject.toml
mypy vllm/entrypoints --config-file pyproject.toml
mypy vllm/executor --config-file pyproject.toml
mypy vllm/multimodal --config-file pyproject.toml
mypy vllm/usage --config-file pyproject.toml
mypy vllm/*.py --config-file pyproject.toml
mypy vllm/transformers_utils --config-file pyproject.toml
14 changes: 8 additions & 6 deletions docs/source/conf.py
@@ -90,6 +90,7 @@ def setup(app):
"sentencepiece",
"vllm.cuda_utils",
"vllm._C",
"PIL",
"numpy",
"tqdm",
"tensorizer",
@@ -116,12 +117,13 @@ def add_line(self, line: str, source: str, *lineno: int) -> None:
autodoc.ClassDocumenter = MockedClassDocumenter

intersphinx_mapping = {
'python': ('https://docs.python.org/3', None),
'typing_extensions':
('https://typing-extensions.readthedocs.io/en/latest', None),
'numpy': ('https://numpy.org/doc/stable', None),
'torch': ('https://pytorch.org/docs/stable', None),
'psutil': ('https://psutil.readthedocs.io/en/stable', None),
"python": ("https://docs.python.org/3", None),
"typing_extensions":
("https://typing-extensions.readthedocs.io/en/latest", None),
"pillow": ("https://pillow.readthedocs.io/en/stable", None),
"numpy": ("https://numpy.org/doc/stable", None),
"torch": ("https://pytorch.org/docs/stable", None),
"psutil": ("https://psutil.readthedocs.io/en/stable", None),
}

autodoc_preserve_defaults = True
51 changes: 51 additions & 0 deletions docs/source/dev/multimodal/multimodal_index.rst
@@ -0,0 +1,51 @@
Multi-Modality
==============

.. currentmodule:: vllm.multimodal

vLLM provides experimental support for multi-modal models through the :mod:`vllm.multimodal` package.

:class:`vllm.inputs.PromptStrictInputs` accepts an additional attribute ``multi_modal_data``
which allows you to pass in multi-modal input alongside text and token prompts.
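
As a quick illustration, below is a minimal sketch of passing an image alongside a text prompt.
It assumes an ``LLM`` instance that was already constructed with image input enabled for
LLaVA-1.5 (see :ref:`vlm` for the exact engine arguments); the image path is a placeholder.

.. code-block:: python

    from PIL import Image

    from vllm.multimodal.image import ImagePixelData

    # Placeholder path; any RGB image opened with PIL works here.
    image = Image.open("stop_sign.jpg")

    # The number of <image> tokens must equal image_feature_size
    # (576 in the LLaVA-1.5 setup).
    prompt = "<image>" * 576 + "\nUSER: What is in this image?\nASSISTANT:"

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": ImagePixelData(image),
    })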

By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model,
you must decorate the model class with :meth:`MULTIMODAL_REGISTRY.register_dummy_data <MultiModalRegistry.register_dummy_data>`,
as well as :meth:`MULTIMODAL_REGISTRY.register_input <MultiModalRegistry.register_input>` for each modality type that the model supports.

.. contents::
:local:
:backlinks: none

Module Contents
+++++++++++++++

.. automodule:: vllm.multimodal

Registry
--------

.. data:: vllm.multimodal.MULTIMODAL_REGISTRY

The global :class:`MultiModalRegistry` which is used by model runners.

.. autoclass:: vllm.multimodal.MultiModalRegistry
:members:
:show-inheritance:

Base Classes
------------

.. autoclass:: vllm.multimodal.MultiModalData
:members:
:show-inheritance:

.. autoclass:: vllm.multimodal.MultiModalPlugin
:members:
:show-inheritance:

Image Classes
-------------

.. automodule:: vllm.multimodal.image
:members:
:show-inheritance:
6 changes: 4 additions & 2 deletions docs/source/index.rst
@@ -88,6 +88,7 @@ Documentation
models/adding_model
models/engine_args
models/lora
models/vlm
models/performance

.. toctree::
@@ -99,18 +100,19 @@ Documentation
quantization/fp8_e4m3_kvcache

.. toctree::
:maxdepth: 2
:maxdepth: 1
:caption: Developer Documentation

dev/sampling_params
dev/offline_inference/offline_index
dev/engine/engine_index
dev/kernel/paged_attention
dev/dockerfile-ubi/dockerfile-ubi
dev/multimodal/multimodal_index
dev/dockerfile/dockerfile

.. toctree::
:maxdepth: 2
:maxdepth: 1
:caption: Community

community/meetups
4 changes: 4 additions & 0 deletions docs/source/models/supported_models.rst
@@ -87,6 +87,10 @@ Alongside each architecture, we include some popular models that use it.
- LLaMA, Llama 2, Meta Llama 3, Vicuna, Alpaca, Yi
- :code:`meta-llama/Meta-Llama-3-8B-Instruct`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-13b-hf`, :code:`meta-llama/Llama-2-70b-hf`, :code:`openlm-research/open_llama_13b`, :code:`lmsys/vicuna-13b-v1.3`, :code:`01-ai/Yi-6B`, :code:`01-ai/Yi-34B`, etc.
- ✅︎
* - :code:`LlavaForConditionalGeneration`
- LLaVA-1.5
- :code:`llava-hf/llava-1.5-7b-hf`\*, :code:`llava-hf/llava-1.5-13b-hf`\*, etc.
-
* - :code:`MiniCPMForCausalLM`
- MiniCPM
- :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, etc.
56 changes: 56 additions & 0 deletions docs/source/models/vlm.rst
@@ -0,0 +1,56 @@
.. _vlm:

Using VLMs
==========

This document shows you how to run and serve Vision Language Models (VLMs) using vLLM.

Engine Arguments
----------------

The following :ref:`engine arguments <engine_args>` are specific to VLMs:

.. argparse::
:module: vllm.engine.arg_utils
:func: _vlm_engine_args_parser
:prog: -m vllm.entrypoints.openai.api_server
:nodefaultconst:

Offline Batched Inference
-------------------------

To initialize a VLM, pass the aforementioned arguments to the ``LLM`` class when instantiating the engine.

.. code-block:: python

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_input_type="pixel_values",
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

For now, we only support a single image per text prompt. To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:

* ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
* ``multi_modal_data``: This should be an instance of :class:`~vllm.multimodal.image.ImagePixelData` or :class:`~vllm.multimodal.image.ImageFeatureData`.

.. code-block:: python

    prompt = "<image>" * 576 + (
        "\nUSER: What is the content of this image?\nASSISTANT:")

    # Load the image using PIL.Image
    image = ...

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": ImagePixelData(image),
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
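
Image features can also be passed in directly instead of pixel values. Below is a minimal sketch
mirroring ``run_llava_image_features`` in the example script; it assumes the engine was initialized
for image-feature input (``image_input_type="image_features"`` and a matching ``image_input_shape``
are assumed values here), and the tensor path refers to the downloadable test asset.

.. code-block:: python

    import torch

    from vllm.multimodal.image import ImageFeatureData

    # Pre-computed features from the vision encoder (downloadable test asset).
    image_features: torch.Tensor = torch.load(
        "images/stop_sign_image_features.pt")

    outputs = llm.generate({
        "prompt": "<image>" * 576 +
        "\nUSER: What is the content of this image?\nASSISTANT:",
        "multi_modal_data": ImageFeatureData(image_features),
    })
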
29 changes: 15 additions & 14 deletions examples/llava_example.py
@@ -3,33 +3,36 @@
import subprocess

import torch
from PIL import Image

from vllm import LLM
from vllm.sequence import MultiModalData
from vllm.multimodal.image import ImageFeatureData, ImagePixelData

# The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`.
# You can use `.buildkite/download-images.sh` to download them


def run_llava_pixel_values():
def run_llava_pixel_values(*, disable_image_processor: bool = False):
llm = LLM(
model="llava-hf/llava-1.5-7b-hf",
image_input_type="pixel_values",
image_token_id=32000,
image_input_shape="1,3,336,336",
image_feature_size=576,
disable_image_processor=disable_image_processor,
)

prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:")

# This should be provided by another online or offline component.
image = torch.load("images/stop_sign_pixel_values.pt")
if disable_image_processor:
image = torch.load("images/stop_sign_pixel_values.pt")
else:
image = Image.open("images/stop_sign.jpg")

outputs = llm.generate({
"prompt":
prompt,
"multi_modal_data":
MultiModalData(type=MultiModalData.Type.IMAGE, data=image),
"prompt": prompt,
"multi_modal_data": ImagePixelData(image),
})

for o in outputs:
@@ -49,15 +52,13 @@ def run_llava_image_features():
prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:")

# This should be provided by another online or offline component.
image = torch.load("images/stop_sign_image_features.pt")
image: torch.Tensor = torch.load("images/stop_sign_image_features.pt")

outputs = llm.generate({
"prompt":
prompt,
"multi_modal_data":
MultiModalData(type=MultiModalData.Type.IMAGE, data=image),
"prompt": prompt,
"multi_modal_data": ImageFeatureData(image),
})

for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
1 change: 1 addition & 0 deletions format.sh
@@ -101,6 +101,7 @@ mypy vllm/core --config-file pyproject.toml
mypy vllm/distributed --config-file pyproject.toml
mypy vllm/entrypoints --config-file pyproject.toml
mypy vllm/executor --config-file pyproject.toml
mypy vllm/multimodal --config-file pyproject.toml
mypy vllm/usage --config-file pyproject.toml
mypy vllm/*.py --config-file pyproject.toml
mypy vllm/transformers_utils --config-file pyproject.toml
1 change: 1 addition & 0 deletions requirements-common.txt
@@ -12,6 +12,7 @@ aiohttp
openai
uvicorn[standard]
pydantic >= 2.0 # Required for OpenAI server.
pillow # Required for image processing
prometheus_client >= 0.18.0
prometheus-fastapi-instrumentator >= 7.0.0
tiktoken >= 0.6.0 # Required for DBRX tokenizer
3 changes: 0 additions & 3 deletions requirements-dev.txt
@@ -33,8 +33,5 @@ sentence-transformers # required for embedding
# Benchmarking
aiohttp

# Multimodal
pillow

# quantization
bitsandbytes==0.42.0
45 changes: 24 additions & 21 deletions tests/conftest.py
@@ -15,7 +15,9 @@
from vllm.distributed import destroy_model_parallel
from vllm.inputs import TextPrompt
from vllm.logger import init_logger
from vllm.sequence import MultiModalData, SampleLogprobs
from vllm.multimodal import MultiModalData
from vllm.multimodal.image import ImageFeatureData, ImagePixelData
from vllm.sequence import SampleLogprobs

logger = init_logger(__name__)

@@ -24,6 +26,7 @@
_LONG_PROMPTS = [os.path.join(_TEST_DIR, "prompts", "summary.txt")]

# Multi modal related
# You can use `.buildkite/download-images.sh` to download the assets
_PIXEL_VALUES_FILES = [
os.path.join(_TEST_DIR, "images", filename) for filename in
["stop_sign_pixel_values.pt", "cherry_blossom_pixel_values.pt"]
@@ -89,17 +92,23 @@ def hf_images() -> List[Image.Image]:


@pytest.fixture()
def vllm_images(request) -> "torch.Tensor":
def vllm_images(request) -> List[MultiModalData]:
vision_language_config = request.getfixturevalue("model_and_config")[1]
all_images = []
if vision_language_config.image_input_type == (
VisionLanguageConfig.ImageInputType.IMAGE_FEATURES):
filenames = _IMAGE_FEATURES_FILES
return [
ImageFeatureData(torch.load(filename))
for filename in _IMAGE_FEATURES_FILES
]
else:
filenames = _PIXEL_VALUES_FILES
for filename in filenames:
all_images.append(torch.load(filename))
return torch.concat(all_images, dim=0)
return [
ImagePixelData(Image.open(filename)) for filename in _IMAGE_FILES
]


@pytest.fixture()
def vllm_image_tensors(request) -> List[torch.Tensor]:
return [torch.load(filename) for filename in _PIXEL_VALUES_FILES]


@pytest.fixture()
@@ -392,23 +401,17 @@ def generate(
self,
prompts: List[str],
sampling_params: SamplingParams,
images: Optional[torch.Tensor] = None,
images: Optional[List[MultiModalData]] = None,
) -> List[Tuple[List[List[int]], List[str]]]:
if images is not None:
assert len(prompts) == len(images)

prompt_inputs: List[TextPrompt] = []
for i, prompt in enumerate(prompts):
prompt = TextPrompt(prompt=prompt)
if images is not None:
prompt["multi_modal_data"] = MultiModalData(
type=MultiModalData.Type.IMAGE,
data=images[i:i + 1],
)

prompt_inputs.append(prompt)
inputs = [TextPrompt(prompt=prompt) for prompt in prompts]
if images is not None:
for i, image in enumerate(images):
inputs[i]["multi_modal_data"] = image

req_outputs = self.model.generate(prompt_inputs,
req_outputs = self.model.generate(inputs,
sampling_params=sampling_params)

outputs: List[Tuple[List[List[int]], List[str]]] = []
@@ -447,7 +450,7 @@ def generate_greedy(
self,
prompts: List[str],
max_tokens: int,
images: Optional[torch.Tensor] = None,
images: Optional[List[MultiModalData]] = None,
) -> List[Tuple[List[int], str]]:
greedy_params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
outputs = self.generate(prompts, greedy_params, images=images)