diff --git a/docs/source/conf.py b/docs/source/conf.py
index f1a7013edd33..ee0f6c53bd1b 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -92,6 +92,7 @@ def setup(app):
     "vllm._C",
     "PIL",
     "numpy",
+    "triton",
     "tqdm",
     "tensorizer",
 ]
diff --git a/docs/source/models/vlm.rst b/docs/source/models/vlm.rst
index b917688a529d..33aa8246b2e6 100644
--- a/docs/source/models/vlm.rst
+++ b/docs/source/models/vlm.rst
@@ -16,6 +16,14 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
     :prog: -m vllm.entrypoints.openai.api_server
     :nodefaultconst:
 
+.. important::
+    Currently, vLLM's support for vision language models has the following limitations:
+
+    * Only a single image input is supported per text prompt.
+    * Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``, so the model output might not exactly match the Hugging Face implementation.
+
+    We are continuously improving the user and developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.
+
 Offline Batched Inference
 -------------------------
 
@@ -31,7 +39,7 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
         image_feature_size=576,
     )
 
-For now, we only support a single image per text prompt. To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
+To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
 
 * ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
 * ``multi_modal_data``: This should be an instance of :class:`~vllm.multimodal.image.ImagePixelData` or :class:`~vllm.multimodal.image.ImageFeatureData`.
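
For reference, below is a minimal sketch of the single-image workflow that the updated `vlm.rst` describes, assuming the `LLM` constructor arguments from the existing example and the `PromptStrictInputs` fields listed in the hunks above. The checkpoint name, chat template, `image_input_type`/`image_token_id` values, and image path are illustrative placeholders, not part of this diff:

```python
# Sketch only (not part of this diff): single-image offline inference following
# the PromptStrictInputs fields documented above. Checkpoint, prompt template,
# and image path are placeholders; the VLM-specific LLM() arguments are assumed
# to follow the engine-arguments section referenced in vlm.rst.
from PIL import Image

from vllm import LLM
from vllm.multimodal.image import ImagePixelData

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",    # placeholder VLM checkpoint
    image_input_type="pixel_values",
    image_token_id=32000,
    image_input_shape="1,3,336,336",     # static shape; inputs are resized to it
    image_feature_size=576,
)

# The prompt must contain exactly `image_feature_size` <image> tokens.
prompt = "<image>" * 576 + "\nUSER: What is shown in this image?\nASSISTANT:"

# Only one image per text prompt is supported (see the limitation note above).
image = Image.open("example.jpg")

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": ImagePixelData(image),
})
print(outputs[0].outputs[0].text)
```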