Openvino 2024.2.0 #35
Changes from all commits
d005b1a
4862986
21b6009
8b24681
1ae3567
adc5bf9
5c864d6
341dc68
58af6d4
c09ac34
ccbb0ee
e2adbf8
62d7e23
87740e0
f8f8871
b722707
a231750
7278715
ab14ec5
2e4833d
bb1201a
ebf8111
cdfacb3
755912c
4cc834b
80ee3b9
a76a0ac
addf11a
b6e6fff
3ba1fec
c956ed0
a4e2f23
02e8a96
94cf94f
6d7528b
0a35dc6
315d639
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# This script builds the OpenVINO docker image and runs the offline inference inside the container. | ||
# It serves as a sanity check for compilation and basic model usage. | ||
set -ex | ||
|
||
# Try building the docker image | ||
docker build -t openvino-test -f Dockerfile.openvino . | ||
|
||
# Setup cleanup | ||
remove_docker_container() { docker rm -f openvino-test || true; } | ||
trap remove_docker_container EXIT | ||
remove_docker_container | ||
|
||
# Run the image and launch offline inference | ||
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/vllm/examples/offline_inference.py |
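For readers who drive this check from Python rather than a shell, the `docker run` invocation above can be assembled as an argument list. This is a minimal sketch, not part of the PR: the helper name `docker_run_command` is hypothetical, and only the flags already present in the script are used.

```python
import shlex
from typing import Dict, List


def docker_run_command(image: str, name: str, env: Dict[str, str],
                       command: List[str]) -> List[str]:
    """Assemble a `docker run` argv like the one used in the script above."""
    argv = ["docker", "run", "--network", "host"]
    for key, value in env.items():
        # One --env flag per variable, in KEY=VALUE form.
        argv += ["--env", f"{key}={value}"]
    argv += ["--name", name, image]
    return argv + command


cmd = docker_run_command(
    image="openvino-test",
    name="openvino-test",
    env={"VLLM_OPENVINO_KVCACHE_SPACE": "1"},
    command=["python3", "/workspace/vllm/examples/offline_inference.py"],
)
# Print the command as a shell-quoted string for inspection.
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` so that a failing container fails the check, which mirrors the effect of `set -e` in the script.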
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# The vLLM Dockerfile is used to construct a vLLM image that can be directly used | ||
# to run the OpenAI-compatible server. | ||
|
||
FROM ubuntu:22.04 AS dev | ||
|
||
RUN apt-get update -y && \ | ||
apt-get install -y python3-pip git | ||
WORKDIR /workspace | ||
|
||
# copy requirements | ||
COPY requirements-build.txt /workspace/vllm/ | ||
COPY requirements-common.txt /workspace/vllm/ | ||
COPY requirements-openvino.txt /workspace/vllm/ | ||
|
||
COPY vllm/ /workspace/vllm/vllm | ||
COPY setup.py /workspace/vllm/ | ||
|
||
# install build requirements | ||
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt | ||
# build vLLM with OpenVINO backend | ||
RUN PIP_PRE=1 PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly/" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace/vllm/ | ||
|
||
COPY examples/ /workspace/vllm/examples | ||
COPY benchmarks/ /workspace/vllm/benchmarks | ||
|
||
CMD ["/bin/bash"] |
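The build step above puts two index URLs into a single `PIP_EXTRA_INDEX_URL` value. pip accepts several values for such an option from an environment variable when they are separated by spaces. The sketch below only illustrates that splitting and the equivalent command-line flags; `split_index_urls` and `as_cli_flags` are hypothetical helpers, not pip's own code.

```python
from typing import List


def split_index_urls(value: str) -> List[str]:
    """Split a space-separated PIP_EXTRA_INDEX_URL value into single URLs."""
    return value.split()


def as_cli_flags(urls: List[str]) -> List[str]:
    """Express the same URLs as repeated --extra-index-url flags."""
    flags: List[str] = []
    for url in urls:
        flags += ["--extra-index-url", url]
    return flags


extra = ("https://download.pytorch.org/whl/cpu "
         "https://storage.openvinotoolkit.org/simple/wheels/nightly/")
urls = split_index_urls(extra)
flags = as_cli_flags(urls)
for url in urls:
    print(url)
```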
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,95 @@ | ||||||||||||||||||||||||||||||
.. _installation_openvino: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Installation with OpenVINO | ||||||||||||||||||||||||||||||
========================== | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
vLLM powered by OpenVINO supports all LLMs from the :doc:`vLLM supported models list <../dev/models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with at least AVX2 support. The OpenVINO vLLM backend supports the following advanced vLLM features: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- Prefix caching (``--enable-prefix-caching``) | ||||||||||||||||||||||||||||||
- Chunked prefill (``--enable-chunked-prefill``) | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Table of contents: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
#. :ref:`Requirements <openvino_backend_requirements>` | ||||||||||||||||||||||||||||||
#. :ref:`Quick start using Dockerfile <openvino_backend_quick_start_dockerfile>` | ||||||||||||||||||||||||||||||
#. :ref:`Install from source <install_openvino_backend_from_source>` | ||||||||||||||||||||||||||||||
#. :ref:`Performance tips <openvino_backend_performance_tips>` | ||||||||||||||||||||||||||||||
#. :ref:`Limitations <openvino_backend_limitations>` | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
.. _openvino_backend_requirements: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Requirements | ||||||||||||||||||||||||||||||
------------ | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
* OS: Linux | ||||||||||||||||||||||||||||||
* Instruction set architecture (ISA) requirement: at least AVX2. | ||||||||||||||||||||||||||||||
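One way to check the AVX2 requirement on a Linux host is to look for the `avx2` flag in `/proc/cpuinfo`. The following is a minimal sketch under that assumption; the helper names are hypothetical and this check is not part of vLLM or OpenVINO.

```python
from pathlib import Path


def has_cpu_flag(cpuinfo_text: str, flag: str = "avx2") -> bool:
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists `flag`."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Compare whole tokens so that "avx" does not match "avx2".
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


def host_supports_avx2(path: str = "/proc/cpuinfo") -> bool:
    """Check the current host; returns False if the file cannot be read."""
    try:
        return has_cpu_flag(Path(path).read_text())
    except OSError:
        return False


if __name__ == "__main__":
    print("AVX2 supported:", host_supports_avx2())
```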
|
||||||||||||||||||||||||||||||
.. _openvino_backend_quick_start_dockerfile: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Quick start using Dockerfile | ||||||||||||||||||||||||||||||
---------------------------- | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
.. code-block:: console | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
    $ docker build -f Dockerfile.openvino -t vllm-openvino-env . | ||||||||||||||||||||||||||||||
    $ docker run -it --rm vllm-openvino-env | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
.. _install_openvino_backend_from_source: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Install from source | ||||||||||||||||||||||||||||||
------------------- | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- First, install Python. For example, on Ubuntu 22.04, you can run: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
.. code-block:: console | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
    $ sudo apt-get update -y | ||||||||||||||||||||||||||||||
    $ sudo apt-get install python3 | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- Second, install the prerequisites for the vLLM OpenVINO backend installation: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
.. code-block:: console | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
    $ pip install --upgrade pip | ||||||||||||||||||||||||||||||
    $ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- Finally, install vLLM with OpenVINO backend: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
.. code-block:: console | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
    $ PIP_PRE=1 PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly/" VLLM_TARGET_DEVICE=openvino python3 -m pip install -v . | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
.. _openvino_backend_performance_tips: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Performance tips | ||||||||||||||||||||||||||||||
----------------- | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
vLLM OpenVINO backend uses the following environment variables to control behavior: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- ``VLLM_OPENVINO_KVCACHE_SPACE`` to specify the KV cache size (e.g., ``VLLM_OPENVINO_KVCACHE_SPACE=40`` means 40 GB of space for the KV cache). A larger setting allows vLLM to run more requests in parallel. Set this parameter based on the hardware configuration and the user's memory management pattern. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weight compression during the model loading stage. By default, compression is turned off. | ||||||||||||||||||||||||||||||
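The three variables above can be read into a plain settings dictionary as sketched below. This is an illustration only, not the backend's actual parsing code. It rests on three assumptions: "GB" is taken as 2**30 bytes, an unset `VLLM_OPENVINO_KVCACHE_SPACE` is treated as 0, and `None` stands for the platform-default precision. The function name `read_openvino_settings` is hypothetical.

```python
import os
from typing import Mapping

GIB = 1 << 30  # assumption: "GB" in the text is interpreted as 2**30 bytes


def read_openvino_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the three documented variables into a plain dict."""
    # Assumption: an unset cache size is treated as 0 in this sketch.
    space_gb = int(env.get("VLLM_OPENVINO_KVCACHE_SPACE", "0"))
    # None means "use the platform default" (FP16 or BF16 per the text).
    precision = env.get("VLLM_OPENVINO_CPU_KV_CACHE_PRECISION")
    quantized = env.get("VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS", "").upper() == "ON"
    return {
        "kv_cache_bytes": space_gb * GIB,
        "kv_cache_precision": precision,
        "quantized_weights": quantized,
    }


settings = read_openvino_settings({
    "VLLM_OPENVINO_KVCACHE_SPACE": "40",
    "VLLM_OPENVINO_CPU_KV_CACHE_PRECISION": "u8",
    "VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS": "ON",
})
print(settings)
```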
|
||||||||||||||||||||||||||||||
To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (``--enable-chunked-prefill``). Based on experiments, the recommended batch size is ``256`` (``--max-num-batched-tokens``). | ||||||||||||||||||||||||||||||
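To give a rough feel for what a token budget of 256 means, the sketch below splits one prompt into per-step prefill chunks. It is a deliberate simplification: the real scheduler also shares the budget with decode requests, which this ignores. The function name `prefill_chunks` is hypothetical.

```python
from typing import List


def prefill_chunks(prompt_len: int, max_num_batched_tokens: int = 256) -> List[int]:
    """Split a prompt of `prompt_len` tokens into per-step chunk sizes."""
    if prompt_len < 0 or max_num_batched_tokens <= 0:
        raise ValueError("prompt_len must be >= 0 and the budget must be > 0")
    chunks: List[int] = []
    remaining = prompt_len
    while remaining > 0:
        # Each step processes at most `max_num_batched_tokens` prompt tokens.
        step = min(remaining, max_num_batched_tokens)
        chunks.append(step)
        remaining -= step
    return chunks


print(prefill_chunks(600))  # [256, 256, 88]
```

A smaller budget shortens each step, which helps the latency of requests that are already decoding, at the cost of more steps per prompt.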
|
||||||||||||||||||||||||||||||
The best known configuration for OpenVINO is: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
.. code-block:: console | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
    $ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \ | ||||||||||||||||||||||||||||||
        python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256 | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
.. _openvino_backend_limitations: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Limitations | ||||||||||||||||||||||||||||||
----------------- | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- LoRA serving is not supported. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- Only LLM models are currently supported. LLaVA and encoder-decoder models are not currently enabled in vLLM OpenVINO integration. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- Tensor and pipeline parallelism are not currently enabled in vLLM integration. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
- Speculative sampling is not tested within vLLM integration. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Common dependencies | ||
-r requirements-common.txt | ||
|
||
# OpenVINO dependencies | ||
torch >= 2.1.2 | ||
openvino ~= 2024.3.0.dev | ||
optimum-intel[openvino] >= 1.17.2 | ||
|
||
triton >= 2.2.0 # FIXME(woosuk): This is a hack to avoid import error. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
from dataclasses import dataclass | ||
from typing import List, Optional, Tuple | ||
|
||
import openvino as ov | ||
import torch | ||
|
||
from vllm.attention.backends.abstract import (AttentionBackend, | ||
AttentionMetadata) | ||
|
||
|
||
class OpenVINOAttentionBackend(AttentionBackend): | ||
|
||
    @staticmethod | ||
    def get_name() -> str: | ||
        return "openvino" | ||
|
||
    @staticmethod | ||
    def get_impl_cls(): | ||
        # OpenVINO implements PagedAttention as part of the Optimum | ||
        # exported model | ||
        raise NotImplementedError | ||
|
||
    @staticmethod | ||
    def make_metadata(*args, **kwargs) -> "OpenVINOAttentionMetadata": | ||
        return OpenVINOAttentionMetadata(*args, **kwargs) | ||
|
||
    @staticmethod | ||
    def get_kv_cache_shape( | ||
        num_blocks: int, | ||
        block_size: int, | ||
        num_kv_heads: int, | ||
        head_size: int, | ||
    ) -> Tuple[int, ...]: | ||
        return (2, num_blocks, num_kv_heads, block_size, head_size) | ||
|
||
    @staticmethod | ||
    def swap_blocks( | ||
        src_kv_cache: ov.Tensor, | ||
        dst_kv_cache: ov.Tensor, | ||
        src_to_dst: torch.Tensor, | ||
    ) -> None: | ||
        # OpenVINO currently supports only CPU, which does not require | ||
        # swap of KV cache blocks | ||
        raise NotImplementedError | ||
|
||
    @staticmethod | ||
    def copy_blocks( | ||
        kv_caches: List[Tuple[ov.Tensor, ov.Tensor]], | ||
        src_to_dists: List[Tuple[int, int]], | ||
    ) -> None: | ||
        for src, dst in src_to_dists: | ||
            for key_cache, value_cache in kv_caches: | ||
                key_cache.data[dst, :] = key_cache.data[src, :] | ||
                value_cache.data[dst, :] = value_cache.data[src, :] | ||
|
||
|
||
@dataclass | ||
class OpenVINOAttentionMetadata(AttentionMetadata): | ||
    """Metadata for OpenVINOAttentionBackend. | ||
    """ | ||
    past_lens: torch.Tensor | ||
    subsequence_begins: torch.Tensor | ||
    block_indices: torch.Tensor | ||
    block_indices_begins: torch.Tensor | ||
    max_context_len: torch.Tensor | ||
|
||
    @property | ||
    def prefill_metadata(self) -> Optional["AttentionMetadata"]: | ||
        """Return the attention metadata that's required to run prefill | ||
        attention.""" | ||
        pass | ||
|
||
    @property | ||
    def decode_metadata(self) -> Optional["AttentionMetadata"]: | ||
        """Return the attention metadata that's required to run decode | ||
        attention.""" | ||
        pass |
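The cache layout and block copy in the file above can be tried out without OpenVINO or torch. The sketch below reuses the same shape tuple as `get_kv_cache_shape` and mimics `copy_blocks` on nested Python lists instead of `ov.Tensor` objects. The byte estimate ignores any extra metadata a quantized cache may carry, and the dimensions in the example are hypothetical values chosen for illustration.

```python
from typing import List, Sequence, Tuple


def kv_cache_shape(num_blocks: int, block_size: int, num_kv_heads: int,
                   head_size: int) -> Tuple[int, ...]:
    """Same tuple as get_kv_cache_shape above; the leading 2 is key + value."""
    return (2, num_blocks, num_kv_heads, block_size, head_size)


def kv_cache_bytes(shape: Sequence[int], bytes_per_element: int) -> int:
    """Estimate raw storage for one layer's cache with the given element size."""
    total = 1
    for dim in shape:
        total *= dim
    return total * bytes_per_element


def copy_blocks(kv_caches: List[Tuple[list, list]],
                src_to_dsts: List[Tuple[int, int]]) -> None:
    """List-based stand-in for copy_blocks: copy block `src` over block `dst`."""
    for src, dst in src_to_dsts:
        for key_cache, value_cache in kv_caches:
            key_cache[dst] = list(key_cache[src])
            value_cache[dst] = list(value_cache[src])


# Hypothetical dimensions, chosen only for illustration.
shape = kv_cache_shape(num_blocks=1024, block_size=16, num_kv_heads=32, head_size=128)
fp16_bytes = kv_cache_bytes(shape, bytes_per_element=2)
u8_bytes = kv_cache_bytes(shape, bytes_per_element=1)
print(shape, fp16_bytes, u8_bytes)

# Three blocks of two elements each; copy block 0 over block 2.
keys = [[1, 1], [2, 2], [3, 3]]
values = [[10, 10], [20, 20], [30, 30]]
copy_blocks([(keys, values)], [(0, 2)])
print(keys, values)
```

With these example dimensions, the u8 cache needs half the bytes of the FP16 one, which fits the use of `VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8` in the best known configuration.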