Openvino 2024.2.0 #35

ilya-lavrenov · 2024-05-21T05:55:47Z

No description provided.


sgolebiewski-intel · 2024-06-10T13:44:45Z

docs/source/getting_started/openvino-installation.rst

+Installation with OpenVINO
+========================
+
+vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](../dev/models/supported_models.rst) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support. OpenVINO vLLM backend supports the following advanced vLLM features:


Suggested change

vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](../dev/models/supported_models.rst) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support. OpenVINO vLLM backend supports the following advanced vLLM features:

vLLM powered by OpenVINO supports all LLM models from

:doc:`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with at least AVX2 support. OpenVINO vLLM backend supports the following advanced vLLM features:
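The AVX2 requirement in this paragraph can be checked on a Linux host before installing. The sketch below is only an illustration, not part of the PR: it assumes a Linux-style ``/proc/cpuinfo`` file, and the helper names ``has_avx2`` and ``host_has_avx2`` are hypothetical.

```python
def has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Only inspect the CPU feature line, not e.g. 'model name'.
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def host_has_avx2(path: str = "/proc/cpuinfo") -> bool:
    """Check the local host; returns False if the file cannot be read."""
    try:
        with open(path, encoding="utf-8") as f:
            return has_avx2(f.read())
    except OSError:
        # Assumption: a missing file (non-Linux host) is treated as 'unknown'.
        return False
```

Calling ``host_has_avx2()`` on a non-Linux machine returns ``False`` because the file is missing, so a negative result there is inconclusive.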

sgolebiewski-intel · 2024-06-10T14:18:20Z

docs/source/getting_started/openvino-installation.rst

+Table of contents:
+
+#. :ref:`Requirements <openvino_backend_requirements>`
+#. :ref:`Quick start using Dockerfile <openvino_backend_quick_start_dockerfile>`
+#. :ref:`Build from source <binstall_openvino_backend_from_source>`
+#. :ref:`Performance tips <openvino_backend_performance_tips>`
+#. :ref:`Limitations <openvino_backend_limitations>`


Suggested change

Table of contents:

#. :ref:`Requirements <openvino_backend_requirements>`

#. :ref:`Quick start using Dockerfile <openvino_backend_quick_start_dockerfile>`

#. :ref:`Build from source <binstall_openvino_backend_from_source>`

#. :ref:`Performance tips <openvino_backend_performance_tips>`

#. :ref:`Limitations <openvino_backend_limitations>`

**Table of contents**:

- :ref:`Requirements <openvino_backend_requirements>`

- :ref:`Quick start using Dockerfile <openvino_backend_quick_start_dockerfile>`

- :ref:`Build from source <install_openvino_backend_from_source>`

- :ref:`Performance tips <openvino_backend_performance_tips>`

- :ref:`Limitations <openvino_backend_limitations>`

sgolebiewski-intel · 2024-06-10T14:38:47Z

docs/source/getting_started/openvino-installation.rst

+    $ sudo apt-get update  -y
+    $ sudo apt-get install python3
+
+- Second, install prerequisites vLLM OpenVINO backend installation:


Suggested change

- Second, install prerequisites vLLM OpenVINO backend installation:

- Then, install the prerequisites for vLLM OpenVINO backend installation:

sgolebiewski-intel · 2024-06-10T14:39:59Z

docs/source/getting_started/openvino-installation.rst

+    $ pip install --upgrade pip
+    $ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
+
+- Finally, install vLLM with OpenVINO backend: 


Suggested change

- Finally, install vLLM with OpenVINO backend:

- Finally, install vLLM OpenVINO backend:

sgolebiewski-intel · 2024-06-10T14:41:15Z

docs/source/getting_started/openvino-installation.rst

+Performance tips
+-----------------
+
+vLLM OpenVINO backend uses the following environment variables to control behavior:


Suggested change

vLLM OpenVINO backend uses the following environment variables to control behavior:

To control behavior in vLLM OpenVINO backend, use the following environment variables:

sgolebiewski-intel · 2024-06-10T14:47:34Z

docs/source/getting_started/openvino-installation.rst

+
+- ``VLLM_OPENVINO_KVCACHE_SPACE`` to specify the KV cache size (e.g., ``VLLM_OPENVINO_KVCACHE_SPACE=40`` means 40 GB of space for the KV cache). A larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and the user's memory management pattern.
+
+- ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform.


Suggested change

- ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform.

- ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` controls KV cache precision. By default, ``FP16`` / ``BF16`` is used, depending on platform.

sgolebiewski-intel · 2024-06-10T14:48:26Z

docs/source/getting_started/openvino-installation.rst

+
+- ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform.
+
+- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weights compression during model loading stage. By default, compression is turned off.


Suggested change

- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weights compression during model loading stage. By default, compression is turned off.

- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` enables U8 weight compression during the model loading stage. By default, compression is turned off.
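As a rough illustration of how the three environment variables discussed in these comments fit together, the sketch below builds an environment mapping that could be passed to a child process. The helper name ``openvino_env`` and its parameters are hypothetical; only the variable names and the values ``u8`` and ``ON`` come from the documentation under review.

```python
import os
from typing import Dict, Mapping, Optional


def openvino_env(
    kv_cache_gb: int,
    kv_cache_u8: bool = False,
    quantize_weights: bool = False,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with OpenVINO settings added."""
    if kv_cache_gb <= 0:
        raise ValueError("kv_cache_gb must be a positive number of gigabytes")
    env = dict(os.environ if base is None else base)
    # KV cache size in GB, e.g. 40 means 40 GB.
    env["VLLM_OPENVINO_KVCACHE_SPACE"] = str(kv_cache_gb)
    if kv_cache_u8:
        # Leave unset to keep the platform default (FP16 / BF16).
        env["VLLM_OPENVINO_CPU_KV_CACHE_PRECISION"] = "u8"
    if quantize_weights:
        # Leave unset to keep weight compression off (the default).
        env["VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS"] = "ON"
    return env
```

The returned dictionary can be handed to ``subprocess.run(..., env=env)``, which leaves the parent process environment untouched.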

sgolebiewski-intel · 2024-06-10T14:50:16Z

docs/source/getting_started/openvino-installation.rst

+
+To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (``--enable-chunked-prefill``). Based on the experiments, the recommended batch size is ``256`` (``--max-num-batched-tokens``).
+
+OpenVINO best known configuration is:


Suggested change

OpenVINO best known configuration is:

Best known configuration in OpenVINO is:
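The configuration itself is not quoted in this review comment. As a hedged sketch of the chunked prefill flags named in the diff, the snippet below assembles a command line with the recommended batch size of ``256``. The helper ``chunked_prefill_args`` and the model name ``my-org/my-model`` are placeholders, and the use of the ``vllm.entrypoints.openai.api_server`` module as the entry point is an assumption, not something this PR specifies.

```python
import shlex
from typing import List


def chunked_prefill_args(max_num_batched_tokens: int = 256) -> List[str]:
    """Return the CLI flags that enable chunked prefill with a given batch size."""
    if max_num_batched_tokens <= 0:
        raise ValueError("max_num_batched_tokens must be positive")
    return [
        "--enable-chunked-prefill",
        "--max-num-batched-tokens",
        str(max_num_batched_tokens),
    ]


# Assumed entry point and placeholder model name, for illustration only.
cmd = [
    "python3", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "my-org/my-model",
    *chunked_prefill_args(),
]
print(shlex.join(cmd))
```

Building the command as a list and joining it with ``shlex.join`` only for display avoids quoting mistakes if a path or model name contains spaces.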

sgolebiewski-intel · 2024-06-10T14:55:12Z

docs/source/getting_started/openvino-installation.rst

+Install from source
+-----------------


Suggested change

Install from source

-----------------

Install from source

-------------------

sgolebiewski-intel · 2024-06-10T14:56:21Z

docs/source/getting_started/openvino-installation.rst

+Installation with OpenVINO
+========================


Suggested change

Installation with OpenVINO

========================

Installation with OpenVINO

==========================

sgolebiewski-intel · 2024-06-11T05:29:06Z

docs/source/getting_started/openvino-installation.rst

+.. code-block:: console
+
+    $ sudo apt-get update  -y
+    $ sudo apt-get install python3


Suggested change

.. code-block:: console

$ sudo apt-get update -y

$ sudo apt-get install python3

.. code-block:: console

$ sudo apt-get update -y

$ sudo apt-get install python3

sgolebiewski-intel · 2024-06-11T05:29:28Z

docs/source/getting_started/openvino-installation.rst

+.. code-block:: console
+
+    $ pip install --upgrade pip
+    $ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu


Suggested change

.. code-block:: console

$ pip install --upgrade pip

$ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu

.. code-block:: console

$ pip install --upgrade pip

$ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu

sgolebiewski-intel · 2024-06-11T05:29:58Z

docs/source/getting_started/openvino-installation.rst

+.. code-block:: console
+
+    $ PIP_PRE=1 PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly/" VLLM_TARGET_DEVICE=openvino python -m pip install -v .


Suggested change

.. code-block:: console

$ PIP_PRE=1 PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly/" VLLM_TARGET_DEVICE=openvino python -m pip install -v .

.. code-block:: console

$ PIP_PRE=1 PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu https://storage.openvinotoolkit.org/simple/wheels/nightly/" VLLM_TARGET_DEVICE=openvino python -m pip install -v .

ilya-lavrenov and others added 30 commits April 29, 2024 16:55

Part 1

d005b1a

Copy CPU Executor files

4862986

Part 2

21b6009

Fixed setup.py

8b24681

Refactored dependencies, fixed unconditioned triton usage

1ae3567

Temporary changes

adc5bf9

Merge remote-tracking branch 'upstream/main' into openvino-2024.1.0

5c864d6

Works with latest OpenVINO

341dc68

Small improvement

58af6d4

Avoid changes in common vLLM sampling

c09ac34

Merge remote-tracking branch 'upstream/main' into openvino-2024.1.0

ccbb0ee

Self-review

e2adbf8

Added run-openvino-tests.sh

62d7e23

Fixed CUDA hardcode in tests

87740e0

Migrate to new PA interface

f8f8871

Merge remote-tracking branch 'upstream/main' into openvino-2024.1.0

b722707

OpenVINO with latest PA spec works

a231750

Small corrections

7278715

Merge remote-tracking branch 'upstream/main' into openvino-2024.1.0

ab14ec5

Supported prefix cache

2e4833d

Use generic inputs creation

bb1201a

Tested chunked_prefill

ebf8111

fix u8 kvcache

cdfacb3

Merge pull request #37 from luo-cheng2021/luocheng/pa_kv_u8

755912c

[CPU] Fix u8 kvcache for PagedAttention

Supported beam search

4cc834b

Merge remote-tracking branch 'origin/openvino-2024.2.0' into openvino…

80ee3b9

…-2024.2.0

Merge remote-tracking branch 'origin/main' into openvino-2024.2.0

a76a0ac

Merge remote-tracking branch 'upstream/main' into openvino-2024.2.0

addf11a

Fix coflicts

b6e6fff

Extra improvements

3ba1fec

ilya-lavrenov added 5 commits June 10, 2024 14:12

Partially formatted

c956ed0

Fixed spellchecker

a4e2f23

Updated docs

02e8a96

Fixed mypy

94cf94f

Fixed isort

6d7528b

ilya-lavrenov force-pushed the openvino-2024.2.0 branch from 46e6d30 to ddfca9b Compare June 10, 2024 12:58

Extra fixes

0a35dc6

ilya-lavrenov force-pushed the openvino-2024.2.0 branch from ddfca9b to 0a35dc6 Compare June 10, 2024 13:02

Fixed yapf

315d639

ilya-lavrenov force-pushed the openvino-2024.2.0 branch from 0c23762 to 315d639 Compare June 10, 2024 13:13

sgolebiewski-intel reviewed Jun 10, 2024


sgolebiewski-intel reviewed Jun 11, 2024


ilya-lavrenov closed this Jul 31, 2024

ilya-lavrenov deleted the openvino-2024.2.0 branch July 31, 2024 15:35


