diff --git a/docs/articles_en/about-openvino/performance-benchmarks.rst b/docs/articles_en/about-openvino/performance-benchmarks.rst index 40b94210f6c43d..ed9d39aaf8b9e6 100644 --- a/docs/articles_en/about-openvino/performance-benchmarks.rst +++ b/docs/articles_en/about-openvino/performance-benchmarks.rst @@ -16,14 +16,12 @@ Performance Benchmarks Getting Performance Numbers -This page presents benchmark results for +This page presents benchmark results for the `Intel® Distribution of OpenVINO™ toolkit `__ and :doc:`OpenVINO Model Server <../openvino-workflow/model-server/ovms_what_is_openvino_model_server>`, for a representative selection of public neural networks and Intel® devices. The results may help you decide which hardware to use in your applications or plan AI workload for the hardware you have already implemented in your solutions. Click the buttons below to see the chosen benchmark data. -For a more detailed view of performance numbers for generative AI models, check the -:doc:`Generative AI Benchmark Results <./performance-benchmarks/generative-ai-performance>` .. grid:: 1 1 2 2 :gutter: 4 @@ -36,7 +34,7 @@ For a more detailed view of performance numbers for generative AI models, check :outline: :expand: - :material-regular:`bar_chart;1.4em` OpenVINO Benchmark Graphs + :material-regular:`bar_chart;1.4em` OpenVINO Benchmark Graphs (general) .. grid-item:: @@ -46,10 +44,35 @@ For a more detailed view of performance numbers for generative AI models, check :outline: :expand: - :material-regular:`bar_chart;1.4em` OVMS Benchmark Graphs + :material-regular:`bar_chart;1.4em` OVMS Benchmark Graphs (general) + + .. grid-item:: + + .. button-link:: ./performance-benchmarks/generative-ai-performance.html + :class: ov-toolkit-benchmark-genai + :color: primary + :outline: + :expand: + + :material-regular:`table_view;1.4em` LLM performance for AI PC + + .. grid-item:: + + .. button-link:: # + :class: ovms-toolkit-benchmark-llm + :color: primary + :outline: + :expand: + + :material-regular:`bar_chart;1.4em` OVMS for GenAI (coming soon) + + + + -Key performance indicators and workload parameters. + +**Key performance indicators and workload parameters** .. tab-set:: @@ -65,13 +88,13 @@ Key performance indicators and workload parameters. .. tab-item:: Latency :sync: latency - For Vision and NLP models this mhis measures the synchronous execution of inference requests and is reported in - milliseconds. Each inference request (for example: preprocess, infer, postprocess) is - allowed to complete before the next is started. This performance metric is relevant in - usage scenarios where a single image input needs to be acted upon as soon as possible. An - example would be the healthcare sector where medical personnel only request analysis of a - single ultra sound scanning image or in real-time or near real-time applications for - example an industrial robot's response to actions in its environment or obstacle avoidance + For Vision and NLP models this measures the synchronous execution of inference requests and + is reported in milliseconds. Each inference request (for example: preprocess, infer, + postprocess) is allowed to complete before the next one starts. This performance metric is + relevant in usage scenarios where a single image input needs to be acted upon as soon as + possible. 
An example would be the healthcare sector, where medical personnel request
+      analysis of a single ultrasound scan, or real-time and near real-time applications,
+      such as an industrial robot's response to actions in its environment or obstacle avoidance
       for autonomous vehicles. For Transformer models like Stable-Diffusion this measures the time
       it takes to convert the prompt or input text into a finished image. It is presented in seconds.

@@ -97,9 +120,10 @@ Key performance indicators and workload parameters.

   * input token length: 1024 (the tokens for GenAI models are in English).

-.. raw:: html
+**Platforms, Configurations, Methodology**

-   <h3>Platforms, Configurations, Methodology</h3>
-
+To see the methodology used to obtain the numbers and learn how to test performance yourself, +see the guide on :doc:`getting performance numbers `. For a listing of all platforms and configurations used for testing, refer to the following: @@ -130,59 +154,10 @@ For a listing of all platforms and configurations used for testing, refer to the :material-regular:`download;1.5em` Click for Performance Data [XLSX] -The OpenVINO benchmark setup includes a single system with OpenVINO™, as well as the benchmark -application installed. It measures the time spent on actual inference (excluding any pre or post -processing) and then reports on the inferences per second (or Frames Per Second). - -OpenVINO™ Model Server (OVMS) employs the Intel® Distribution of OpenVINO™ toolkit runtime -libraries and exposes a set of models via a convenient inference API over gRPC or HTTP/REST. -Its benchmark results are measured with the configuration of multiple-clients-single-server, -using two hardware platforms connected by ethernet. Network bandwidth depends on both platforms -and models used. It is set not to be a bottleneck for workload intensity. The connection is -dedicated only to measuring performance. - -.. dropdown:: See more details about OVMS benchmark setup - - The benchmark setup for OVMS consists of four main parts: - .. image:: ../assets/images/performance_benchmarks_ovms_02.png - :alt: OVMS Benchmark Setup Diagram - * **OpenVINO™ Model Server** is launched as a docker container on the server platform and it - listens to (and answers) requests from clients. OpenVINO™ Model Server is run on the same - system as the OpenVINO™ toolkit benchmark application in corresponding benchmarking. Models - served by OpenVINO™ Model Server are located in a local file system mounted into the docker - container. The OpenVINO™ Model Server instance communicates with other components via ports - over a dedicated docker network. - * **Clients** are run in separated physical machine referred to as client platform. Clients - are implemented in Python3 programming language based on TensorFlow* API and they work as - parallel processes. Each client waits for a response from OpenVINO™ Model Server before it - will send a new next request. The role played by the clients is also verification of - responses. - - * **Load balancer** works on the client platform in a docker container. HAProxy is used for - this purpose. Its main role is counting of requests forwarded from clients to OpenVINO™ - Model Server, estimating its latency, and sharing this information by Prometheus service. - The reason of locating the load balancer on the client site is to simulate real life - scenario that includes impact of physical network on reported metrics. - - * **Execution Controller** is launched on the client platform. It is responsible for - synchronization of the whole measurement process, downloading metrics from the load - balancer, and presenting the final report of the execution. - - - -.. raw:: html - -

-   <h3>Test performance yourself</h3>
-
-You can also test performance for your system yourself, following the guide on
-:doc:`getting performance numbers `.
-
-.. raw:: html
-
-   <h3>Disclaimers</h3>
-
+**Disclaimers** * Intel® Distribution of OpenVINO™ toolkit performance results are based on release 2024.3, as of July 31, 2024. @@ -192,12 +167,11 @@ You can also test performance for your system yourself, following the guide on The results may not reflect all publicly available updates. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service -activation. Learn more at intel.com, or from the OEM or retailer. +activation. Learn more at intel.com, the OEM, or retailer. See configuration disclosure for details. No product can be absolutely secure. Performance varies by use, configuration and other factors. Learn more at `www.intel.com/PerformanceIndex `__. -Your costs and results may vary. Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products. @@ -205,9 +179,6 @@ for non-Intel products. - - - .. raw:: html diff --git a/docs/articles_en/about-openvino/performance-benchmarks/generative-ai-performance.rst b/docs/articles_en/about-openvino/performance-benchmarks/generative-ai-performance.rst index 35e09f91f72b9c..39b27d12c970fd 100644 --- a/docs/articles_en/about-openvino/performance-benchmarks/generative-ai-performance.rst +++ b/docs/articles_en/about-openvino/performance-benchmarks/generative-ai-performance.rst @@ -4,7 +4,7 @@ Most Efficient Large Language Models for AI PC This page is regularly updated to help you identify the best-performing LLMs on the Intel® Core™ Ultra processor family and AI PCs. -The tables below list the key performance indicators for a selection of Large Language Models, +The tables below list key performance indicators for a selection of Large Language Models, running on an Intel® Core™ Ultra 7-165H based system, on built-in GPUs. @@ -23,24 +23,34 @@ running on an Intel® Core™ Ultra 7-165H based system, on built-in GPUs. :class: modeldata stripe :name: supportedModelsTableOv :header-rows: 1 - :file: ../../_static/download/llm_models.csv + :file: ../../_static/benchmarks_files/llm_models.csv -For complete information on the system config, see: -`Hardware Platforms [PDF] `__ - -To view the data in an editable form, you can download the .csv file here: - .. grid:: 1 1 2 2 :gutter: 4 .. grid-item:: - .. button-link:: ../../_static/download/llm_models.csv + All models listed here were tested with the following parameters: + + * Framework: PyTorch + * Model precision: INT4 + * Beam: 1 + * Batch size: 1 + + .. grid-item:: + + .. button-link:: https://docs.openvino.ai/2024/_static/benchmarks_files/OV-2024.4-platform_list.pdf :color: primary :outline: :expand: - :material-regular:`download;1.5em` Click for OpenVINO LLM results [CSV] + :material-regular:`download;1.5em` Get full system info [PDF] + + .. button-link:: ../../_static/benchmarks_files/llm_models.csv + :color: primary + :outline: + :expand: + :material-regular:`download;1.5em` Get the data in .csv [CSV] diff --git a/docs/articles_en/about-openvino/performance-benchmarks/getting-performance-numbers.rst b/docs/articles_en/about-openvino/performance-benchmarks/getting-performance-numbers.rst index 069c940063cf14..936f1145a6b3b0 100644 --- a/docs/articles_en/about-openvino/performance-benchmarks/getting-performance-numbers.rst +++ b/docs/articles_en/about-openvino/performance-benchmarks/getting-performance-numbers.rst @@ -1,124 +1,201 @@ Getting Performance Numbers =========================== +1. `Benchmarking methodology for OpenVINO <#benchmarking-methodology-for-openvino>`__ + a. 
`OpenVINO benchmarking (general) <#openvino-benchmarking--general->`__ + b. `OpenVINO Model Server benchmarking (general) <#openvino-model-server-benchmarking--general->`__ + c. `OpenVINO Model Server benchmarking (LLM) <#openvino-model-server-benchmarking--llm->`__ -This guide explains how to use the benchmark_app to get performance numbers. It also explains how the performance -numbers are reflected through internal inference performance counters and execution graphs. It also includes -information on using ITT and Intel® VTune™ Profiler to get performance insights. +2. `How to obtain benchmark results <#how-to-obtain-benchmark-results>`__ + a. `General considerations <#general-considerations>`__ + b. `OpenVINO benchmarking (general) <#openvino-benchmarking--general->`__ + c. `OpenVINO benchmarking (LLM) <#openvino-benchmarking--llm->`__ -.. raw:: html -

-   <h2>Test performance with the benchmark_app</h2>
-
+Benchmarking methodology for OpenVINO +############################################################################################### -You can run OpenVINO benchmarks in both C++ and Python APIs, yet the experience differs in each case. -The Python one is part of OpenVINO Runtime installation, while C++ is available as a code sample. -For a detailed description, see: :doc:`benchmark_app <../../learn-openvino/openvino-samples/benchmark-tool>`. +OpenVINO benchmarking (general) +++++++++++++++++++++++++++++++++++++++++++++ -Make sure to install the latest release package with support for frameworks of the models you want to test. -For the most reliable performance benchmarks, :doc:`prepare the model for use with OpenVINO <../../openvino-workflow/model-preparation>`. +The OpenVINO benchmark setup includes a single system with OpenVINO™, as well as the benchmark +application installed. It measures the time spent on actual inference (excluding any pre or post +processing) and then reports on the inferences per second (or Frames Per Second). +OpenVINO Model Server benchmarking (general) +++++++++++++++++++++++++++++++++++++++++++++ -.. raw:: html +OpenVINO™ Model Server (OVMS) employs the Intel® Distribution of OpenVINO™ toolkit runtime +libraries and exposes a set of models via a convenient inference API over gRPC or HTTP/REST. +Its benchmark results are measured with the configuration of multiple-clients-single-server, +using two hardware platforms connected by ethernet. Network bandwidth depends on both platforms +and models used. It is set not to be a bottleneck for workload intensity. The connection is +dedicated only to measuring performance. -

-   <h3>Running the benchmark application</h3>
-
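+For illustration only, a minimal single-node OVMS setup resembles the sketch below; the model
+name, directories, and ports are placeholders rather than the configuration used for the
+published results:
+
+.. code-block:: sh
+
+   # Start OpenVINO Model Server in Docker, serving one model over gRPC (9000) and REST (8000).
+   docker run -d --rm -v $(pwd)/models:/models -p 9000:9000 -p 8000:8000 \
+      openvino/model_server:latest \
+      --model_name resnet50 --model_path /models/resnet50 \
+      --port 9000 --rest_port 8000
+
+The clients and the load balancer described in the dropdown below then send requests to these
+endpoints from a separate machine.
+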

+.. dropdown:: See more details about OVMS benchmark setup + The benchmark setup for OVMS consists of four main parts: -The benchmark_app includes a lot of device-specific options, but the primary usage is as simple as: + .. image:: ../../assets/images/performance_benchmarks_ovms_02.png + :alt: OVMS Benchmark Setup Diagram -.. code-block:: sh + * **OpenVINO™ Model Server** is launched as a docker container on the server platform and it + listens to (and answers) requests from clients. OpenVINO™ Model Server is run on the same + system as the OpenVINO™ toolkit benchmark application in corresponding benchmarking. Models + served by OpenVINO™ Model Server are located in a local file system mounted into the docker + container. The OpenVINO™ Model Server instance communicates with other components via ports + over a dedicated docker network. - benchmark_app -m -d -i + * **Clients** are run in separated physical machine referred to as client platform. Clients + are implemented in Python3 programming language based on TensorFlow* API and they work as + parallel processes. Each client waits for a response from OpenVINO™ Model Server before it + will send a new next request. The role played by the clients is also verification of + responses. + * **Load balancer** works on the client platform in a docker container. HAProxy is used for + this purpose. Its main role is counting of requests forwarded from clients to OpenVINO™ + Model Server, estimating its latency, and sharing this information by Prometheus service. + The reason of locating the load balancer on the client site is to simulate real life + scenario that includes impact of physical network on reported metrics. -Each of the :doc:`OpenVINO supported devices <../compatibility-and-support/supported-devices>` offers -performance settings that contain command-line equivalents in the Benchmark app. + * **Execution Controller** is launched on the client platform. It is responsible for + synchronization of the whole measurement process, downloading metrics from the load + balancer, and presenting the final report of the execution. -While these settings provide really low-level control for the optimal model performance on the *specific* device, -it is recommended to always start performance evaluation with the :doc:`OpenVINO High-Level Performance Hints <../../openvino-workflow/running-inference/optimize-inference/high-level-performance-hints>` first, like so: -.. code-block:: sh +OpenVINO Model Server benchmarking (LLM) +++++++++++++++++++++++++++++++++++++++++ - # for throughput prioritization - benchmark_app -hint tput -m -d - # for latency prioritization - benchmark_app -hint latency -m -d +In the benchmarking results presented here, the load from clients is simulated using the +benchmark_serving.py script from vLLM and the ShareGPT dataset. It represents real life usage +scenarios. Both OpenVINO Model Server and vLLM expose OpenAI-compatible REST endpoints so the +methodology is identical. +In the experiments, we change the average request rate to identify the tradeoff between total +throughput and the TPOT latency. +Note that in the benchmarking, the feature of prefix_caching is not used. -.. raw:: html -

-   <h3>Additional benchmarking considerations</h3>
-
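+For illustration only, a client load similar to the one described above can be generated with a
+command along these lines. The flags, model, and dataset path are placeholders and may differ
+between vLLM versions:
+
+.. code-block:: sh
+
+   # Sweep a fixed average request rate against an OpenAI-compatible endpoint.
+   python benchmark_serving.py --backend openai --host localhost --port 8000 \
+      --model meta-llama/Llama-2-7b-chat-hf \
+      --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
+      --request-rate 1.0 --num-prompts 500
+
+The script then reports throughput and latency statistics, including TPOT, for each tested
+request rate.
+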

-.. raw:: html +How to obtain benchmark results +############################################################################################### -

-   <h4>1 - Select a Proper Set of Operations to Measure</h4>
-
+General considerations +++++++++++++++++++++++ +.. dropdown:: Select a proper set of operations to measure -When evaluating performance of a model with OpenVINO Runtime, it is required to measure a proper set of operations. + When evaluating performance of a model with OpenVINO Runtime, it is required to measure a + proper set of operations. -- Avoid including one-time costs such as model loading. -- Track operations that occur outside OpenVINO Runtime (such as video decoding) separately. + * Avoid including one-time costs such as model loading. + * Track operations that occur outside OpenVINO Runtime, such as video decoding, separately. + .. note:: -.. note:: + Some image pre-processing can be baked into OpenVINO IR and accelerated accordingly. + For more information, refer to + :doc:`Embedding Pre-processing <../../documentation/legacy-features/transition-legacy-conversion-api/legacy-conversion-api/[legacy]-embedding-preprocessing-computation>` + and + :doc:`General Runtime Optimizations <../../openvino-workflow/running-inference/optimize-inference/general-optimizations>`. - Some image pre-processing can be baked into OpenVINO IR and accelerated accordingly. For more information, - refer to :doc:`Embedding Pre-processing <../../documentation/legacy-features/transition-legacy-conversion-api/legacy-conversion-api/[legacy]-embedding-preprocessing-computation>` and - :doc:`General Runtime Optimizations <../../openvino-workflow/running-inference/optimize-inference/general-optimizations>`. +.. dropdown:: Maximize the chance to obtain credible data + Performance conclusions should be build on reproducible data. As for the performance + measurements, they should be done with a large number of invocations of the same routine. + Since the first iteration is almost always significantly slower than the subsequent ones, + an aggregated value can be used for the execution time for final projections: + * If the warm-up run does not help or execution times still vary, you can try running a + large number of iterations and then use the mean value of the results. + * If time values differ too much, consider using a geomean. + * Be aware of potential power-related irregularities, such as throttling. A device may assume + one of several different power states, so it is advisable to fix its frequency when + optimizing, for better performance data reproducibility. + * Note that end-to-end application benchmarking should also be performed under real + operational conditions. -.. raw:: html +.. dropdown:: Compare performance with native/framework code -

-   <h4>2 - Try to Get Credible Data</h4>
-
+ When comparing OpenVINO Runtime performance with the framework or reference code, + make sure that both versions are as similar as possible: -Performance conclusions should be build upon reproducible data. As for the performance measurements, they should -be done with a large number of invocations of the same routine. Since the first iteration is almost always significantly -slower than the subsequent ones, an aggregated value can be used for the execution time for final projections: + * Wrap the exact inference execution (for examples, see :doc:`Benchmark app <../../learn-openvino/openvino-samples/benchmark-tool>`). + * Do not include model loading time. + * Ensure that the inputs are identical for OpenVINO Runtime and the framework. For example, watch out for random values that can be used to populate the inputs. + * In situations when any user-side pre-processing should be tracked separately, consider :doc:`image pre-processing and conversion <../../openvino-workflow/running-inference/optimize-inference/optimize-preprocessing>`. + * When applicable, leverage the :doc:`Dynamic Shapes support <../../openvino-workflow/running-inference/dynamic-shapes>`. + * If possible, demand the same accuracy. For example, TensorFlow allows ``FP16`` execution, so when comparing to that, make sure to test the OpenVINO Runtime with the ``FP16`` as well. -- If the warm-up run does not help or execution time still varies, you can try running a large number of iterations - and then average or find a mean of the results. -- If the time values range too much, consider geomean. -- Be aware of the throttling and other power oddities. A device can exist in one of several different power states. - When optimizing your model, consider fixing the device frequency for better performance data reproducibility. - However, the end-to-end (application) benchmarking should also be performed under real operational conditions. +.. dropdown:: Make sure the benchmarking setup is proper for the selected scenario + * Install the latest release package supporting the frameworks of the tested models. + * For the most reliable performance benchmarks, + :doc:`prepare the model for use with OpenVINO <../../openvino-workflow/model-preparation>`. + * For testing generative AI models, make sure you select the method that best suits your case, + Optimum-Intel or the OpenVINO GenAI package. -.. raw:: html -

-   <h4>3 - Compare Performance with Native/Framework Code</h4>
-
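+Putting these recommendations together, a minimal environment for a benchmarking run could look
+like the following sketch (package names, model, and paths are illustrative only):
+
+.. code-block:: sh
+
+   # Fresh environment with the latest OpenVINO release.
+   python -m venv ov_bench && source ov_bench/bin/activate
+   pip install --upgrade openvino
+
+   # Convert the tested model to OpenVINO IR before measuring performance.
+   ovc my_model.onnx --output_model my_model.xml
+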

+OpenVINO benchmarking (general)
++++++++++++++++++++++++++++++++

-When comparing the OpenVINO Runtime performance with the framework or another reference code, make sure that both versions are as similar as possible:

+The default way of measuring OpenVINO performance is running a piece of code, referred to as
+:doc:`the benchmark tool <../../learn-openvino/openvino-samples/benchmark-tool>`.
+For Python, it is part of the OpenVINO Runtime installation, while for C++, it is available as
+a code sample.

-- Wrap the exact inference execution (for examples, see :doc:`Benchmark app <../../learn-openvino/openvino-samples/benchmark-tool>`).
-- Do not include model loading time.
-- Ensure that the inputs are identical for OpenVINO Runtime and the framework. For example, watch out for random values that can be used to populate the inputs.
-- In situations when any user-side pre-processing should be tracked separately, consider :doc:`image pre-processing and conversion <../../openvino-workflow/running-inference/optimize-inference/optimize-preprocessing>`.
-- When applicable, leverage the :doc:`Dynamic Shapes support <../../openvino-workflow/running-inference/dynamic-shapes>`.
-- If possible, demand the same accuracy. For example, TensorFlow allows ``FP16`` execution, so when comparing to that, make sure to test the OpenVINO Runtime with the ``FP16`` as well.

+Running the benchmark application
+---------------------------------
+
+The benchmark_app includes a lot of device-specific options, but the primary usage is as simple
+as:
+
+.. code-block:: sh
+
+   benchmark_app -m <model> -d <device> -i <input>

-.. raw:: html
-
-   <h3>Internal Inference Performance Counters and Execution Graphs</h3>
-
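+Additional diagnostic options can be combined with this basic command. For example, the sketch
+below (the model path is a placeholder) also collects the per-layer performance counters and the
+execution graph discussed later on this page:
+
+.. code-block:: sh
+
+   benchmark_app -m model.xml -d CPU -niter 100 -pc -exec_graph_path exec_graph.xml
+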

+Each of the :doc:`OpenVINO supported devices <../compatibility-and-support/supported-devices>` +offers performance settings that contain command-line equivalents in the Benchmark app. -More detailed insights into inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs. +While these settings provide really low-level control for the optimal model performance on a +*specific* device, it is recommended to always start performance evaluation with the +:doc:`OpenVINO High-Level Performance Hints <../../openvino-workflow/running-inference/optimize-inference/high-level-performance-hints>` +first, like so: + +.. code-block:: sh + + # for throughput prioritization + benchmark_app -hint tput -m -d + # for latency prioritization + benchmark_app -hint latency -m -d + + +Internal Inference Performance Counters and Execution Graphs +------------------------------------------------------------- + +More detailed insights into inference performance breakdown can be achieved with device-specific +performance counters and/or execution graphs. Both :doc:`C++ and Python <../../learn-openvino/openvino-samples/benchmark-tool>` -versions of the *benchmark_app* support a ``-pc`` command-line parameter that outputs internal execution breakdown. +versions of the benchmark_app support a ``-pc`` command-line parameter that outputs an internal +execution breakdown. -For example, the table shown below is part of performance counters for quantized -`TensorFlow implementation of ResNet-50 `__ -model inference on :doc:`CPU Plugin <../../openvino-workflow/running-inference/inference-devices-and-modes/cpu-device>`. -Keep in mind that since the device is CPU, the ``realTime`` wall clock and the ``cpu`` time layers are the same. -Information about layer precision is also stored in the performance counters. +For example, the table below is part of performance counters for +:doc:`CPU inference <../../openvino-workflow/running-inference/inference-devices-and-modes/cpu-device>`. +of a `TensorFlow implementation of ResNet-50 `__ +Keep in mind that since the device is CPU, the ``realTime`` wall clock and the ``cpu`` time +layers are the same. Information about layer precision is also stored in the performance +counters. =========================================================== ============= ============== ===================== ================= ============== @@ -136,39 +213,63 @@ Information about layer precision is also stored in the performance counters. | The ``execStatus`` column of the table includes the following possible values: | - ``EXECUTED`` - the layer was executed by standalone primitive. -| - ``NOT_RUN`` - the layer was not executed by standalone primitive or was fused with another operation and executed in another layer primitive. +| - ``NOT_RUN`` - the layer was not executed by standalone primitive or was fused with + another operation and executed in another layer primitive. | -| The ``execType`` column of the table includes inference primitives with specific suffixes. The layers could have the following marks: -| - The ``I8`` suffix is for layers that had 8-bit data type input and were computed in 8-bit precision. +| The ``execType`` column of the table includes inference primitives with specific suffixes. + The layers could have the following marks: +| - The ``I8`` suffix is for layers that had 8-bit data type input and were computed in + 8-bit precision. | - The ``FP32`` suffix is for layers computed in 32-bit precision. 
| -| All ``Convolution`` layers are executed in ``int8`` precision. The rest of the layers are fused into Convolutions using post-operation optimization, - as described in :doc:`CPU Device <../../openvino-workflow/running-inference/inference-devices-and-modes/cpu-device>`. This contains layer names - (as seen in OpenVINO IR), type of the layer, and execution statistics. +| All ``Convolution`` layers are executed in ``int8`` precision. The rest of the layers are + fused into Convolutions using post-operation optimization, as described in + :doc:`CPU Device <../../openvino-workflow/running-inference/inference-devices-and-modes/cpu-device>`. + This contains layer names (as seen in OpenVINO IR), type of the layer, and execution + statistics. -Both *benchmark_app* versions also support the ``exec_graph_path`` command-line option. It requires OpenVINO to output the same execution -statistics per layer, but in the form of plugin-specific `Netron-viewable `__ graph to the specified file. +Both *benchmark_app* versions also support the ``exec_graph_path`` command-line option. +It requires OpenVINO to output the same execution statistics per layer, but in the form of +plugin-specific `Netron-viewable `__ graph to the specified file. + +Especially when performance-debugging +:doc:`latency <../../openvino-workflow/running-inference/optimize-inference/optimizing-latency>`, +note that the counters do not reflect the time spent in the ``plugin/device/driver/etc`` queues. +If the sum of the counters is too different from the latency of an inference request, consider +testing with less inference requests. For example, running single +:doc:`OpenVINO stream <../../openvino-workflow/running-inference/optimize-inference/optimizing-throughput>` +with multiple requests would produce nearly identical counters as running a single inference +request, while the actual latency can be quite different. + +Lastly, the performance statistics with both performance counters and execution graphs are +averaged, so such data for the +:doc:`inputs of dynamic shapes <../../openvino-workflow/running-inference/dynamic-shapes>` +should be measured carefully, preferably by isolating the specific shape and executing multiple +times in a loop, to gather reliable data. + +Use ITT to Get Performance Insights +-------------------------------------- + +In general, OpenVINO and its individual plugins are heavily instrumented with Intel® +Instrumentation and Tracing Technology (ITT). Therefore, you can also compile OpenVINO from the +source code with ITT enabled and use tools like +`Intel® VTune™ Profiler `__ +to get detailed inference performance breakdown and additional insights in the application-level +performance on the timeline view. + + +OpenVINO benchmarking (LLM) ++++++++++++++++++++++++++++++++ + +Large Language Models require a different benchmarking approach to static models. A detailed +description will be added soon. -Especially when performance-debugging the :doc:`latency <../../openvino-workflow/running-inference/optimize-inference/optimizing-latency>`, note that the counters -do not reflect the time spent in the ``plugin/device/driver/etc`` queues. If the sum of the counters is too different from the latency -of an inference request, consider testing with less inference requests. 
For example, running single
-:doc:`OpenVINO stream <../../openvino-workflow/running-inference/optimize-inference/optimizing-throughput>` with multiple requests would produce nearly identical
-counters as running a single inference request, while the actual latency can be quite different.
-
-Lastly, the performance statistics with both performance counters and execution graphs are averaged,
-so such data for the :doc:`inputs of dynamic shapes <../../openvino-workflow/running-inference/dynamic-shapes>` should be measured carefully,
-preferably by isolating the specific shape and executing multiple times in a loop, to gather reliable data.
-
-.. raw:: html
-
-   <h3>Use ITT to Get Performance Insights</h3>
-
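+Until then, as a rough illustration, LLMs are usually exported to OpenVINO IR with weight
+compression before any measurements. One way to do this is with Optimum Intel (the model ID,
+precision, and output directory below are placeholders):
+
+.. code-block:: sh
+
+   optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf \
+      --weight-format int4 llama-2-7b-chat-ov-int4
+
+Generation metrics, such as first-token latency and throughput in tokens per second, are then
+collected with a dedicated benchmarking script.
+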

-In general, OpenVINO and its individual plugins are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT). -Therefore, you can also compile OpenVINO from the source code with ITT enabled and use tools like -`Intel® VTune™ Profiler `__ to get detailed inference performance breakdown and additional -insights in the application-level performance on the timeline view. diff --git a/docs/articles_en/about-openvino/performance-benchmarks/model-accuracy-int8-fp32.rst b/docs/articles_en/about-openvino/performance-benchmarks/model-accuracy-int8-fp32.rst index 8b93e6a1aebe7b..3162bae7254704 100644 --- a/docs/articles_en/about-openvino/performance-benchmarks/model-accuracy-int8-fp32.rst +++ b/docs/articles_en/about-openvino/performance-benchmarks/model-accuracy-int8-fp32.rst @@ -4,9 +4,10 @@ Model Accuracy The following two tables present the absolute accuracy drop calculated as the accuracy difference -between OV-accuracy and the original frame work accuracy for FP32, and the same for INT8, BF16 and -FP16 representations of a model on three platform architectures. The third table presents the GenAI model accuracies as absolute accuracy values. Please also refer to notes below -the table for more information. +between OV-accuracy and the original framework accuracy for FP32, and the same for INT8, BF16, +and FP16 representations of a model on three platform architectures. The third table presents +the GenAI model accuracies as absolute accuracy values. Refer to notes below the table for more +information. * A - Intel® Core™ i9-9000K (AVX2), INT8 and FP32 * B - Intel® Xeon® 6338, (VNNI), INT8 and FP32 diff --git a/docs/sphinx_setup/_static/benchmarks_files/llm_models.csv b/docs/sphinx_setup/_static/benchmarks_files/llm_models.csv new file mode 100644 index 00000000000000..dee8e72a9578fd --- /dev/null +++ b/docs/sphinx_setup/_static/benchmarks_files/llm_models.csv @@ -0,0 +1,22 @@ +Model name,"Throughput: (tokens/sec. 2nd token)",1st token latency (msec),Max RSS memory used. (MB),Input tokens,Output tokens +OPT-2.7b,"20.2",2757,7084,937,128 +Phi-3-mini-4k-instruct,"19.9",2776,7028,1062,128 +Orca-mini-3b,"19.2",2966,7032,1024,128 +Phi-2,"17.8",2162,7032,1024,128 +Stable-Zephyr-3b-dpo,"17.0",1791,7007,946,128 +ChatGLM3-6b,"16.5",3569,6741,1024,128 +Dolly-v2-3b,"15.8",6891,6731,1024,128 +Stablelm-3b-4e1t,"15.7",2051,7018,1024,128 +Red-Pajama-Incite-Chat-3b-V1,"14.8",6582,7028,1020,128 +Falcon-7b-instruct,"14.5",4552,7033,1049,128 +Codegen25-7b,"13.3",3982,6732,1024,128 +GPT-j-6b,"13.2",7213,6882,1024,128 +Stablelm-7b,"12.8",6339,7013,1020,128 +Llama-3-8b,"12.8",4356,6953,1024,128 +Llama-2-7b-chat,"12.3",4205,6906,1024,128 +Llama-7b,"11.7",4315,6927,1024,128 +Mistral-7b-v0.1,"10.5",4462,7242,1007,128 +Zephyr-7b-beta,"10.5",4500,7039,1024,128 +Qwen1.5-7b-chat,"9.9",4318,7034,1024,128 +Baichuan2-7b-chat,"9.8",4668,6724,1024,128 +Qwen-7b-chat,"9.0",5141,6996,1024,128 \ No newline at end of file diff --git a/docs/sphinx_setup/_static/download/llm_models.csv b/docs/sphinx_setup/_static/download/llm_models.csv deleted file mode 100644 index 2ff93f503a6d3b..00000000000000 --- a/docs/sphinx_setup/_static/download/llm_models.csv +++ /dev/null @@ -1,22 +0,0 @@ -Model name,"Throughput: (tokens/sec. 2nd token)",1st token latency (msec),Max RSS memory used. 
(MB),Input tokens,Output tokens,Model Precision,Beam,Batch size,Framework -OPT-2.7b,20.2,2757,7084,937,128,INT4,1,1,PT -Phi-3-mini-4k-instruct,19.9,2776,7028,1062,128,INT4,1,1,PT -Orca-mini-3b,19.2,2966,7032,1024,128,INT4,1,1,PT -Phi-2,17.8,2162,7032,1024,128,INT4,1,1,PT -Stable-Zephyr-3b-dpo,17.0,1791,7007,946,128,INT4,1,1,PT -ChatGLM3-6b,16.5,3569,6741,1024,128,INT4,1,1,PT -Dolly-v2-3b,15.8,6891,6731,1024,128,INT4,1,1,PT -Stablelm-3b-4e1t,15.7,2051,7018,1024,128,INT4,1,1,PT -Red-Pajama-Incite-Chat-3b-V1,14.8,6582,7028,1020,128,INT4,1,1,PT -Falcon-7b-instruct,14.5,4552,7033,1049,128,INT4,1,1,PT -Codegen25-7b,13.3,3982,6732,1024,128,INT4,1,1,PT -GPT-j-6b,13.2,7213,6882,1024,128,INT4,1,1,PT -Stablelm-7b,12.8,6339,7013,1020,128,INT4,1,1,PT -Llama-3-8b,12.8,4356,6953,1024,128,INT4,1,1,PT -Llama-2-7b-chat,12.3,4205,6906,1024,128,INT4,1,1,PT -Llama-7b,11.7,4315,6927,1024,128,INT4,1,1,PT -Mistral-7b-v0.1,10.5,4462,7242,1007,128,INT4,1,1,PT -Zephyr-7b-beta,10.5,4500,7039,1024,128,INT4,1,1,PT -Qwen1.5-7b-chat,9.9,4318,7034,1024,128,INT4,1,1,PT -Baichuan2-7b-chat,9.8,4668,6724,1024,128,INT4,1,1,PT -Qwen-7b-chat,9.0,5141,6996,1024,128,INT4,1,1,PT \ No newline at end of file diff --git a/docs/sphinx_setup/_static/download/llm_models_ovms.csv b/docs/sphinx_setup/_static/download/llm_models_ovms.csv deleted file mode 100644 index d481fd3b6a56e8..00000000000000 --- a/docs/sphinx_setup/_static/download/llm_models_ovms.csv +++ /dev/null @@ -1,100 +0,0 @@ -Product,Model,Framework,Precision,Node,Request Rate,Throughput [tok/s],TPOT Mean Latency -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.2,92.75,75.75 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.3,137.89,98.6 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.4,182.68,144.36 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.5,227.02,238.54 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.6,259.06,679.07 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.7,267.24,785.75 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.8,267.77,815.11 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,0.9,270.01,827.09 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,1.0,268.92,840.1 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,2.0,269.6,847.81 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8380,inf,270.55,839.37 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,0.2,92.63,63.23 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,0.4,183.51,105.0 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,0.6,272.59,95.34 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,0.8,359.28,126.61 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.0,442.69,169.24 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.2,521.61,195.94 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.4,589.34,267.43 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.6,650.25,291.68 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,1.8,655.39,308.64 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,2.0,680.45,302.09 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8480+,inf,702.42,307.82 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,0.2,92.89,54.69 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 
8580,0.4,184.37,77.0 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,0.6,273.06,101.81 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,0.8,360.22,135.38 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.0,442.46,170.65 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.2,519.5,208.44 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.4,590.11,252.86 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.6,651.09,286.93 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,1.8,670.74,298.02 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,2.0,684.4,299.41 -ovms,meta-llama/Llama-2-7b-chat-hf,PT,INT8-CW,Xeon Platinum 8580,inf,701.91,305.9 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.2,79.24,73.06 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.3,118.42,90.31 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.4,157.04,113.23 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.5,193.85,203.97 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.6,232.36,253.17 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.7,260.56,581.45 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.8,271.97,761.05 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,0.9,273.36,787.74 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,1.0,272.54,811.37 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,2.0,278.07,809.3 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8380,inf,275.71,810.89 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,0.2,78.3,60.37 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,0.4,156.42,69.27 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,0.6,232.27,77.79 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,0.8,307.37,90.07 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.0,380.61,104.71 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.2,452.18,127.36 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.4,519.44,156.18 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.6,587.62,169.44 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,1.8,649.94,198.44 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,2.0,707.46,234.44 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8480+,inf,799.46,265.5 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,0.2,78.61,54.12 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,0.4,156.19,70.38 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,0.6,232.36,81.83 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,0.8,307.01,101.66 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,1.0,376.36,139.62 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,1.2,447.75,158.53 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,1.4,519.74,160.26 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,1.6,582.37,190.22 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 
8580,1.8,635.46,231.31 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,2.0,698.38,247.77 -ovms,meta-llama/Meta-Llama-3-8B-Instruct,PT,INT8-CW,Xeon Platinum 8580,inf,843.51,252.12 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.2,87.18,74.96 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.3,130.74,92.67 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.4,172.94,117.03 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.5,214.71,172.69 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.6,255.45,282.74 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.7,280.38,629.68 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.8,280.55,765.16 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,0.9,289.65,765.65 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,1.0,290.67,783.47 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,2.0,284.14,815.09 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8380,inf,290.39,793.52 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,0.2,88.9,60.04 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,0.4,176.5,70.24 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,0.6,262.04,77.01 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,0.8,346.01,95.29 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.0,427.37,114.16 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.2,507.86,138.56 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.4,582.58,150.72 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.6,655.61,166.64 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,1.8,717.9,216.76 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,2.0,774.3,233.49 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8480+,inf,873.93,245.31 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,0.2,88.92,56.33 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,0.4,175.99,72.72 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,0.6,261.96,84.24 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,0.8,346.78,101.67 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.0,427.85,128.33 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.2,506.17,150.01 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.4,581.72,167.61 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.6,651.97,190.91 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,1.8,713.2,222.56 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,2.0,771.17,232.08 -ovms,mistralai/Mistral-7B-v0.1,PT,INT8-CW,Xeon Platinum 8580,inf,839.74,253.74