From ee0a025ca087266de53dbb3bef777eb8593b6c3d Mon Sep 17 00:00:00 2001
From: Wanglei Shen
Date: Fri, 31 May 2024 19:58:30 +0800
Subject: [PATCH] [DOCS] introduce performance hint and threads scheduling in
 CPU inference (#24147)

### Details:
 - *introduce performance hint and threads scheduling in CPU inference*

### Tickets:
 - *CVS-138834*

---------

Co-authored-by: Sun Xiaoxia
---
 .../cpu-device.rst                            |  71 +-------
 ...erformance-hint-and-threads-scheduling.rst | 162 ++++++++++++++++++
 2 files changed, 169 insertions(+), 64 deletions(-)
 create mode 100644 docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/cpu-device/performance-hint-and-threads-scheduling.rst

diff --git a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/cpu-device.rst b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/cpu-device.rst
index 6f817349800590..b45ff8140031e6 100644
--- a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/cpu-device.rst
+++ b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/cpu-device.rst
@@ -238,7 +238,7 @@ property is set for CPU plugin, then multiple streams are created for the model.
 host thread, which means that incoming infer requests can be processed simultaneously.
 Each stream is pinned to its own group of physical cores with respect to NUMA nodes
 physical memory usage to minimize overhead on data transfer between NUMA nodes.
-For more details, see the :doc:`optimization guide <../optimize-inference>`.
+For more details, see the :doc:`optimization guide <../optimize-inference>` and the :doc:`threads scheduling introduction <cpu-device/performance-hint-and-threads-scheduling>`.
 
 .. note::
 
@@ -246,6 +246,11 @@ For more details, see the :doc:`optimization guide <../optimize-inference>`.
    on data transfer between NUMA nodes. In that case it is better to use the ``ov::hint::PerformanceMode::LATENCY`` performance hint.
    For more details see the :doc:`performance hints <../optimize-inference/high-level-performance-hints>` overview.
 
+   .. toctree::
+      :maxdepth: 1
+      :hidden:
+
+      cpu-device/performance-hint-and-threads-scheduling
 
 Dynamic Shapes
 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
@@ -387,70 +392,8 @@ Multi-Threading Optimization
 
 CPU inference will infer an input or multiple inputs in parallel on multiple logical processors.
 
-User can use the following properties to limit available CPU resource for model inference. If the platform or operating system can support this behavior, OpenVINO Runtime will perform multi-threading scheduling based on limited available CPU.
+For more details, see the :doc:`threads scheduling introduction <cpu-device/performance-hint-and-threads-scheduling>`.
 
-- ``ov::inference_num_threads`` limits number of logical processors used for CPU inference.
-  If the number set by the user is greater than the number of logical processors on the platform, multi-threading scheduler only uses the platform number for CPU inference.
-- ``ov::hint::scheduling_core_type`` limits the type of CPU cores for CPU inference when user runs inference on a hybird platform that includes both Performance-cores (P-cores) with Efficient-cores (E-cores).
-  If user platform only has one type of CPU cores, this property has no effect, and CPU inference always uses this unique core type.
-- ``ov::hint::enable_hyper_threading`` limits the use of one or two logical processors per CPU core when platform has CPU hyperthreading enabled.
-  If there is only one logical processor per CPU core, such as Efficient-cores, this property has no effect, and CPU inference uses all logical processors.
-
-.. tab-set::
-
-   .. tab-item:: Python
-      :sync: py
-
-      .. doxygensnippet:: docs/articles_en/assets/snippets/multi_threading.py
-         :language: python
-         :fragment: [ov:intel_cpu:multi_threading:part0]
-
-   .. tab-item:: C++
-      :sync: cpp
-
-      .. doxygensnippet:: docs/articles_en/assets/snippets/multi_threading.cpp
-         :language: cpp
-         :fragment: [ov:intel_cpu:multi_threading:part0]
-
-
-.. note::
-
-   ``ov::hint::scheduling_core_type`` and ``ov::hint::enable_hyper_threading`` only support Intel® x86-64 CPU on Linux and Windows in current release.
-
-In some use cases, OpenVINO Runtime will enable CPU threads pinning by default for better performance. User can also turn it on or off using property ``ov::hint::enable_cpu_pinning``. Disable threads pinning might be beneficial in complex applications with several workloads executed in parallel. The following table describes the default setting for ``ov::hint::enable_cpu_pinning`` in different use cases.
-
-==================================================== ================================
- Use Case                                             Default Setting of CPU Pinning
-==================================================== ================================
- All use cases with Windows OS                        False
- Stream contains both Pcore and Ecore with Linux OS   False
- Stream only contains Pcore or Ecore with Linux OS    True
- All use cases with Mac OS                            False
-==================================================== ================================
-
-.. tab-set::
-
-   .. tab-item:: Python
-      :sync: py
-
-      .. doxygensnippet:: docs/articles_en/assets/snippets/multi_threading.py
-         :language: python
-         :fragment: [ov:intel_cpu:multi_threading:part1]
-
-   .. tab-item:: C++
-      :sync: cpp
-
-      .. doxygensnippet:: docs/articles_en/assets/snippets/multi_threading.cpp
-         :language: cpp
-         :fragment: [ov:intel_cpu:multi_threading:part1]
-
-
-For details on multi-stream execution check the
-:doc:`optimization guide <../optimize-inference/optimizing-throughput/advanced_throughput_options>`.
-
-.. note::
-
-   ``ov::hint::enable_cpu_pinning`` is not supported on multi-socket platforms with Windows OS.
 
 Denormals Optimization
 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

diff --git a/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/cpu-device/performance-hint-and-threads-scheduling.rst b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/cpu-device/performance-hint-and-threads-scheduling.rst
new file mode 100644
index 00000000000000..93c8c0bd6b36c7
--- /dev/null
+++ b/docs/articles_en/openvino-workflow/running-inference/inference-devices-and-modes/cpu-device/performance-hint-and-threads-scheduling.rst
@@ -0,0 +1,162 @@
+.. {#openvino_docs_OV_UG_supported_plugins_CPU_Hints_Threading}
+
+Performance Hints and Threads Scheduling
+========================================
+
+.. meta::
+   :description: Threads scheduling in the CPU plugin of OpenVINO™ Runtime
+                 detects the CPU architecture and sets low-level properties
+                 based on performance hints automatically.
+
+While all supported devices in OpenVINO offer low-level performance settings, it is advisable not to use these settings widely unless targeting specific platforms and models. The recommended approach is to configure performance in OpenVINO Runtime with the high-level performance hint property ``ov::hint::performance_mode``. Performance hints ensure optimal portability and scalability of applications across various platforms and models.
+
+To simplify the configuration of hardware devices, OpenVINO offers two performance hints: the latency hint ``ov::hint::PerformanceMode::LATENCY`` and the throughput hint ``ov::hint::PerformanceMode::THROUGHPUT``.
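+
+For instance, a hint is applied when a model is compiled for the CPU device. The following
+is a minimal sketch, not an official snippet: it assumes the ``openvino`` Python package
+property API and a hypothetical model path.
+
+.. code-block:: python
+
+   import openvino as ov
+   import openvino.properties.hint as hints
+
+   core = ov.Core()
+   model = core.read_model("model.xml")  # hypothetical model path
+
+   # One high-level hint; the CPU plugin derives all low-level
+   # threads scheduling properties from it automatically.
+   compiled_model = core.compile_model(
+       model, "CPU", {hints.performance_mode: hints.PerformanceMode.LATENCY}
+   )
+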
+The hints configure the following low-level properties automatically. Users may also set these properties directly to limit the CPU resources available for inference:
+
+- ``ov::inference_num_threads`` limits the number of logical processors used for CPU inference.
+  If the number set by the user is greater than the number of logical processors on the platform, the multi-threading scheduler uses only the platform number for CPU inference.
+- ``ov::num_streams`` limits the number of infer requests that can be run in parallel.
+  If the number set by the user is greater than the number of inference threads, the multi-threading scheduler uses only the number of inference threads, ensuring that there is at least one thread per stream.
+- ``ov::hint::scheduling_core_type`` limits the type of CPU cores used for inference on a hybrid platform that includes both Performance-cores (P-cores) and Efficient-cores (E-cores).
+  If the platform has only one type of CPU core, this property has no effect, and CPU inference always uses that core type.
+- ``ov::hint::enable_hyper_threading`` limits the use of one or two logical processors per CPU core when the platform has CPU hyper-threading enabled.
+  If there is only one logical processor per CPU core, as on E-cores, this property has no effect, and CPU inference uses all logical processors.
+- ``ov::hint::enable_cpu_pinning`` enables CPU pinning during CPU inference.
+  If the user enables this property but the inference scenario does not support it, the property is disabled during model compilation.
+
+For additional details on the above configurations, refer to:
+
+- :doc:`Multi-stream Execution <../../optimize-inference/optimizing-throughput/advanced_throughput_options>`
+
+Latency Hint
+###################################
+
+In this scenario, the default setting of ``ov::hint::scheduling_core_type`` is determined by the model precision and the ratio of E-cores to P-cores.
+
+.. note::
+
+   P-cores is short for Performance-cores and E-cores for Efficient-cores. Both core types have been available since the 12th Gen Intel® Core™ processors.
+
+.. _Core Type Table of Latency Hint:
+
++----------------------------+---------------------+---------------------+
+|                            | INT8 model          | FP32 model          |
++============================+=====================+=====================+
+| E-cores / P-cores < 2      | P-cores             | P-cores             |
++----------------------------+---------------------+---------------------+
+| 2 <= E-cores / P-cores < 4 | P-cores             | P-cores and E-cores |
++----------------------------+---------------------+---------------------+
+| 4 <= E-cores / P-cores     | P-cores and E-cores | P-cores and E-cores |
++----------------------------+---------------------+---------------------+
+
+.. note::
+
+   Both P-cores and E-cores may be used for any configuration starting with the 14th Gen Intel® Core™ processors on Windows.
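+
+If the core type chosen by the hint does not fit a given workload, the default can be
+overridden when compiling the model. This is a minimal sketch, not an official snippet:
+it assumes the ``openvino`` Python package property API and a hypothetical model path.
+
+.. code-block:: python
+
+   import openvino as ov
+   import openvino.properties.hint as hints
+
+   core = ov.Core()
+   model = core.read_model("model.xml")  # hypothetical model path
+
+   # Keep the latency hint, but restrict scheduling to P-cores only.
+   compiled_model = core.compile_model(
+       model,
+       "CPU",
+       {
+           hints.performance_mode: hints.PerformanceMode.LATENCY,
+           hints.scheduling_core_type: hints.SchedulingCoreType.PCORE_ONLY,
+       },
+   )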
+
+With the latency hint, the default settings of low-level performance properties on Windows and Linux are as follows:
+
++--------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
+| Property                             | Windows                                                        | Linux                                                          |
++======================================+================================================================+================================================================+
+| ``ov::num_streams``                  | 1                                                              | 1                                                              |
++--------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
+| ``ov::inference_num_threads``        | equals the number of P-cores or P-cores+E-cores on one socket  | equals the number of P-cores or P-cores+E-cores on one socket  |
++--------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
+| ``ov::hint::scheduling_core_type``   | `Core Type Table of Latency Hint`_                             | `Core Type Table of Latency Hint`_                             |
++--------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
+| ``ov::hint::enable_hyper_threading`` | No                                                             | No                                                             |
++--------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
+| ``ov::hint::enable_cpu_pinning``     | No / Not Supported                                             | Yes, except when P-cores and E-cores are used together         |
++--------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
+
+.. note::
+
+   - ``ov::hint::scheduling_core_type`` may be adjusted for a particular inferred model on a particular platform, based on internal heuristics, to guarantee the best performance.
+   - Both P-cores and E-cores are used for the latency hint on Intel® Core™ Ultra processors on Windows, except in the case of large language models.
+   - When hyper-threading is enabled, two logical processors share the hardware resources of one CPU core. OpenVINO does not expect to use both logical processors in one stream for a single infer request, so ``ov::hint::enable_hyper_threading`` is set to ``No`` in this scenario.
+   - ``ov::hint::enable_cpu_pinning`` is disabled by default on Windows and macOS, and enabled on Linux. These defaults are aligned with typical workloads running in the corresponding environments to guarantee better out-of-the-box performance.
+
+Throughput Hint
+######################################
+
+In this scenario, threads scheduling first evaluates the memory pressure of the model being inferred on the current platform, and determines the number of threads per stream, as shown below.
+
++-----------------+-----------------------+
+| Memory Pressure | Threads per Stream    |
++=================+=======================+
+| low             | 1 P-core or 2 E-cores |
++-----------------+-----------------------+
+| medium          | 2                     |
++-----------------+-----------------------+
+| high            | 3, 4, or 5            |
++-----------------+-----------------------+
+
+Then the value of ``ov::num_streams`` is calculated as ``ov::inference_num_threads`` divided by the number of threads per stream.
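+
+For example, 16 inference threads with a medium-pressure model (2 threads per stream) yield
+8 streams. The derived values can be inspected on the compiled model; a minimal sketch,
+assuming the ``openvino`` Python package property API and a hypothetical model path:
+
+.. code-block:: python
+
+   import openvino as ov
+   import openvino.properties as props
+   import openvino.properties.hint as hints
+
+   core = ov.Core()
+   model = core.read_model("model.xml")  # hypothetical model path
+   compiled_model = core.compile_model(
+       model, "CPU", {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}
+   )
+
+   # Values derived by threads scheduling from the throughput hint:
+   print(compiled_model.get_property(props.num_streams))
+   print(compiled_model.get_property(props.inference_num_threads))
+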
+The default settings of low-level performance properties on Windows and Linux are as follows:
+
++--------------------------------------+-------------------------------+-------------------------------+
+| Property                             | Windows                       | Linux                         |
++======================================+===============================+===============================+
+| ``ov::num_streams``                  | Calculated as above           | Calculated as above           |
++--------------------------------------+-------------------------------+-------------------------------+
+| ``ov::inference_num_threads``        | Number of P-cores and E-cores | Number of P-cores and E-cores |
++--------------------------------------+-------------------------------+-------------------------------+
+| ``ov::hint::scheduling_core_type``   | P-cores and E-cores           | P-cores and E-cores           |
++--------------------------------------+-------------------------------+-------------------------------+
+| ``ov::hint::enable_hyper_threading`` | Yes / No                      | Yes / No                      |
++--------------------------------------+-------------------------------+-------------------------------+
+| ``ov::hint::enable_cpu_pinning``     | No                            | Yes                           |
++--------------------------------------+-------------------------------+-------------------------------+
+
+.. note::
+
+   - By default, different core types are not mixed within a single stream in this scenario, and cores from different NUMA nodes are not mixed within a single stream.
+
+Multi-Threading Optimization
+##############################################
+
+Users can apply the following properties to limit the CPU resources available for model inference. If the platform or operating system supports this behavior, OpenVINO Runtime performs multi-threading scheduling based on the limited available CPU resources.
+
+- ``ov::inference_num_threads``
+- ``ov::hint::scheduling_core_type``
+- ``ov::hint::enable_hyper_threading``
+
+.. tab-set::
+
+   .. tab-item:: Python
+      :sync: py
+
+      .. doxygensnippet:: docs/articles_en/assets/snippets/multi_threading.py
+         :language: python
+         :fragment: [ov:intel_cpu:multi_threading:part0]
+
+   .. tab-item:: C++
+      :sync: cpp
+
+      .. doxygensnippet:: docs/articles_en/assets/snippets/multi_threading.cpp
+         :language: cpp
+         :fragment: [ov:intel_cpu:multi_threading:part0]
+
+
+.. note::
+
+   ``ov::hint::scheduling_core_type`` and ``ov::hint::enable_hyper_threading`` are supported only on Intel® x86-64 CPUs with Linux and Windows in the current release.
+
+In some use cases, OpenVINO Runtime enables CPU thread pinning by default for better performance. Users can also turn it on or off with the ``ov::hint::enable_cpu_pinning`` property. Disabling thread pinning may be beneficial in complex applications with several workloads executed in parallel.
+
+.. tab-set::
+
+   .. tab-item:: Python
+      :sync: py
+
+      .. doxygensnippet:: docs/articles_en/assets/snippets/multi_threading.py
+         :language: python
+         :fragment: [ov:intel_cpu:multi_threading:part1]
+
+   .. tab-item:: C++
+      :sync: cpp
+
+      .. doxygensnippet:: docs/articles_en/assets/snippets/multi_threading.cpp
+         :language: cpp
+         :fragment: [ov:intel_cpu:multi_threading:part1]
+
+
+For details on multi-stream execution, check the
+:doc:`optimization guide <../../optimize-inference/optimizing-throughput/advanced_throughput_options>`.
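+
+As a final illustration, the properties above can be combined in a single configuration.
+This is a hedged sketch, not an official snippet: it assumes the ``openvino`` Python
+package property API, a hypothetical model path, and example values.
+
+.. code-block:: python
+
+   import openvino as ov
+   import openvino.properties as props
+   import openvino.properties.hint as hints
+
+   core = ov.Core()
+   model = core.read_model("model.xml")  # hypothetical model path
+
+   # Limit inference to 8 threads on P-cores, without hyper-threading
+   # and with thread pinning explicitly disabled.
+   compiled_model = core.compile_model(
+       model,
+       "CPU",
+       {
+           props.inference_num_threads: 8,
+           hints.scheduling_core_type: hints.SchedulingCoreType.PCORE_ONLY,
+           hints.enable_hyper_threading: False,
+           hints.enable_cpu_pinning: False,
+       },
+   )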