From 1f41cbae5d7c4a12da3e23dd1f0a33db44c9f900 Mon Sep 17 00:00:00 2001
From: Liubov Talamanova
Date: Mon, 21 Oct 2024 13:25:37 +0100
Subject: [PATCH] Update NNCF WC documentation (#27101)

Co-authored-by: Alexander Kozlov
Co-authored-by: Tatiana Savina
---
 .../weight-compression.rst | 46 ++++++++++++++++---
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
index 6348ca897c5ea5..47cfed977dc3df 100644
--- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
@@ -161,15 +161,16 @@ trade-offs after optimization:
   `Larger Group Size`: Results in faster inference and a smaller model, but might
   compromise accuracy.
 
-* ``ratio`` controls the ratio between INT4 and INT8_ASYM compressed layers in the model.
+* ``ratio`` controls the ratio between the layers compressed to the precision defined
+  by ``mode`` and the rest of the layers kept in ``backup_mode`` precision in the optimized model.
   Ratio is a decimal between 0 and 1. For example, 0.8 means that 80% of layers will be
-  compressed to INT4, while the rest will be compressed to INT8_ASYM precision. The default
-  value for ratio is 1.
+  compressed to the precision defined by ``mode``, while the rest will be compressed to
+  ``backup_mode`` precision. The default value for ratio is 1.
 
-  `Higher Ratio (more INT4)`: Reduces the model size and increase inference speed but
+  `Higher Ratio (more layers set to mode precision)`: Reduces the model size and increases inference speed but
   might lead to higher accuracy degradation.
 
-  `Lower Ratio (more INT8_ASYM)`: Maintains better accuracy but results in a larger model size
+  `Lower Ratio (more layers set to backup_mode precision)`: Maintains better accuracy but results in a larger model size
   and potentially slower inference.
 
 In this example, 90% of the model's layers are quantized to INT4 asymmetrically with
@@ -196,8 +197,11 @@ trade-offs after optimization:
   4 bits. The method can sometimes result in reduced accuracy when used with
   Dynamic Quantization of activations. Requires dataset.
 
+* ``gptq`` - boolean parameter that enables the GPTQ method for more accurate INT4 weight
+  quantization. Requires dataset.
+
 * ``dataset`` - calibration dataset for data-aware weight compression. It is required
-  for some compression options, for example, ``scale_estimation`` or ``awq``. Some types
+  for some compression options, for example, ``scale_estimation``, ``gptq``, or ``awq``. Some types
   of ``sensitivity_metric`` can use data for precision selection.
 
 * ``sensitivity_metric`` - controls the metric to estimate the sensitivity of compressing
@@ -226,6 +230,36 @@ trade-offs after optimization:
 * ``all_layers`` - boolean parameter that enables INT4 weight quantization of all
   Fully-Connected and Embedding layers, including the first and last layers in the model.
 
+* ``lora_correction`` - boolean parameter that enables the LoRA Correction Algorithm
+  to further improve the accuracy of INT4 compressed models on top of other
+  algorithms: AWQ and Scale Estimation.
+
+* ``backup_mode`` - defines a backup precision for mixed-precision weight compression.
+  There are three modes: ``INT8_ASYM``, ``INT8_SYM``, and ``NONE``, which retains the
+  original floating-point precision of the model weights (``INT8_ASYM`` is the default).
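+
+As a minimal sketch, the snippet below shows how several of these parameters can be
+combined in a single ``nncf.compress_weights()`` call. The names ``model`` and
+``nncf_dataset`` are placeholders for a prepared OpenVINO model and calibration
+dataset, and the parameter values are illustrative rather than recommended:
+
+.. code-block:: python
+
+   import nncf
+
+   # `model` and `nncf_dataset` are assumed to be prepared beforehand.
+   compressed_model = nncf.compress_weights(
+       model,
+       mode=nncf.CompressWeightsMode.INT4_SYM,
+       ratio=0.9,                              # 90% of layers in INT4
+       group_size=128,
+       dataset=nncf_dataset,                   # required by gptq
+       gptq=True,                              # data-aware INT4 quantization
+       backup_mode=nncf.BackupMode.INT8_ASYM,  # precision of the remaining 10%
+   )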
+
+
+**Use synthetic data for LLM weight compression**
+
+It is possible to generate a synthetic dataset using the ``nncf.data.generate_text_data`` method
+for data-aware weight compression. The method takes a language model (for example, from
+``optimum.intel.openvino``) and a tokenizer (for example, from ``transformers``) as input and
+returns a list of strings generated by the model. Note that dataset generation takes time and
+depends on various conditions, such as the model size, the requested dataset length, or the
+environment setup. Also, since the dataset is generated from the model's own output, it does not
+guarantee a significant accuracy improvement after compression. This method is recommended only
+when a better dataset is not available. Refer to the
+`example `__
+for usage details.
+
+.. code-block:: python
+
+   from nncf import Dataset
+   from nncf.data import generate_text_data
+
+   # Example: generating a synthetic dataset.
+   # `model`, `tokenizer`, and `transform_fn` are assumed to be defined earlier.
+   synthetic_data = generate_text_data(model, tokenizer)
+   nncf_dataset = Dataset(synthetic_data, transform_fn)
+
 For data-aware weight compression refer to the following
 `example `__.
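+
+For illustration, the sketch below puts the pieces together, from loading a model
+with ``optimum.intel.openvino`` to compressing it with the synthetic dataset. The
+model name, the ``transform_fn`` body, and the compression parameters are
+assumptions for the sake of the example, not a prescribed configuration:
+
+.. code-block:: python
+
+   import nncf
+   from nncf.data import generate_text_data
+   from optimum.intel.openvino import OVModelForCausalLM
+   from transformers import AutoTokenizer
+
+   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
+   model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
+   tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+   # Generate synthetic calibration texts with the model itself.
+   synthetic_data = generate_text_data(model, tokenizer)
+
+   # Convert each generated string into the inputs the model expects;
+   # the exact set of inputs depends on the particular model.
+   def transform_fn(text):
+       return tokenizer(text, return_tensors="np").data
+
+   nncf_dataset = nncf.Dataset(synthetic_data, transform_fn)
+
+   compressed = nncf.compress_weights(
+       model.model,  # the underlying ov.Model
+       mode=nncf.CompressWeightsMode.INT4_SYM,
+       dataset=nncf_dataset,
+       scale_estimation=True,
+   )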