From 1f41cbae5d7c4a12da3e23dd1f0a33db44c9f900 Mon Sep 17 00:00:00 2001
From: Liubov Talamanova
Date: Mon, 21 Oct 2024 13:25:37 +0100
Subject: [PATCH] Update NNCF WC documentation (#27101)

Co-authored-by: Alexander Kozlov
Co-authored-by: Tatiana Savina
---
 .../weight-compression.rst | 46 ++++++++++++++++---
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
index 6348ca897c5ea5..47cfed977dc3df 100644
--- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
@@ -161,15 +161,16 @@ trade-offs after optimization:
   `Larger Group Size`: Results in faster inference and a smaller model, but might
   compromise accuracy.
 
-* ``ratio`` controls the ratio between INT4 and INT8_ASYM compressed layers in the model.
+* ``ratio`` controls the ratio between the layers compressed to the precision defined
+  by ``mode`` and the rest of the layers kept in ``backup_mode`` precision in the optimized model.
   Ratio is a decimal between 0 and 1. For example, 0.8 means that 80% of layers will be
-  compressed to INT4, while the rest will be compressed to INT8_ASYM precision. The default
-  value for ratio is 1.
+  compressed to the precision defined by ``mode``, while the rest will be compressed to
+  ``backup_mode`` precision. The default value for ratio is 1.
 
-  `Higher Ratio (more INT4)`: Reduces the model size and increase inference speed but
+  `Higher Ratio (more layers set to mode precision)`: Reduces the model size and increases inference speed but
   might lead to higher accuracy degradation.
 
-  `Lower Ratio (more INT8_ASYM)`: Maintains better accuracy but results in a larger model size
+  `Lower Ratio (more layers set to backup_mode precision)`: Maintains better accuracy but results in a larger model size
   and potentially slower inference.
 
 In this example, 90% of the model's layers are quantized to INT4 asymmetrically with
@@ -196,8 +197,11 @@ trade-offs after optimization:
   4 bits. The method can sometimes result in reduced accuracy when used with
   Dynamic Quantization of activations. Requires dataset.
 
+* ``gptq`` - boolean parameter that enables the GPTQ method for more accurate INT4 weight
+  quantization. Requires dataset.
+
 * ``dataset`` - calibration dataset for data-aware weight compression. It is required
-  for some compression options, for example, ``scale_estimation`` or ``awq``. Some types
+  for some compression options, for example, ``scale_estimation``, ``gptq``, or ``awq``. Some types
   of ``sensitivity_metric`` can use data for precision selection.
 
 * ``sensitivity_metric`` - controls the metric to estimate the sensitivity of compressing
@@ -226,6 +230,36 @@ trade-offs after optimization:
 * ``all_layers`` - boolean parameter that enables INT4 weight quantization of all
   Fully-Connected and Embedding layers, including the first and last layers in the model.
 
+* ``lora_correction`` - boolean parameter that enables the LoRA Correction Algorithm
+  to further improve the accuracy of INT4 compressed models on top of other
+  algorithms: AWQ and Scale Estimation.
+
+* ``backup_mode`` - defines a backup precision for mixed-precision weight compression.
+  There are three modes: ``INT8_ASYM``, ``INT8_SYM``, and ``NONE``, which retains the
+  original floating-point precision of the model weights (``INT8_ASYM`` is the default).
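+
+As a minimal sketch, the snippet below shows how several of these parameters can be
+combined in a single ``nncf.compress_weights()`` call. The names ``model`` and
+``nncf_dataset`` are placeholders for a prepared OpenVINO model and calibration
+dataset, and the parameter values are illustrative rather than recommended:
+
+.. code-block:: python
+
+   import nncf
+
+   # `model` and `nncf_dataset` are assumed to be prepared beforehand.
+   compressed_model = nncf.compress_weights(
+       model,
+       mode=nncf.CompressWeightsMode.INT4_SYM,
+       ratio=0.9,                              # 90% of layers in INT4
+       group_size=128,
+       dataset=nncf_dataset,                   # required by gptq
+       gptq=True,                              # data-aware INT4 quantization
+       backup_mode=nncf.BackupMode.INT8_ASYM,  # precision of the remaining 10%
+   )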
+
+
+**Use synthetic data for LLM weight compression**
+
+It is possible to generate a synthetic dataset using the ``nncf.data.generate_text_data`` method
+for data-aware weight compression. The method takes a language model (for example, from
+``optimum.intel.openvino``) and a tokenizer (for example, from ``transformers``) as input and
+returns a list of strings generated by the model. Note that dataset generation takes time and
+depends on various conditions, such as the model size, the requested dataset length, or the
+environment setup. Also, since the dataset is generated from the model's own output, it does not
+guarantee a significant accuracy improvement after compression. This method is recommended only
+when a better dataset is not available. Refer to the
+`example `__
+for usage details.
+
+.. code-block:: python
+
+   from nncf import Dataset
+   from nncf.data import generate_text_data
+
+   # Example: generating a synthetic dataset.
+   # `model`, `tokenizer`, and `transform_fn` are assumed to be defined earlier.
+   synthetic_data = generate_text_data(model, tokenizer)
+   nncf_dataset = Dataset(synthetic_data, transform_fn)
+
 For data-aware weight compression refer to the following
 `example `__.
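+
+For illustration, the sketch below puts the pieces together, from loading a model
+with ``optimum.intel.openvino`` to compressing it with the synthetic dataset. The
+model name, the ``transform_fn`` body, and the compression parameters are
+assumptions for the sake of the example, not a prescribed configuration:
+
+.. code-block:: python
+
+   import nncf
+   from nncf.data import generate_text_data
+   from optimum.intel.openvino import OVModelForCausalLM
+   from transformers import AutoTokenizer
+
+   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
+   model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
+   tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+   # Generate synthetic calibration texts with the model itself.
+   synthetic_data = generate_text_data(model, tokenizer)
+
+   # Convert each generated string into the inputs the model expects;
+   # the exact set of inputs depends on the particular model.
+   def transform_fn(text):
+       return tokenizer(text, return_tensors="np").data
+
+   nncf_dataset = nncf.Dataset(synthetic_data, transform_fn)
+
+   compressed = nncf.compress_weights(
+       model.model,  # the underlying ov.Model
+       mode=nncf.CompressWeightsMode.INT4_SYM,
+       dataset=nncf_dataset,
+       scale_estimation=True,
+   )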