Update NNCF WC documentation (openvinotoolkit#27101)

Co-authored-by: Alexander Kozlov <alexander.kozlov@intel.com>
Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>

3 people authored Oct 21, 2024
1 parent 85253c4 commit 1f41cba
Showing 1 changed file with 40 additions and 6 deletions.
@@ -161,15 +161,16 @@ trade-offs after optimization:
   `Larger Group Size`: Results in faster inference and a smaller model, but might
   compromise accuracy.

-* ``ratio`` controls the ratio between INT4 and INT8_ASYM compressed layers in the model.
+* ``ratio`` controls the ratio between the layers compressed to the precision defined
+  by ``mode`` and the rest of the layers, which are kept in ``backup_mode`` in the optimized model.
   Ratio is a decimal between 0 and 1. For example, 0.8 means that 80% of layers will be
-  compressed to INT4, while the rest will be compressed to INT8_ASYM precision. The default
-  value for ratio is 1.
+  compressed to the precision defined by ``mode``, while the rest will be compressed to
+  ``backup_mode`` precision. The default value for ratio is 1.

-  `Higher Ratio (more INT4)`: Reduces the model size and increase inference speed but
+  `Higher Ratio (more layers set to mode precision)`: Reduces the model size and increases inference speed but
   might lead to higher accuracy degradation.

-  `Lower Ratio (more INT8_ASYM)`: Maintains better accuracy but results in a larger model size
+  `Lower Ratio (more layers set to backup_mode precision)`: Maintains better accuracy but results in a larger model size
   and potentially slower inference.

 In this example, 90% of the model's layers are quantized to INT4 asymmetrically with
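As a minimal sketch of how ``mode``, ``ratio``, and ``group_size`` combine in a ``nncf.compress_weights`` call: the 0.9 below mirrors the 90% example referenced above, while the group size of 64 is an illustrative choice, and ``model`` is assumed to be an already-loaded model:

.. code-block:: python

    import nncf

    # 90% of eligible layers are compressed to 4-bit asymmetric precision;
    # the remaining 10% stay in the backup precision (INT8_ASYM by default).
    compressed_model = nncf.compress_weights(
        model,
        mode=nncf.CompressWeightsMode.INT4_ASYM,
        ratio=0.9,
        group_size=64,
    )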
@@ -196,8 +197,11 @@ trade-offs after optimization:
   4 bits. The method can sometimes result in reduced accuracy when used with
   Dynamic Quantization of activations. Requires dataset.

+* ``gptq`` - boolean parameter that enables the GPTQ method for more accurate INT4 weight
+  quantization. Requires dataset.
+
 * ``dataset`` - calibration dataset for data-aware weight compression. It is required
-  for some compression options, for example, ``scale_estimation`` or ``awq``. Some types
+  for some compression options, for example, ``scale_estimation``, ``gptq`` or ``awq``. Some types
   of ``sensitivity_metric`` can use data for precision selection.

 * ``sensitivity_metric`` - controls the metric to estimate the sensitivity of compressing
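A hedged sketch of the data-aware options above; ``data_source``, ``tokenizer``, and ``model`` are assumed to exist, and the transform shown is just one plausible shape for a text dataset:

.. code-block:: python

    import nncf

    # Assumption: each sample is a dict with a "text" field and the model
    # accepts tokenizer output directly.
    def transform_fn(item):
        return tokenizer(item["text"], return_tensors="np")

    calibration_dataset = nncf.Dataset(data_source, transform_fn)

    compressed_model = nncf.compress_weights(
        model,
        mode=nncf.CompressWeightsMode.INT4_SYM,
        ratio=0.8,
        dataset=calibration_dataset,  # required once gptq, awq, or scale_estimation is enabled
        gptq=True,
        sensitivity_metric=nncf.SensitivityMetric.MAX_ACTIVATION_VARIANCE,
    )

Here ``sensitivity_metric`` steers which 80% of layers receive INT4, while ``gptq`` controls how those INT4 weights are computed.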
@@ -226,6 +230,36 @@ trade-offs after optimization:
 * ``all_layers`` - boolean parameter that enables INT4 weight quantization of all
   Fully-Connected and Embedding layers, including the first and last layers in the model.

+* ``lora_correction`` - boolean parameter that enables the LoRA Correction Algorithm
+  to further improve the accuracy of INT4-compressed models on top of other
+  algorithms: AWQ and Scale Estimation.
+
+* ``backup_mode`` - defines a backup precision for mixed-precision weight compression.
+  There are three modes: ``INT8_ASYM``, ``INT8_SYM``, and ``NONE``, which retains
+  the original floating-point precision of the model weights (``INT8_ASYM`` is the default value).
+
+
+**Use synthetic data for LLM weight compression**
+
+It is possible to generate a synthetic dataset using the ``nncf.data.generate_text_data`` method for
+data-aware weight compression. The method takes a language model (e.g. from ``optimum.intel.openvino``)
+and a tokenizer (e.g. from ``transformers``) as input and returns a list of strings generated by the model.
+Note that dataset generation takes time and depends on various conditions, like the model size, the
+requested dataset length, or the environment setup. Also, since the dataset is generated from the model's
+own output, it does not guarantee a significant accuracy improvement after compression. This method is
+recommended only when a better dataset is not available. Refer to the
+`example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama_synthetic_data>`__
+for usage details.
+
+.. code-block:: python
+
+    from nncf import Dataset
+    from nncf.data import generate_text_data
+
+    # Example: generating a synthetic dataset
+    synthetic_data = generate_text_data(model, tokenizer)
+    nncf_dataset = Dataset(synthetic_data, transform_fn)
+
+For data-aware weight compression, refer to the following
+`example <https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino/tiny_llama>`__.

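Putting the pieces together: a minimal, hedged sketch of feeding such a synthetic dataset into a compression run that also exercises ``lora_correction`` and ``backup_mode``. Here ``model``, ``tokenizer``, and ``transform_fn`` are the same assumed objects as in the snippet above, and the option values are illustrative, not prescriptive:

.. code-block:: python

    import nncf
    from nncf.data import generate_text_data

    # Generate calibration texts from the model itself (can be slow, and the
    # resulting accuracy gain is not guaranteed).
    synthetic_data = generate_text_data(model, tokenizer)
    nncf_dataset = nncf.Dataset(synthetic_data, transform_fn)

    compressed_model = nncf.compress_weights(
        model,
        mode=nncf.CompressWeightsMode.INT4_SYM,
        dataset=nncf_dataset,
        awq=True,
        scale_estimation=True,
        lora_correction=True,              # refines INT4 error on top of AWQ and Scale Estimation
        backup_mode=nncf.BackupMode.NONE,  # keep non-INT4 layers in original floating-point precision
    )

Combining ``lora_correction`` with AWQ and Scale Estimation follows the ``lora_correction`` description above; GPTQ is left off here since it is an alternative accuracy-restoration path.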
