From 555334dec6d708da1f9601cd0ee0084585beb768 Mon Sep 17 00:00:00 2001
From: mgoin
Date: Mon, 10 Jun 2024 21:36:54 +0000
Subject: [PATCH] Fix inline code and add checkpoint format

---
 docs/source/quantization/fp8.rst | 83 ++++++++++++++++++++++----------
 1 file changed, 58 insertions(+), 25 deletions(-)

diff --git a/docs/source/quantization/fp8.rst b/docs/source/quantization/fp8.rst
index a7aaae8a2805..0c88d8d71509 100644
--- a/docs/source/quantization/fp8.rst
+++ b/docs/source/quantization/fp8.rst
@@ -9,21 +9,22 @@ Please visit the HF collection of `quantized FP8 checkpoints of popular LLMs rea
 The FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:
 
-- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
-- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
+- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and ``nan``.
+- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.
 
 Quick Start with Online Dynamic Quantization
 -------------------------------------
 
-Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.
+Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying ``--quantization="fp8"`` in the command line or setting ``quantization="fp8"`` in the LLM constructor.
 
-In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
+In this mode, all Linear modules (except for the final ``lm_head``) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
 
 .. code-block:: python
 
     from vllm import LLM
     model = LLM("facebook/opt-125m", quantization="fp8")
     # INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
+    result = model.generate("Hello, my name is")
 
 .. warning::
 
     Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
@@ -39,12 +40,12 @@ For offline quantization to FP8, please install the `AutoFP8 library
+`_ contained within quantized checkpoints specified through the ``.kv_scale`` parameter present on the Attention Module, such as:
+
+.. code-block:: text
+
     model.layers.0.self_attn.kv_scale < F32
\ No newline at end of file
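
For reference (not part of the patch): the E4M3/E5M2 ranges quoted in the bullets above can be sanity-checked against the FP8 dtypes exposed by recent PyTorch builds. A minimal sketch, assuming a PyTorch version (>= 2.1) that provides ``torch.float8_e4m3fn`` and ``torch.float8_e5m2``:

.. code-block:: python

    import torch

    # Expect max = 448 for E4M3 (float8_e4m3fn) and max = 57344 for E5M2,
    # matching the values stated in the docs above.
    for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        info = torch.finfo(dtype)
        print(f"{dtype}: min={info.min}, max={info.max}")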
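
Similarly, here is an illustrative way to confirm that a downloaded FP8 checkpoint actually carries the kv cache scaling factors referenced at the end of the diff. This is a sketch only: the single-file ``model.safetensors`` name is an assumption about the checkpoint layout, and the ``safetensors`` package must be installed.

.. code-block:: python

    from safetensors import safe_open

    # List every per-layer kv cache scale stored in the checkpoint shard,
    # e.g. model.layers.0.self_attn.kv_scale as an FP32 scalar.
    with safe_open("model.safetensors", framework="pt") as f:
        for name in f.keys():
            if name.endswith("kv_scale"):
                tensor = f.get_tensor(name)
                print(name, tensor.dtype, tuple(tensor.shape))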