[bitsandbytes] Add bitsandbytes doc
thesues committed Jun 27, 2024
1 parent 3377572 commit 608f276
Showing 2 changed files with 42 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -102,6 +102,7 @@ Documentation

quantization/supported_hardware
quantization/auto_awq
quantization/bnb
quantization/fp8
quantization/fp8_e5m2_kvcache
quantization/fp8_e4m3_kvcache
41 changes: 41 additions & 0 deletions docs/source/quantization/bnb.rst
@@ -0,0 +1,41 @@
.. _bits_and_bytes:

BitsAndBytes
==================

vLLM now supports `BitsAndBytes <https://github.com/TimDettmers/bitsandbytes>`_ for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
This is particularly useful for deploying large language models in resource-constrained environments.
Below are the steps to utilize BitsAndBytes with vLLM.

.. code-block:: console

   $ pip install "bitsandbytes>=0.42.0"
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.
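To illustrate the distinction, a pre-quantized checkpoint advertises its quantization in the model's ``config.json``, whereas an unquantized model has no such section. A minimal sketch of that check (the exact keys shown are an assumption modeled on typical Hugging Face bitsandbytes configs, not vLLM's internal detection logic):

```python
# Illustrative: a pre-quantized bitsandbytes checkpoint typically carries a
# "quantization_config" section in its config.json, e.g.:
config = {
    "model_type": "llama",
    "quantization_config": {
        "load_in_4bit": True,           # assumed key, as in HF 4-bit configs
        "quant_method": "bitsandbytes",  # assumed key naming the method
    },
}


def is_prequantized(cfg: dict) -> bool:
    """Return True if the config advertises bitsandbytes quantization."""
    qc = cfg.get("quantization_config")
    return bool(qc) and qc.get("quant_method") == "bitsandbytes"


print(is_prequantized(config))                    # pre-quantized checkpoint
print(is_prequantized({"model_type": "llama"}))   # plain checkpoint
```

For a checkpoint like the one above, pass ``quantization="bitsandbytes"`` with ``load_format="bitsandbytes"`` and vLLM loads the already-quantized weights; for a plain checkpoint, the same flags trigger in-flight quantization instead.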

Read quantized checkpoints
--------------------------

.. code-block:: python

   from vllm import LLM
   import torch

   # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
   model_id = "unsloth/tinyllama-bnb-4bit"
   llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
             quantization="bitsandbytes", load_format="bitsandbytes")

In-flight quantization: load as 4-bit quantization
--------------------------------------------------

.. code-block:: python

   from vllm import LLM
   import torch

   model_id = "huggyllama/llama-7b"
   llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
             quantization="bitsandbytes", load_format="bitsandbytes")
