[bitsandbytes] Add bitsandbytes doc
thesues committed Jun 27, 2024
1 parent 3377572 commit 608f276
Showing 2 changed files with 42 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -102,6 +102,7 @@ Documentation

quantization/supported_hardware
quantization/auto_awq
quantization/bnb
quantization/fp8
quantization/fp8_e5m2_kvcache
quantization/fp8_e4m3_kvcache
41 changes: 41 additions & 0 deletions docs/source/quantization/bnb.rst
@@ -0,0 +1,41 @@
.. _bits_and_bytes:

BitsAndBytes
==================

vLLM now supports `BitsAndBytes <https://github.com/TimDettmers/bitsandbytes>`_ for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
This is particularly useful for deploying large language models in resource-constrained environments.
Below are the steps to utilize BitsAndBytes with vLLM.

.. code-block:: console

   $ pip install "bitsandbytes>=0.42.0"
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.
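To illustrate the distinction, a pre-quantized checkpoint advertises its quantization in the model's ``config.json``, whereas an unquantized model has no such section. A minimal sketch of that check (the exact keys shown are an assumption modeled on typical Hugging Face bitsandbytes configs, not vLLM's internal detection logic):

```python
# Illustrative: a pre-quantized bitsandbytes checkpoint typically carries a
# "quantization_config" section in its config.json, e.g.:
config = {
    "model_type": "llama",
    "quantization_config": {
        "load_in_4bit": True,           # assumed key, as in HF 4-bit configs
        "quant_method": "bitsandbytes",  # assumed key naming the method
    },
}


def is_prequantized(cfg: dict) -> bool:
    """Return True if the config advertises bitsandbytes quantization."""
    qc = cfg.get("quantization_config")
    return bool(qc) and qc.get("quant_method") == "bitsandbytes"


print(is_prequantized(config))                    # pre-quantized checkpoint
print(is_prequantized({"model_type": "llama"}))   # plain checkpoint
```

For a checkpoint like the one above, pass ``quantization="bitsandbytes"`` with ``load_format="bitsandbytes"`` and vLLM loads the already-quantized weights; for a plain checkpoint, the same flags trigger in-flight quantization instead.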

Read quantized checkpoints
--------------------------

.. code-block:: python

   from vllm import LLM
   import torch

   # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
   model_id = "unsloth/tinyllama-bnb-4bit"
   llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
             quantization="bitsandbytes", load_format="bitsandbytes")

In-flight quantization: load as 4-bit quantization
--------------------------------------------------

.. code-block:: python

   from vllm import LLM
   import torch

   model_id = "huggyllama/llama-7b"
   llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
             quantization="bitsandbytes", load_format="bitsandbytes")
