
Commit 3b17b5e

Update README.md with sparsity and quantization explainers (#91)

1 parent 20d5ce9
1 file changed: README.md (+47 -7 lines)
@@ -1,19 +1,19 @@
 # Neural Magic vLLM
 
-## About
+## Overview
 
-[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving that Neural Magic regularly lands upstream improvements to. This fork is our opinionated focus on the latest LLM optimizations, such as quantization and sparsity.
+[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference that Neural Magic regularly contributes upstream improvements to. This fork, `nm-vllm`, is our opinionated focus on incorporating the latest LLM optimizations, such as quantization and sparsity, for enhanced performance.
 
 ## Installation
 
-`nm-vllm` is a Python library that contained pre-compiled C++ and CUDA (12.1) binaries.
+The [nm-vllm PyPI package](https://pypi.org/project/nm-vllm/) includes pre-compiled binaries for CUDA (version 12.1) kernels, streamlining the setup process. For other PyTorch or CUDA versions, please compile the package from source.
 
 Install it using pip:
 ```bash
 pip install nm-vllm
 ```
 
-In order to use the weight-sparsity kernels, like through `sparsity="sparse_w16a16"`, install the extras using:
+To use the weight-sparsity kernels, such as through `sparsity="sparse_w16a16"`, extend the installation with the `sparsity` extras:
 ```bash
 pip install nm-vllm[sparsity]
 ```
@@ -27,15 +27,33 @@ pip install -e .
 
 ## Quickstart
 
-There are many sparse models already pushed up on our HF organization profiles, [neuralmagic](https://huggingface.co/neuralmagic) and [nm-testing](https://huggingface.co/nm-testing). You can find [this collection of SparseGPT models ready for inference](https://huggingface.co/collections/nm-testing/sparsegpt-llms-65ca6def5495933ab05cd439).
+Neural Magic maintains a variety of sparse models on our Hugging Face organization profiles, [neuralmagic](https://huggingface.co/neuralmagic) and [nm-testing](https://huggingface.co/nm-testing). A collection of ready-to-use SparseGPT models is available [here](https://huggingface.co/collections/nm-testing/sparsegpt-llms-65ca6def5495933ab05cd439).
 
-Here is a smoke test using a small test `llama2-110M` model train on storytelling:
+#### Model Inference with Marlin (4-bit Quantization)
+
+Marlin is an extremely optimized FP16xINT4 matmul kernel aimed at LLM inference that can deliver close to ideal (4x) speedups up to batch sizes of 16-32 tokens.
+To use Marlin within nm-vllm, simply pass a Marlin-quantized model directly to the engine. It will detect the quantization from the model's config.
+
+Here is a demonstration with a [4-bit quantized Llama-2 7B chat](https://huggingface.co/neuralmagic/llama-2-7b-chat-marlin) model:
+
+```python
+from vllm import LLM, SamplingParams
+
+model = LLM("neuralmagic/llama-2-7b-chat-marlin")
+sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)
+outputs = model.generate("Who is the president?", sampling_params)
+print(outputs[0].outputs[0].text)
+```
+
+#### Model Inference with Weight Sparsity
+
+For a quick demonstration, here's how to run a small [50% sparse llama2-110M](https://huggingface.co/nm-testing/llama2.c-stories110M-pruned50) model trained on storytelling:
 
 ```python
 from vllm import LLM, SamplingParams
 
 model = LLM(
-    "nm-testing/llama2.c-stories110M-pruned2.4",
+    "nm-testing/llama2.c-stories110M-pruned50",
     sparsity="sparse_w16a16",   # If left off, model will be loaded as dense
 )
 
@@ -60,9 +78,31 @@ outputs = model.generate("Hello my name is", sampling_params=sampling_params)
 print(outputs[0].outputs[0].text)
 ```
 
+#### Integration with OpenAI-Compatible Server
+
 You can also quickly use the same flow with an OpenAI-compatible model server:
 ```bash
 python -m vllm.entrypoints.openai.api_server \
     --model nm-testing/OpenHermes-2.5-Mistral-7B-pruned50 \
     --sparsity sparse_w16a16
 ```
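
Once the server from the snippet above is running, any OpenAI-style client can query it. Here is a minimal client sketch using the `requests` package, assuming the server's default address of `http://localhost:8000` (adjust the host, port, and prompt as needed):

```python
# Minimal client sketch for the OpenAI-compatible server started above.
# The endpoint path and payload follow the standard OpenAI completions API;
# the host and port are the assumed defaults.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
        "prompt": "Hello my name is",
        "max_tokens": 50,
    },
)
print(response.json()["choices"][0]["text"])
```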
+
+## Quantized Inference
+
+Developed in collaboration with IST Austria, [GPTQ](https://arxiv.org/abs/2210.17323) is the leading quantization algorithm for LLMs, enabling compression of model weights from 16 bits to 4 bits with limited impact on accuracy. nm-vllm includes support for the recently-developed Marlin kernels for accelerating GPTQ models. Prior to Marlin, the existing kernels for INT4 inference failed to scale in scenarios with multiple concurrent users.
+
+<p align="center">
+   <img alt="Marlin Performance" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/6ac9f5b0-667a-41f3-8e6d-ca51c268bec5" width="60%" />
+</p>
+
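Since Marlin's advantage shows up under concurrent load, a batched-generation sketch may make this more concrete than the single-prompt quickstart. It reuses the Marlin checkpoint from the quickstart; the prompts and batch size are illustrative only:

```python
# Sketch: batched generation with the Marlin-quantized checkpoint shown in the
# quickstart, exercising the multi-request regime (batch sizes of ~16-32) where
# the Marlin kernels are designed to retain their speedup.
from vllm import LLM, SamplingParams

model = LLM("neuralmagic/llama-2-7b-chat-marlin")
sampling_params = SamplingParams(max_tokens=64, temperature=0.8, top_p=0.95)

# vLLM accepts a list of prompts and batches them in a single generate call.
prompts = [f"Question {i}: what does 4-bit quantization change?" for i in range(16)]
outputs = model.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```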
+## Sparse Inference
+
+Developed in collaboration with IST Austria, [SparseGPT](https://arxiv.org/abs/2301.00774) and [Sparse Fine-tuning](https://arxiv.org/abs/2310.06927) are the leading algorithms for pruning LLMs, which enable removing at least half of a model's weights with limited impact on accuracy.
+
+nm-vllm includes support for newly-developed sparse inference kernels, which provide both memory reduction and acceleration of sparse models by leveraging their weight sparsity.
+
+<p align="center">
+   <img alt="Sparse Memory Compression" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/2fdd2212-3081-4b97-b492-a809ce23fdd3" width="40%" />
+   <img alt="Sparse Inference Performance" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/3448e3ee-535f-4c50-ac9b-00645673cc8c" width="40%" />
+</p>
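
For illustration, the same `sparsity="sparse_w16a16"` flag from the quickstart also applies to the larger pruned checkpoint served above. A minimal offline sketch, assuming a GPU with enough memory for a 7B model (the prompt is illustrative):

```python
# Sketch: offline inference with the 50%-pruned OpenHermes checkpoint from the
# server example, loaded through the sparse weight kernels.
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",  # omit to load the checkpoint as dense
)
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)
outputs = model.generate("Explain what weight sparsity means for inference.", sampling_params)
print(outputs[0].outputs[0].text)
```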