# Neural Magic vLLM
## Overview
[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference that Neural Magic regularly contributes upstream improvements to. This fork, `nm-vllm`, is our opinionated focus on incorporating the latest LLM optimizations, such as quantization and sparsity, for enhanced performance.
## Installation
The [nm-vllm PyPI package](https://pypi.org/project/nm-vllm/) includes pre-compiled binaries for CUDA (version 12.1) kernels, streamlining the setup process. For other PyTorch or CUDA versions, please compile the package from source.
Install it using pip:
```bash
pip install nm-vllm
```
To use the weight-sparsity kernels, for example via `sparsity="sparse_w16a16"`, extend the installation with the `sparsity` extras:
```bash
pip install nm-vllm[sparsity]
```
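
If you need to build against a different PyTorch or CUDA version, a source build is the way to go. A minimal sketch of that flow (the repository URL is assumed; the editable `pip install -e .` step follows the standard vLLM build process):

```bash
# Clone the fork and build the kernels locally (repo URL assumed)
git clone https://github.com/neuralmagic/nm-vllm.git
cd nm-vllm

# Editable install compiles the C++/CUDA extensions for your local toolchain
pip install -e .
```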
## Quickstart
Neural Magic maintains a variety of sparse models on its Hugging Face organization profiles, [neuralmagic](https://huggingface.co/neuralmagic) and [nm-testing](https://huggingface.co/nm-testing). A collection of ready-to-use SparseGPT models is available [here](https://huggingface.co/collections/nm-testing/sparsegpt-llms-65ca6def5495933ab05cd439).
#### Model Inference with Marlin (4-bit Quantization)
Marlin is an extremely optimized FP16xINT4 matmul kernel aimed at LLM inference that can deliver close to ideal (4x) speedups up to batch sizes of 16-32 tokens.
To use Marlin within nm-vllm, simply pass a Marlin-quantized model directly to the engine. It will detect the quantization from the model's config.
Here is a demonstration with a [4-bit quantized Llama-2 7B chat](https://huggingface.co/neuralmagic/llama-2-7b-chat-marlin) model (the sampling settings below are illustrative):

```python
from vllm import LLM, SamplingParams

# Marlin quantization is detected automatically from the model's config
model = LLM("neuralmagic/llama-2-7b-chat-marlin")

# Illustrative sampling settings
sampling_params = SamplingParams(max_tokens=100)

outputs = model.generate("Who is the president?", sampling_params)
print(outputs[0].outputs[0].text)
```
#### Model Inference with Weight Sparsity
For a quick demonstration, here's how to run a small [50% sparse llama2-110M](https://huggingface.co/nm-testing/llama2.c-stories110M-pruned50) model trained on storytelling:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/llama2.c-stories110M-pruned50",
    sparsity="sparse_w16a16",  # If left off, model will be loaded as dense
)

# Illustrative sampling settings
sampling_params = SamplingParams(max_tokens=100)

outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
#### Integration with OpenAI-Compatible Server
You can also quickly use the same flow with an OpenAI-compatible model server:
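
As a sketch (assuming the server exposes a `--sparsity` flag that mirrors the Python `sparsity` argument and listens on the default port 8000), you could launch the server with the sparse model used above:

```bash
# Launch an OpenAI-compatible server backed by the sparse model
# (--sparsity is assumed to mirror the Python API's sparsity argument)
python -m vllm.entrypoints.openai.api_server \
    --model nm-testing/llama2.c-stories110M-pruned50 \
    --sparsity sparse_w16a16
```

Once the server is up, any OpenAI-style client can query the completions endpoint, for example with `curl`:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "nm-testing/llama2.c-stories110M-pruned50",
        "prompt": "Hello my name is",
        "max_tokens": 100
    }'
```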
Developed in collaboration with IST-Austria, [GPTQ](https://arxiv.org/abs/2210.17323) is the leading quantization algorithm for LLMs, enabling compression of model weights from 16 bits to 4 bits with limited impact on accuracy. nm-vllm includes support for the recently developed Marlin kernels for accelerating GPTQ models. Prior to Marlin, the existing kernels for INT4 inference failed to scale in scenarios with multiple concurrent users.
Developed in collaboration with IST-Austria, [SparseGPT](https://arxiv.org/abs/2301.00774) and [Sparse Fine-tuning](https://arxiv.org/abs/2310.06927) are the leading algorithms for pruning LLMs, enabling removal of at least half of the model weights with limited impact on accuracy.
nm-vllm includes support for newly developed sparse inference kernels, which provide both memory savings and inference acceleration by leveraging weight sparsity.