This repository was archived by the owner on Oct 11, 2024. It is now read-only.

Commit 310191b

robertgshaw2-redhat and derekk-nm authored and committed
Pruned Readme (#313)
1 parent b863a5c commit 310191b

File tree: 1 file changed (+10, -107 lines)


README.md (+10, -107)
@@ -5,17 +5,20 @@
 
 ## Overview
 
-[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference that Neural Magic regularly contributes upstream improvements to. This fork, `nm-vllm` is our opinionated focus on incorporating the latest LLM optimizations like quantization and sparsity for enhanced performance.
+[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference that Neural Magic regularly contributes to.
+
+`nm-vllm` is our supported enterprise distribution of vLLM.
 
 ## Installation
+
 The [nm-vllm PyPi package](https://pypi.neuralmagic.com/simple/nm-vllm/index.html) includes pre-compiled binaries for CUDA (version 12.1) kernels, streamlining the setup process. For other PyTorch or CUDA versions, please compile the package from source.
 
 Install it using pip:
 ```bash
 pip install nm-vllm --extra-index-url https://pypi.neuralmagic.com/simple
 ```
 
-For utilizing weight-sparsity kernels, such as through `sparsity="sparse_w16a16"`, you can extend the installation with the `sparsity` extras:
+To utilize the weight sparsity features, include the optional `sparse` dependencies.
 ```bash
 pip install nm-vllm[sparse] --extra-index-url https://pypi.neuralmagic.com/simple
 ```
@@ -24,111 +27,11 @@ You can also build and install `nm-vllm` from source (this will take ~10 minutes
 ```bash
 git clone https://github.com/neuralmagic/nm-vllm.git
 cd nm-vllm
-pip install -e .
-```
-
-## Quickstart
-
-Neural Magic maintains a variety of sparse models on our Hugging Face organization profiles, [neuralmagic](https://huggingface.co/neuralmagic) and [nm-testing](https://huggingface.co/nm-testing).
-
-A collection of ready-to-use SparseGPT and GPTQ models in inference-optimized Marlin format is [available on Hugging Face](https://huggingface.co/collections/neuralmagic/compressed-llms-for-nm-vllm-65e73e3d51d3200e34b77431).
-
-#### Model Inference with Marlin (4-bit Quantization)
-
-Marlin is an extremely optimized FP16xINT4 matmul kernel aimed at LLM inference that can deliver close to ideal (4x) speedups up to batch sizes of 16-32 tokens.
-To use Marlin within nm-vllm, simply pass the Marlin quantized model directly to the engine. It will detect the quantization from the model's config.
-
-Here is a demonstration with a [4-bit quantized OpenHermes Mistral](https://huggingface.co/neuralmagic/OpenHermes-2.5-Mistral-7B-marlin) model:
-
-```python
-from vllm import LLM, SamplingParams
-from transformers import AutoTokenizer
-
-model_id = "neuralmagic/OpenHermes-2.5-Mistral-7B-marlin"
-model = LLM(model_id, max_model_len=4096)
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)
-
-messages = [
-    {"role": "user", "content": "What is synthetic data in machine learning?"},
-]
-formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
-print(outputs[0].outputs[0].text)
-```
-
-#### Model Inference with Weight Sparsity
-
-For a quick demonstration, here's how to run a small [50% sparse llama2-110M](https://huggingface.co/nm-testing/llama2.c-stories110M-pruned50) model trained on storytelling:
-
-```python
-from vllm import LLM, SamplingParams
-
-model = LLM(
-    "neuralmagic/llama2.c-stories110M-pruned50",
-    sparsity="sparse_w16a16",  # If left off, model will be loaded as dense
-)
-
-sampling_params = SamplingParams(max_tokens=100, temperature=0)
-outputs = model.generate("Hello my name is", sampling_params=sampling_params)
-print(outputs[0].outputs[0].text)
+pip install -e .[sparse] --extra-index-url https://pypi.neuralmagic.com/simple
 ```
 
-Here is a more realistic example of running a 50% sparse OpenHermes 2.5 Mistral 7B model finetuned for instruction-following:
-
-```python
-from vllm import LLM, SamplingParams
-from transformers import AutoTokenizer
-
-model_id = "neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50"
-model = LLM(model_id, sparsity="sparse_w16a16", max_model_len=4096)
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)
-
-messages = [
-    {"role": "user", "content": "What is sparsity in deep learning?"},
-]
-formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
-print(outputs[0].outputs[0].text)
-```
-
-There is also support for semi-structured 2:4 sparsity using the `sparsity="semi_structured_sparse_w16a16"` argument:
-```python
-from vllm import LLM, SamplingParams
-
-model = LLM("neuralmagic/llama2.c-stories110M-pruned2.4", sparsity="semi_structured_sparse_w16a16")
-sampling_params = SamplingParams(max_tokens=100, temperature=0)
-outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
-print(outputs[0].outputs[0].text)
-```
-
-#### Integration with OpenAI-Compatible Server
-
-You can also quickly use the same flow with an OpenAI-compatible model server:
-```bash
-python -m vllm.entrypoints.openai.api_server \
-    --model neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50 \
-    --sparsity sparse_w16a16 \
-    --max-model-len 4096
-```
-
-## Quantized Inference Performance
-
-Developed in collaboration with IST-Austria, [GPTQ](https://arxiv.org/abs/2210.17323) is the leading quantization algorithm for LLMs, which enables compressing the model weights from 16 bits to 4 bits with limited impact on accuracy. nm-vllm includes support for the recently-developed Marlin kernels for accelerating GPTQ models. Prior to Marlin, the existing kernels for INT4 inference failed to scale in scenarios with multiple concurrent users.
-
-<p align="center">
-  <img alt="Marlin Performance" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/6ac9f5b0-667a-41f3-8e6d-ca51c268bec5" width="60%" />
-</p>
-
-## Sparse Inference Performance
-
-Developed in collaboration with IST-Austria, [SparseGPT](https://arxiv.org/abs/2301.00774) and [Sparse Fine-tuning](https://arxiv.org/abs/2310.06927) are the leading algorithms for pruning LLMs, which enable removing at least half of model weights with limited impact on accuracy.
-
-nm-vllm includes support for newly-developed sparse inference kernels, which provide both memory reduction and acceleration by leveraging sparsity.
-
-<p align="center">
-  <img alt="Sparse Memory Compression" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/2fdd2212-3081-4b97-b492-a809ce23fdd3" width="40%" />
-  <img alt="Sparse Inference Performance" src="https://github.com/neuralmagic/nm-vllm/assets/3195154/3448e3ee-535f-4c50-ac9b-00645673cc8c" width="40%" />
-</p>
+## Models
 
+Neural Magic maintains a variety of optimized models on our Hugging Face organization profiles:
+- [neuralmagic](https://huggingface.co/neuralmagic)
+- [nm-testing](https://huggingface.co/nm-testing)
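The `## Models` section added above lists the Hugging Face organizations but no longer carries a usage example; the checkpoints they host load through the same `LLM`/`SamplingParams` API shown in the removed Quickstart. A minimal sketch, assuming the sparse `neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50` checkpoint referenced in the removed section is still published and that `nm-vllm` was installed with the `[sparse]` extra:

```python
# Minimal sketch (not part of this commit), mirroring the removed Quickstart example.
# Assumes: pip install nm-vllm[sparse] --extra-index-url https://pypi.neuralmagic.com/simple
# and that the sparse OpenHermes checkpoint is still available on Hugging Face.
from vllm import LLM, SamplingParams

model = LLM(
    "neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",  # omit to load the checkpoint as dense
    max_model_len=4096,
)
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)
outputs = model.generate("What is sparsity in deep learning?", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```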

0 commit comments
