## Overview
[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference that Neural Magic regularly contributes to.

`nm-vllm` is our supported enterprise distribution of vLLM.
## Installation
The [nm-vllm PyPI package](https://pypi.neuralmagic.com/simple/nm-vllm/index.html) includes pre-compiled binaries for CUDA (version 12.1) kernels, streamlining the setup process. For other PyTorch or CUDA versions, please compile the package from source.
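
As a minimal sketch, assuming the index root above serves the pre-built wheels, installation would look like:

```bash
pip install nm-vllm --extra-index-url https://pypi.neuralmagic.com/simple
```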
Neural Magic maintains a variety of sparse models on our Hugging Face organization profiles, [neuralmagic](https://huggingface.co/neuralmagic) and [nm-testing](https://huggingface.co/nm-testing).

A collection of ready-to-use SparseGPT and GPTQ models in the inference-optimized Marlin format is [available on Hugging Face](https://huggingface.co/collections/neuralmagic/compressed-llms-for-nm-vllm-65e73e3d51d3200e34b77431).
#### Model Inference with Marlin (4-bit Quantization)

Marlin is an extremely optimized FP16xINT4 matmul kernel aimed at LLM inference that can deliver close-to-ideal (4x) speedups at batch sizes of up to 16-32 tokens.

To use Marlin within nm-vllm, simply pass a Marlin-quantized model directly to the engine; it will detect the quantization from the model's config.

Here is a demonstration with a [4-bit quantized OpenHermes Mistral](https://huggingface.co/neuralmagic/OpenHermes-2.5-Mistral-7B-marlin) model:
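
The snippet below is a minimal sketch using the standard vLLM generation API; the prompt, `max_model_len`, and sampling settings are illustrative.

```python
from vllm import LLM, SamplingParams

# Marlin quantization is detected automatically from the model's config
model = LLM("neuralmagic/OpenHermes-2.5-Mistral-7B-marlin", max_model_len=4096)

sampling_params = SamplingParams(max_tokens=50, temperature=0)
outputs = model.generate("Hello my name is", sampling_params)  # illustrative prompt
print(outputs[0].outputs[0].text)
```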
For a quick demonstration, here's how to run a small [50% sparse llama2-110M](https://huggingface.co/nm-testing/llama2.c-stories110M-pruned50) model trained on storytelling:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "neuralmagic/llama2.c-stories110M-pruned50",
    sparsity="sparse_w16a16",  # If left off, model will be loaded as dense
)

sampling_params = SamplingParams(max_tokens=100)
outputs = model.generate("Once upon a time, ", sampling_params)  # illustrative prompt
print(outputs[0].outputs[0].text)
```
Developed in collaboration with IST-Austria, [GPTQ](https://arxiv.org/abs/2210.17323) is the leading quantization algorithm for LLMs, enabling compression of model weights from 16 bits to 4 bits with limited impact on accuracy. nm-vllm includes support for the recently developed Marlin kernels for accelerating GPTQ models. Prior to Marlin, the existing kernels for INT4 inference failed to scale in scenarios with multiple concurrent users.

Developed in collaboration with IST-Austria, [SparseGPT](https://arxiv.org/abs/2301.00774) and [Sparse Fine-tuning](https://arxiv.org/abs/2310.06927) are the leading algorithms for pruning LLMs, enabling removal of at least half of the model weights with limited impact on accuracy.

nm-vllm includes support for newly developed sparse inference kernels, which provide both memory reduction and acceleration of sparse models by leveraging the sparsity in the weights.