# Neural Magic vLLM
## Overview
[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference that Neural Magic regularly contributes upstream improvements to. This fork, `nm-vllm`, is our opinionated focus on incorporating the latest LLM optimizations, such as quantization and sparsity, for enhanced performance.
## Installation
The [nm-vllm PyPI package](https://pypi.org/project/nm-vllm/) includes pre-compiled binaries for CUDA (version 12.1) kernels, streamlining the setup process. For other PyTorch or CUDA versions, please compile the package from source.
Install it using pip:
```bash
pip install nm-vllm
```
To use the weight-sparsity kernels, for example via `sparsity="sparse_w16a16"`, extend the installation with the `sparsity` extras:
```bash
pip install nm-vllm[sparsity]
```
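
If you need to build against a different PyTorch or CUDA version, a source build is the way to go. A minimal sketch of that flow (the repository URL is assumed; the editable `pip install -e .` step follows the standard vLLM build process):

```bash
# Clone the fork and build the kernels locally (repo URL assumed)
git clone https://github.com/neuralmagic/nm-vllm.git
cd nm-vllm

# Editable install compiles the C++/CUDA extensions for your local toolchain
pip install -e .
```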
## Quickstart
Neural Magic maintains a variety of sparse models on its Hugging Face organization profiles, [neuralmagic](https://huggingface.co/neuralmagic) and [nm-testing](https://huggingface.co/nm-testing). A collection of ready-to-use SparseGPT models is available [here](https://huggingface.co/collections/nm-testing/sparsegpt-llms-65ca6def5495933ab05cd439).
#### Model Inference with Marlin (4-bit Quantization)
Marlin is an extremely optimized FP16xINT4 matmul kernel aimed at LLM inference that can deliver close to ideal (4x) speedups up to batch sizes of 16-32 tokens.
To use Marlin within nm-vllm, simply pass a Marlin-quantized model directly to the engine. It will detect the quantization from the model's config.
Here is a demonstration with a [4-bit quantized Llama-2 7B chat](https://huggingface.co/neuralmagic/llama-2-7b-chat-marlin) model (the sampling settings below are illustrative):

```python
from vllm import LLM, SamplingParams

# Marlin quantization is detected automatically from the model's config
model = LLM("neuralmagic/llama-2-7b-chat-marlin")

# Illustrative sampling settings
sampling_params = SamplingParams(max_tokens=100)

outputs = model.generate("Who is the president?", sampling_params)
print(outputs[0].outputs[0].text)
```
#### Model Inference with Weight Sparsity
For a quick demonstration, here's how to run a small [50% sparse llama2-110M](https://huggingface.co/nm-testing/llama2.c-stories110M-pruned50) model trained on storytelling:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/llama2.c-stories110M-pruned50",
    sparsity="sparse_w16a16",  # If left off, model will be loaded as dense
)

# Illustrative sampling settings
sampling_params = SamplingParams(max_tokens=100)

outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
#### Integration with OpenAI-Compatible Server
You can also quickly use the same flow with an OpenAI-compatible model server:
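
As a sketch (assuming the server exposes a `--sparsity` flag that mirrors the Python `sparsity` argument and listens on the default port 8000), you could launch the server with the sparse model used above:

```bash
# Launch an OpenAI-compatible server backed by the sparse model
# (--sparsity is assumed to mirror the Python API's sparsity argument)
python -m vllm.entrypoints.openai.api_server \
    --model nm-testing/llama2.c-stories110M-pruned50 \
    --sparsity sparse_w16a16
```

Once the server is up, any OpenAI-style client can query the completions endpoint, for example with `curl`:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "nm-testing/llama2.c-stories110M-pruned50",
        "prompt": "Hello my name is",
        "max_tokens": 100
    }'
```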
Developed in collaboration with IST-Austria, [GPTQ](https://arxiv.org/abs/2210.17323) is the leading quantization algorithm for LLMs, enabling compression of model weights from 16 bits to 4 bits with limited impact on accuracy. nm-vllm includes support for the recently developed Marlin kernels for accelerating GPTQ models. Prior to Marlin, the existing kernels for INT4 inference failed to scale in scenarios with multiple concurrent users.
Developed in collaboration with IST-Austria, [SparseGPT](https://arxiv.org/abs/2301.00774) and [Sparse Fine-tuning](https://arxiv.org/abs/2310.06927) are the leading algorithms for pruning LLMs, enabling removal of at least half of the model weights with limited impact on accuracy.
nm-vllm includes support for newly developed sparse inference kernels, which provide both memory savings and inference acceleration by leveraging weight sparsity.