Neural Magic vLLM

A fork of vLLM that adds support for sparse model inference.

To Run

Clone and install nm_gpu:

git clone https://github.com/neuralmagic/nm_gpu.git
cd nm_gpu
export TORCH_CUDA_ARCH_LIST=8.6
pip install -e .
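
The value 8.6 above targets GPUs with CUDA compute capability 8.6; other cards need a different value. As a rough sketch, assuming PyTorch is already installed and can see your GPU, the snippet below prints a matching export line. The helper names `to_arch_list` and `detect_capabilities` are illustrative and not part of nm_gpu or vLLM.

```python
def to_arch_list(capabilities):
    """Format (major, minor) compute capabilities as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))  # drop duplicates from identical GPUs
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_capabilities():
    """Return the compute capability of each visible GPU, or [] if none is usable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    caps = detect_capabilities()
    if caps:
        # Quote the value so the shell does not treat ';' as a command separator.
        print(f'export TORCH_CUDA_ARCH_LIST="{to_arch_list(caps)}"')
    else:
        print("No CUDA device detected; set TORCH_CUDA_ARCH_LIST manually.")
```

Run it before the `pip install -e .` step and paste the printed line into your shell in place of the hard-coded value.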

Then return to the parent directory and install this repository:

cd ../
pip install -e .

Run Sample

Run a 50% sparse model:

from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/Llama-2-7b-pruned50-retrained", 
    sparsity="sparse_w16a16",   # If left off, model will be loaded as dense
    enforce_eager=True,         # Does not work with cudagraphs yet
    dtype="float16",
    tensor_parallel_size=1,
    max_model_len=1024
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
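
The sample above sends a single prompt. `generate` also accepts a list of prompts and returns one result per prompt, so a small helper can pair each prompt with its completion. This is a minimal sketch: `completions_by_prompt` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the shape of vLLM's results so the snippet runs without a GPU.

```python
from types import SimpleNamespace


def completions_by_prompt(request_outputs):
    """Map each prompt to the text of its first completion."""
    return {out.prompt: out.outputs[0].text for out in request_outputs}


# Stand-in results with the same attributes the sample reads (.outputs[0].text)
# plus the .prompt attribute; the text values here are placeholders.
fake_outputs = [
    SimpleNamespace(prompt="Hello my name is", outputs=[SimpleNamespace(text=" ...")]),
    SimpleNamespace(prompt="The capital of France is", outputs=[SimpleNamespace(text=" ...")]),
]

for prompt, text in completions_by_prompt(fake_outputs).items():
    print(f"{prompt!r} -> {text!r}")

# With the model from the sample loaded, the equivalent call would be:
#   prompts = ["Hello my name is", "The capital of France is"]
#   outputs = model.generate(prompts, sampling_params=sampling_params)
#   results = completions_by_prompt(outputs)
```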
