[Hardware][Nvidia] Enable support for Pascal GPUs #4409

Conversation

@jasonacox (Contributor) commented Apr 27, 2024

[Hardware][Nvidia] Enable support for Pascal GPUs (sm_60, sm_61)

FIX: #963 #1284

Related: #4290 #2635

--

This is a new PR, kept open as a placeholder in the hope that the request to raise the PyPI wheel size limit above 100 MB is someday granted. It only adds compute capabilities 6.0 and 6.1. Note: the prebuilt PyTorch wheels now only include sm_60 (not sm_61) from the Pascal family:

>>> torch.__version__
'2.2.1+cu121'
>>> torch.cuda.get_arch_list()
['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']
>>> 
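
As a quick sanity check, here is a minimal sketch (assuming only a stock CUDA build of PyTorch) of how one might confirm whether the installed wheel ships Pascal kernels at all:

import torch

# Architectures the installed PyTorch wheel was compiled for.
arch_list = torch.cuda.get_arch_list()
print("Compiled arch list:", arch_list)

# The Pascal targets this PR adds to the vLLM build.
pascal_targets = {"sm_60", "sm_61"}
missing = pascal_targets - set(arch_list)
if missing:
    print("PyTorch wheel lacks kernels for:", sorted(missing))
else:
    print("Both Pascal targets are present in the PyTorch wheel.")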

Pascal Architecture

  • (+) SM60 or SM_60, compute_60 – Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
  • (+) SM61 or SM_61, compute_61 – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GT 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
  • (-) SM62 or SM_62, compute_62 – Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2
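
For anyone unsure which of these buckets their card falls into, a small hedged check (assumes a CUDA-enabled PyTorch install):

import torch

# Print the name and compute capability of every visible GPU.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")
    if (major, minor) in ((6, 0), (6, 1)):
        print("  Pascal card covered by this PR (sm_60 / sm_61)")
    elif (major, minor) == (6, 2):
        print("  sm_62 (Drive PX2 / Jetson TX2) is not targeted by this PR")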

Example test on a CUDA 12.2 system with 4 x P100 GPUs:

# build
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm-openai --no-cache

# run
docker run -d \
    --shm-size=10.24gb \
    --gpus '"device=0,1,2,3"' \
    -v /data/models:/root/.cache/huggingface \
    --env "HF_TOKEN=xyz" \
    -p 8000:8000 \
    --restart unless-stopped \
    --name vllm-openai \
    vllm-openai \
    --host 0.0.0.0 \
    --model=mistralai/Mistral-7B-Instruct-v0.1 \
    --enforce-eager \
    --dtype=float \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size=4
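
Once the container is up, a minimal hedged client check against the OpenAI-compatible endpoint (the port and model name match the run command above; adjust them to your setup):

import requests

# Assumes the server started above is listening on localhost:8000.
base_url = "http://localhost:8000/v1"

# List the models the server is serving.
print(requests.get(f"{base_url}/models").json())

# Send a small completion request to the model loaded above.
resp = requests.post(
    f"{base_url}/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "prompt": "Say hello from a Pascal GPU.",
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["text"])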

@sasha0552 (Contributor)

@youkaichao It looks like pypi/support#3792 has been approved. Is it possible to merge this PR now?

@sasha0552 (Contributor) commented May 19, 2024

(From Release Tracker)

#4409 might need a little bit more discussion given what features are supported for Pascal GPUs and whether building from source might be a better option.

I've been using vLLM on my P40s every day for almost a month now, and everything works fine. Triton didn't accept one of my patches (they said they have dropped support for pre-A100 GPUs, so I expect other older architectures will soon run into problems as well), so anything that depends on Triton and uses the tl.dot operation won't work (prefix caching, for example). However, there is a patched Triton (sasha0552/triton), and installing just the patched Triton is easier than installing both a patched Triton and a patched vLLM, especially since the basic functionality works fine without Triton.

Maybe the patched triton could be shipped like nccl (although not installed by default)? The patch is very simple, and I don't think it would be hard to maintain. I can maintain support for Pascal GPUs, if needed (I'm not going to move on from these GPUs until better options become available for the price per VRAM GB).
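
To make that concrete, a minimal sketch of how one might gate prefix caching on compute capability; the (8, 0) threshold reflects the "pre-A100" cutoff mentioned above and is an assumption, not an official Triton guarantee:

import torch

def prefix_caching_supported() -> bool:
    # Stock Triton's tl.dot path reportedly needs Ampere (sm_80) or newer;
    # a patched Triton (e.g. sasha0552/triton) relaxes this for Pascal.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 0)

# Only request prefix caching from vLLM when the GPU can actually run it.
enable_prefix_caching = prefix_caching_supported()
print("enable_prefix_caching =", enable_prefix_caching)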

P.S. Whoever is reading this, you might want to check out my project, which has pre-built vllm and triton wheels for Pascal GPUs (and also patches & build scripts).

@AslanEZ commented Jul 2, 2024

[Hardware][Nvidia] Enable support for Pascal GPUs (sm_60, sm_61) […]

Does this mean I can't run vLLM on a Tesla P4, even with a small model?

@jasonacox (Contributor, Author)

Does this mean I can't run vLLM on a Tesla P4, even with a small model?

@AslanEZ I believe the P4 has compute capability 6.1, which this PR adds. Have you tested it?

@AslanEZ commented Jul 3, 2024

Does this mean I can't run vLLM on a Tesla P4, even with a small model?

@AslanEZ I believe the P4 has compute capability 6.1, which this PR adds. Have you tested it?

I tested it by installing with pip, and it didn't work:

[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 323, in forward
[rank0]:   output[num_prefill_tokens:] = PagedAttention.forward_decode(
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I intend to try your code now.
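
For anyone hitting the same error with the pip wheels, a hedged way to confirm it is an architecture mismatch rather than a launch bug (and to get synchronous error reporting, as the traceback suggests):

import os

# Must be set before CUDA is initialized so errors surface at the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

major, minor = torch.cuda.get_device_capability()
device_arch = f"sm_{major}{minor}"
compiled = torch.cuda.get_arch_list()
print("Device:", device_arch, "| wheel compiled for:", compiled)
if device_arch not in compiled:
    print("No kernel image for this device: the wheel was not built for", device_arch)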

@AslanEZ commented Jul 3, 2024

Does this mean I can't run vLLM on a Tesla P4, even with a small model?

@AslanEZ I believe the P4 has compute capability 6.1, which this PR adds. Have you tested it?

Oh, it works! Thank you!

@dirkson commented Aug 12, 2024

Could we get an update on the status of this PR? I've been eagerly awaiting it, as I can't use vllm until it supports my hardware.

@sasha0552 (Contributor)
@dirkson It was answered here: #6434 (comment)

Merging this pull request may close: Support for compute capability <7.0