[Performance]: V1 higher memory usage #12529
Comments
Hi @wedobetter, thanks for reporting the issue. We will take a look at the memory profiling part of V1. Meanwhile, could you please try lowering …
I think you could also try reducing …
I have noticed similar issues for VLMs as well. See Slack thread. cc @ywang96
As I stated, I know how to reconfigure the execution parameters. My observation is that the OOM was encountered just by upgrading from the latest 0.6.x to vLLM 0.7.0 and enabling V1, while keeping all the runtime parameters the same.
Thanks for your input. I generally set that to 1, fearing that CPU offloading can significantly affect performance, but I have probably confused the gpu-memory-utilization and cpu_offload_gb options.
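For anyone else mixing these two up: gpu-memory-utilization is the fraction of each GPU's memory vLLM is allowed to claim, while cpu_offload_gb moves part of the model weights to host RAM. A minimal sketch with vLLM's Python API, using a placeholder model and assumed values (not a recommended configuration):

```python
from vllm import LLM

# gpu_memory_utilization: fraction of each GPU's VRAM vLLM may claim (0.0-1.0).
# cpu_offload_gb: GiB of model weights offloaded to CPU RAM per GPU
# (0 keeps everything on the GPU; >0 saves VRAM at a throughput cost).
llm = LLM(
    model="facebook/opt-125m",    # placeholder model, not from this thread
    gpu_memory_utilization=0.90,  # assumed value
    cpu_offload_gb=0,             # assumed value
)
```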
Thanks for the feedback. Note for contributors: another place to look is torch.compile and the number of CUDA graphs we use.
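One way to check whether CUDA graph capture is part of the extra memory is to disable it and compare. A rough diagnostic sketch (enforce_eager is a standard engine argument; attributing the regression to CUDA graphs is only an assumption here):

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture, so no extra memory is reserved
# for graphs. Useful as a diagnostic only: it usually lowers decode throughput.
llm = LLM(
    model="facebook/opt-125m",  # placeholder model
    enforce_eager=True,
)
```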
Can you elaborate a bit more on what exactly is dynamic?
I can confirm increased memory consumption with Qwen/Qwen2.5-32B-Instruct-AWQ.
I have the same issue. When deploying the DeepSeek R1 32B distilled model, the GPU memory usage under the V1 engine is higher than under V0. This causes the V1 engine to run out of memory (OOM) under the same configuration, while V0 does not.
I have the same issue, and it is blocking me from using the V1 architecture. I tried the same configuration as with V0 and also tried reducing gpu-memory-utilization from 0.95 to 0.8, but with no luck; it always results in an OOM on startup. Also, the server takes extremely long to warm up compared to V0.
Proposal to improve performance
No response
Report of performance regression
Hardware: 4x RTX 3070 = 32GB VRAM
Issue: With 0.6.x I was able to run Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4 with a 12K context length. Now, with 0.7.0 + VLLM_USE_V1=1, I cannot push the context length higher than 3K without encountering a CUDA OOM error. Of course, I can reconfigure it to avoid the OOM; my question is: is V1 expected to consume more memory?
Some of the libraries:
VLLM command
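(The exact command is not reproduced above. Purely as an illustration of the setup described in this report, a hypothetical launch could look like the following: the model name, the 4-GPU tensor parallelism, and the 12K context length come from the report, while every other value is an assumption.)

```python
import os

# The V1 engine flag discussed in this issue; set before constructing the engine.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4",
    tensor_parallel_size=4,       # 4x RTX 3070 from the report
    max_model_len=12288,          # ~12K context length from the report
    gpu_memory_utilization=0.95,  # assumed value, not from the report
)
```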
Thanks
Misc discussion on performance
No response
Your current environment (if you think it is necessary)