[Performance]: V1 higher memory usage #12529
Comments
Hi @wedobetter, thanks for reporting the issue. We will take a look at the memory profiling part of V1. Meanwhile, could you please try lowering …
I think you could also try reducing …
I have noticed similar issues for VLMs as well. See Slack thread. cc @ywang96
As I stated, I know how to reconfigure the execution parameters. My observation is that the OOM was encountered just by upgrading from the latest 0.6.x to vLLM 0.7.0 and enabling V1, while keeping all the runtime parameters the same.
Thanks for your input. I generally set that to 1, fearing that CPU offloading can significantly affect performance, but I have probably confused the gpu-memory-utilization and cpu_offload_gb options.
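For anyone else mixing these two up: gpu-memory-utilization is the fraction of each GPU's memory vLLM is allowed to claim, while cpu_offload_gb moves part of the model weights to host RAM. A minimal sketch with vLLM's Python API, using a placeholder model and assumed values (not a recommended configuration):

```python
from vllm import LLM

# gpu_memory_utilization: fraction of each GPU's VRAM vLLM may claim (0.0-1.0).
# cpu_offload_gb: GiB of model weights offloaded to CPU RAM per GPU
# (0 keeps everything on the GPU; >0 saves VRAM at a throughput cost).
llm = LLM(
    model="facebook/opt-125m",    # placeholder model, not from this thread
    gpu_memory_utilization=0.90,  # assumed value
    cpu_offload_gb=0,             # assumed value
)
```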
Thanks for the feedback. Note for contributors: another place to look is torch.compile and the number of CUDA graphs we use.
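One way to check whether CUDA graph capture is part of the extra memory is to disable it and compare. A rough diagnostic sketch (enforce_eager is a standard engine argument; attributing the regression to CUDA graphs is only an assumption here):

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture, so no extra memory is reserved
# for graphs. Useful as a diagnostic only: it usually lowers decode throughput.
llm = LLM(
    model="facebook/opt-125m",  # placeholder model
    enforce_eager=True,
)
```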
Can you elaborate a bit more on what exactly is dynamic?
I can confirm increased memory consumption with Qwen/Qwen2.5-32B-Instruct-AWQ.
I have the same issue. When deploying the DeepSeek R1 32B distilled model, the GPU memory usage under the V1 engine is higher than under V0. This causes the V1 engine to run out of memory (OOM) under the same configuration, while V0 does not.
I have the same issue, and it is blocking me from using the V1 architecture. I tried the same configuration as with V0 and also tried reducing gpu-memory-utilization from 0.95 to 0.8, but with no luck; it always results in an OOM on startup. Also, the server takes extremely long to warm up compared to V0.
Proposal to improve performance
No response
Report of performance regression
Hardware: 4x RTX 3070 = 32GB VRAM
Issue: With 0.6.x I was able to run Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4 with a 12K context length. Now, with 0.7.0 + VLLM_USE_V1=1, I cannot push the context length higher than 3K without encountering a CUDA OOM error. Of course, I can reconfigure it to avoid the OOM; my question is: is V1 expected to consume more memory?
Some of the libraries:
VLLM command
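(The exact command is not reproduced above. Purely as an illustration of the setup described in this report, a hypothetical launch could look like the following: the model name, the 4-GPU tensor parallelism, and the 12K context length come from the report, while every other value is an assumption.)

```python
import os

# The V1 engine flag discussed in this issue; set before constructing the engine.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4",
    tensor_parallel_size=4,       # 4x RTX 3070 from the report
    max_model_len=12288,          # ~12K context length from the report
    gpu_memory_utilization=0.95,  # assumed value, not from the report
)
```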
Thanks
Misc discussion on performance
No response
Your current environment (if you think it is necessary)