[V1] Feedback Thread #12568
👍 I have not done a proper benchmark, but V1 feels superior, i.e. higher throughput and lower latency/TTFT. I have encountered a possible higher memory consumption issue, but am overall very pleased with the vLLM community's hard work on V1.
Does anyone know about this bug with n>1? Thanks
Logging is in progress. Current main has a lot more, and we will maintain compatibility with V0. Thanks!
Quick feedback [VLLM_USE_V1=1]:
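For context, V1 is opted into through the VLLM_USE_V1 environment variable. A minimal sketch of that setup, assuming the offline API and a placeholder model name:

```python
# Minimal sketch: opting into the V1 engine via the environment variable.
# Set it before vLLM spins up the engine; the model is a placeholder.
import os

os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```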
Thanks, both are in progress.
Are logprobs output (and specifically prompt logprobs with echo=True) expected to be working with the current V1 (0.7.0)?
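For reference, this is the kind of request being asked about, sketched with the offline API; prompt_logprobs is what echo=True surfaces through the OpenAI-compatible server. The model name is a placeholder, and whether V1 0.7.0 actually returns these is exactly the open question above.

```python
# Sketch of requesting logprobs for both generated and prompt tokens.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(
    max_tokens=16,
    logprobs=5,         # top-5 logprobs for each generated token
    prompt_logprobs=5,  # logprobs for the prompt tokens (echo-style)
)
out = llm.generate(["The capital of France is"], params)[0]
print(out.prompt_logprobs)      # per-prompt-token logprobs
print(out.outputs[0].logprobs)  # per-generated-token logprobs
```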
Maybe there is a better place to discuss this but the implementation for models that use more than one extra modality is quite non-intuitive.
Still in progress.
Thanks for fixing metrics logs in 0.7.1!
Either I'm going insane, or with V1 the Qwen 8B Instruct LLM just breaks in fp8 and around 25% of generations are just gibberish, with the same running code and everything. Do I need to file a bug report, or is this expected behaviour and I need some specific setup of sampling params for it to work in V1?
The V1 engine doesn't seem to support logits processors or min-p filtering. Issue #12678
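For comparison, this is what the V0 sampling API accepts and what the comment says V1 does not yet honor; the model and the toy processor are placeholders, not a workaround.

```python
# Sketch of V0-style min-p filtering plus a per-request logits processor.
from vllm import LLM, SamplingParams

def block_token_zero(token_ids, logits):
    # Toy logits processor: forbid token id 0 from being sampled.
    logits[0] = float("-inf")
    return logits

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(
    min_p=0.05,                            # min-p filtering (see #12678)
    logits_processors=[block_token_zero],  # per-request logits processor
    max_tokens=32,
)
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
```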
Something is weird with memory calculation in V1 and tensor parallel. Here are two cases that I tested recently. With vLLM 0.7.0 on 2x A6000, starting a 32b-awq model normally, everything works as previously and both GPUs get to ~44-46GB usage. Using V1, both GPUs load only ~24-25GB, and usage slowly climbs as inference runs; I've seen it go up to 32GB on each GPU. After updating to vLLM 0.7.1 and running a 7b-awq model this time, I also noticed that running the above command "normally" the logs show Maximum concurrency at 44x, while using V1 I get:
And finally, with vLLM 0.7.0 and 4x L4, loading a 32b-awq model with tp 4 works in "normal mode" but OOMs with V1.
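For readers trying to reproduce this, a sketch of the kind of tensor-parallel AWQ launch being compared, run with and without VLLM_USE_V1=1; the model path and utilization value are placeholders, and this is not a fix for the discrepancy.

```python
# Sketch of a 2-way tensor-parallel AWQ launch whose per-GPU memory usage
# can be compared between the default engine and VLLM_USE_V1=1.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # placeholder 32b-awq model
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)
```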
I did a little experiment with DeepSeek-R1 on 8x H200 GPUs. vLLM 0.7.0 showed the following results with
In general, vLLM without VLLM_USE_V1 looked more productive. I also tried V0 with
Throughput was still 2 times lower than SGLang in the same benchmark. Today I updated vLLM to the new version (0.7.1) and decided to repeat the experiment, and the results in V0 have become much better!
But running vLLM with
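For rough comparisons like this one, a minimal offline timing sketch (not the serving benchmark used above; the model and prompts are placeholders):

```python
# Time a batch of generations and report output tokens per second,
# once with VLLM_USE_V1=0 and once with VLLM_USE_V1=1.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=256)
prompts = ["Write a short story about a robot."] * 64

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s over {elapsed:.1f}s")
```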
V1 does not support the T4. Do you plan to support it?
Hi @bao231, V1 does not support T4 or older-generation GPUs since the kernel libraries used in V1 (e.g., flash-attn) do not support them.
Will V1 support other attention libraries? Do you have a plan for that? @WoosukKwon
Thanks!
Can you provide a more detailed reproduction instruction? cc @WoosukKwon
Thanks. We are actively working on PP (pipeline parallelism).
Check out #sig-multi-modality in our Slack! This is the best place for a discussion like this.
It's pretty hard to follow what you are seeing. Please attach:
Thanks!
Hi, please see the launch command:
I ran the following code after upgrading to the V1 version of vLLM and encountered an error. However, if --tensor_parallel_size is set to 1, it works fine. Is there a compatibility issue between V1 and multi-GPU model deployment?
With dual RTX 3090s in V1: "CUDA out of memory. Tried to allocate 594.00 MiB. GPU 0 has a total capacity of 23.48 GiB of which 587.38 MiB is free. Including non-PyTorch memory, this process has 22.89 GiB memory in use. Of the allocated memory 21.56 GiB is allocated by PyTorch, and 815.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation." With V0 it works; something changed about memory handling in V1.
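A sketch of the knobs usually tried for an OOM like this, combining the allocator hint from the error message with a smaller memory budget; the values are illustrative, not a confirmed fix for the V1 regression.

```python
# Allocator hint from the error message plus a smaller KV-cache budget.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # placeholder model
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,  # leave more headroom than the default
    max_model_len=8192,           # cap context to shrink the KV cache
)
```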
Will V1 support FlashInfer in the future?
Does V1 support FP8 (W8A8) quantization? I tried nm-testing/Qwen2-VL-7B-Instruct-FP8-dynamic on v0.7.1 with the V1 arch: no error was thrown, but I got gibberish results. The same code and model work properly on v0.7.1 with the V0 arch.
UPDATE: it works on v0.7.1 V1 arch in eager mode, but is broken on v0.7.1 V1 arch in torch.compile mode. I'm figuring out whether this problem is model-dependent or not.
UPDATE: I tried another model, nm-testing/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic, and the same bug presents on v0.7.1 V1 arch in torch.compile mode.
UPDATE: it works after I turned custom_ops on (change Lines 3237 to 3249 in 3ee696a).
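A sketch of the eager-mode check described in the comment: disabling the compiled/captured execution path to see whether the gibberish is tied to torch.compile. The model name is taken from the comment; this is a diagnostic step, not a fix.

```python
# Run the FP8 model with the compiled path disabled to isolate the bug.
from vllm import LLM

llm = LLM(
    model="nm-testing/Qwen2-VL-7B-Instruct-FP8-dynamic",
    enforce_eager=True,  # skip torch.compile / CUDA graph capture
)
```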
When I tested the fine-tuned Qwen2.5_VL_3B model service using V1 mode (by setting the environment variable VLLM_USE_V1=1) and the default mode in OpenAI-compatible mode, I found inconsistencies in the output results. I tested two samples. I conducted the same comparative experiment on Qwen2-VL, and both V1 and default modes produced correct outputs. Has anyone else encountered a similar issue? If so, could this indicate a compatibility issue between V1 mode and Qwen2.5_VL_3B?
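One way to tighten a comparison like this is to remove sampling noise entirely: run the same script under both modes with greedy decoding and a fixed seed, then diff the outputs. A sketch with a placeholder prompt (a real test would also pass the image inputs):

```python
# Run once with VLLM_USE_V1=1 and once with VLLM_USE_V1=0, then diff the
# two output files; greedy decoding removes sampling as a variable.
import os
from vllm import LLM, SamplingParams

mode = os.environ.get("VLLM_USE_V1", "0")
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", seed=0)  # placeholder path
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["Describe the bounding boxes in the image."], params)
with open(f"outputs_use_v1_{mode}.txt", "w") as f:
    f.write(outputs[0].outputs[0].text)
```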
cc @ywang96
@lyhh123 can you open a separate issue for this and share some examples? There are multiple layers, so I want to take a look at where the issue might be.
This is also an interesting observation since the V1 re-arch for multimodal models should be model-agnostic, so I'm curious to see where the problem comes from.
Thank you for paying attention to my issue. Two days ago, I encountered this problem during testing. Since then, I have made a series of attempts to adjust the sampling parameters, mainly by modifying top_p or other parameters to keep the output as stable as possible. I have now re-tested the Qwen2.5-VL-3B model in both V1 and default modes, and apart from content related to coordinates, the outputs have remained largely consistent. I attempted to adjust the parameters but was unable to reproduce the issue from two days ago. I still remember that, with fixed parameters at that time, there were unexpected differences across multiple outputs between the V1 and default modes, but I cannot rule out the possibility of other variables affecting the results. I will do my best to identify the root cause of the issue, and if I make any relevant discoveries, I will update you promptly.
@robertgshaw2-redhat Hi, can we now get higher generation token throughput with V1 than with V0 on DeepSeek-R1?
@imkero Is the bug fixed now (without the change you suggested)? I wasn't able to reproduce the bug with the latest main.
I have made a mistake. I found that it's
Please leave comments here about your usage of V1: does it work? Does it not work? Which features do you need in order to adopt it? Any bugs?
For bug reports, please file them separately and link the issue here.
For in-depth discussion, please feel free to join #sig-v1 in the vLLM Slack workspace.