
[Doc] Compatibility matrix for mutual exclusive features #8512

Merged: 2 commits, Oct 11, 2024

Conversation

@wallashss (Contributor) commented Sep 16, 2024

Greetings,

We did a study of mutually exclusive features in vLLM and consolidated the results into a compatibility matrix.

We propose adding the compatibility matrix to the documentation pages so users can quickly consult it when planning their implementation or study.

The table follows in markdown for quick checking and to help reviewers.

CC @njhill @maxdebayser

                       
Feature Chunked Prefill APC LoRa Prompt Adapter Speculative decoding CUDA Graphs Encoder/Decoder Logprobs Prompt Logprobs Async Output Multi-step
APC
LoRa ✗ [C]
Prompt Adapter
Speculative decoding ✗ [C] [T] ✗ [C]
CUDA Graphs
Encoder/Decoder ✗ [C] ✗ [C][T] ✗ [C] ✗ [C] ✗ [C][T] ✗ [C][T]
Logprobs
Prompt Logprobs ✗ [C] [T]
Async Output ✗ [C] ✅ [C] ✗ [C] [C]
Multi-step ✗ [C] ✗ [C] ✗ [C] ✗ [C] ✗ [C][T]
NVIDIA
CPU ✗ [C] ✗ [C] ✗ [C][T] ✗ [T] ✗ [C] ✗ [C] ✗ [C] ✗ [T]
AMD ✗ [C] ✗ [T]

[C] = link to where a check is made in the code and the error is reported
[T] = link to an open tracking issue or PR to address the incompatibility
✗ = combination not supported; ✅ = supported
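(For illustration, a "[C]" check is just a configuration-time guard that rejects a known-incompatible combination with a clear error. A minimal sketch, with a hypothetical EngineConfig and flags rather than the actual vLLM API, using LoRA + chunked prefill as the incompatible pair:)

```python
# Minimal sketch of a "[C]" check: a configuration-time guard that
# raises a clear error for a mutually exclusive feature combination.
# EngineConfig and its flags are hypothetical, not the real vLLM API.
from dataclasses import dataclass


@dataclass
class EngineConfig:
    enable_lora: bool = False
    enable_chunked_prefill: bool = False

    def verify(self) -> None:
        if self.enable_lora and self.enable_chunked_prefill:
            raise ValueError(
                "LoRA is not supported with chunked prefill yet.")


try:
    EngineConfig(enable_lora=True, enable_chunked_prefill=True).verify()
except ValueError as err:
    print(err)  # the "[C]" error a user would see
```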

@DarkLight1337 (Member) commented Sep 17, 2024

This should also cut down on the number of issues flagged as bugs when in fact the feature is not supported yet. Thanks for adding this!

Some comments:

@wallashss (Contributor, Author) commented:

Thanks for the feedback and the contribution on the multimodal feature @DarkLight1337 @ywang96 !

> Is it intended that you omitted the first row (chunked prefill)? It kinda bothers me that the matrix is asymmetric.

Yeah, kind of. The first time I added it, the row was empty, so I thought it would be a waste to include it. I've added it again so you can see it. If you think it looks nice, I have no problem keeping it.

> The table is getting quite long. I would split the table into two sections: one for core features and another one for hardware backends.

I discussed that before with a colleague, and we agreed to present it like this in a concise format. What do you think? Does it look that bad? I can discuss it again and review it.

> (By the way, you missed the TPU backend.)

Yeah, I know. There are some other devices that I did not add. Of course, some of them are low-hanging fruit. But I am not currently working with TPU and those other devices, and I don't have an easy setup to test them like I did for the others. It would be nice if someone could contribute this information.

BTW, do you know (or know someone who knows) the status of multimodal support on AMD and CPU? I assume it is supported on NVIDIA by default, but do you know the minimum compute capability for this feature?

Here is the updated table:

Feature Chunked Prefill APC LoRa Prompt Adapter Speculative decoding CUDA Graphs Encoder/Decoder Logprobs Prompt Logprobs Async Output Multi-step Multimodal
Chunked Prefill
APC
LoRa ✗ [C]
Prompt Adapter
Speculative decoding ✗ [C] [T] ✗ [C]
CUDA Graphs
Encoder/Decoder ✗ [C] ✗ [C][T] ✗ [C] ✗ [C] ✗ [C][T] ✗ [C][T]
Logprobs
Prompt Logprobs ✗ [C] [T]
Async Output ✗ [C] ✅ [C] ✗ [C] [C]
Multi-step ✗ [C] ✗ [C] ✗ [C] ✗ [C] ✗ [C][T]
Multimodal ✗ [T] ✗ [T] ✗ [T] ? ✗ [C] ?
NVIDIA
CPU ✗ [C] ✗ [C] ✗ [C][T] ✗ [T] ✗ [C] ✗ [C] ✗ [C] ✗ [T] ?
AMD ✗ [C] ✗ [T] ?

@DarkLight1337 (Member) commented:

> Yeah, kind of. The first time I added it, the row was empty, so I thought it would be a waste to include it. I've added it again so you can see it. If you think it looks nice, I have no problem keeping it.

Looks better now, thanks.

> I discussed that before with a colleague, and we agreed to present it like this in a concise format. What do you think? Does it look that bad? I can discuss it again and review it.

I'm not overly bothered by this so you can keep it as is if that's the plan.

> BTW, do you know (or know someone who knows) the status of multimodal support on AMD and CPU? I assume it is supported on NVIDIA by default, but do you know the minimum compute capability for this feature?

It's supported on both. This feature isn't tied to compute capability; it only depends on whether the model runner implements it.
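(As a minimal illustration of "depends on whether the model runner implements it", with entirely hypothetical class names, not the actual vLLM runners:)

```python
# Hypothetical illustration: feature support keyed off the model runner
# implementation rather than the GPU's compute capability.
class ModelRunner:
    supports_multimodal: bool = False


class MultimodalModelRunner(ModelRunner):
    supports_multimodal = True


def check_multimodal(runner: ModelRunner) -> None:
    if not runner.supports_multimodal:
        raise NotImplementedError(
            "This model runner does not implement multimodal inputs.")


check_multimodal(MultimodalModelRunner())  # passes regardless of GPU arch
```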

@njhill (Member) commented Sep 19, 2024

Thanks a lot for this @wallashss! IMO we should get the first pass of this added to the docs while we work on adding more rows/columns, and make more folks aware of it.

> Is it intended that you omitted the first row (chunked prefill)? It kinda bothers me that the matrix is asymmetric.
>
> Yeah, kind of. The first time I added it, the row was empty, so I thought it would be a waste to include it. I've added it again so you can see it. If you think it looks nice, I have no problem keeping it.

Yeah, it was my suggestion to keep the top-right triangle empty so that we don't have a bunch of duplicate entries (which could also become out of sync if we updated one and not its reflection). But it does make it harder to look at one row/column at a time and see all the features it works with... you have to follow an L shape. Not really sure which is better.

> I discussed that before with a colleague, and we agreed to present it like this in a concise format. What do you think? Does it look that bad? I can discuss it again and review it.
>
> I'm not overly bothered by this so you can keep it as is if that's the plan.

@wallashss assuming I'm the colleague, I'm fine with having it split into a separate table :), but it would still be good for the columns to be the same and lined up (not sure whether that would be possible with markdown).

Some other features we might want to add:

  • Beam search
  • best-of-n generation
  • Guided decoding

We could also shorten some of the other names to make things fit better (and include a legend at the bottom if needed), like "Spec decoding" or even "SD".

@K-Mistele (Contributor) commented Sep 19, 2024

Would it be possible to add a column for prefix caching? I have found it does not work on Volta-architecture CUDA devices (NVIDIA Tesla V100) due to some Triton operation being unsupported on the older device.

Possibly also rows for different CUDA compute capabilities: some features work on Hopper and Ada but not Ampere, or on Ampere and newer but not Volta.
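(A minimal sketch of such a per-architecture gate, assuming a feature that needs at least Ampere, compute capability 8.0; the threshold is illustrative, while torch.cuda.get_device_capability is the actual PyTorch call:)

```python
# Sketch of a compute-capability gate for an architecture-dependent
# feature. The (8, 0) Ampere threshold is an assumed example; the real
# per-feature requirements may differ.
import torch

MIN_CAPABILITY = (8, 0)  # hypothetical minimum (Ampere) for the feature


def check_capability(device: int = 0) -> None:
    major, minor = torch.cuda.get_device_capability(device)
    if (major, minor) < MIN_CAPABILITY:
        raise RuntimeError(
            f"Feature requires compute capability >= "
            f"{MIN_CAPABILITY[0]}.{MIN_CAPABILITY[1]}, but "
            f"{torch.cuda.get_device_name(device)} has {major}.{minor}."
        )
```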

@njhill (Member) commented Sep 19, 2024

@K-Mistele APC is (automatic) prefix caching :)

Agreed, expanding with specific compute capabilities would be good!

@wallashss (Contributor, Author) commented Sep 20, 2024

Thanks for the feedback @K-Mistele.

> I have found it does not work on Volta-architecture CUDA devices (NVIDIA Tesla V100) due to some Triton operation being unsupported on the older device.

How did you figure that out? Does vLLM raise an exception or log a warning with a clear message, or did it just run "fine" and you knew something was wrong? The original idea was to split by NVIDIA architecture, but from my research, at least for those features, I thought it might not be necessary.

@K-Mistele (Contributor) commented:

> Thanks for the feedback @K-Mistele.
>
> > I have found it does not work on Volta-architecture CUDA devices (NVIDIA Tesla V100) due to some Triton operation being unsupported on the older device.
>
> How did you figure that out? Does vLLM raise an exception or log a warning with a clear message, or did it just run "fine" and you knew something was wrong? The original idea was to split by NVIDIA architecture, but from my research, at least for those features, I thought it might not be necessary.

It raised an exception about an unsupported Triton operation or something like that. I can reproduce it if you'd like to see the specific error, although I don't have a screenshot of it right this second. I think this happens with chunked prefill too.

@wallashss (Contributor, Author) commented:

> It raised an exception about an unsupported Triton operation or something like that. I can reproduce it if you'd like to see the specific error, although I don't have a screenshot of it right this second. I think this happens with chunked prefill too.

Wow, if you could paste the errors here that would be really nice! Thanks.

@wallashss (Contributor, Author) commented:

UPDATE:

Tested on the Turing architecture (compute capability 7.5), and both APC and chunked prefill worked fine. @K-Mistele, did you have any chance to reproduce the error on Volta?

@K-Mistele (Contributor) commented:

> It raised an exception about an unsupported Triton operation or something like that. I can reproduce it if you'd like to see the specific error, although I don't have a screenshot of it right this second. I think this happens with chunked prefill too.
>
> Wow, if you could paste the errors here that would be really nice! Thanks.

Can do! I ran into the issue on a 32 GB Tesla V100, which is a Volta device, not Turing.

@pooyadavoodi (Contributor) commented:

It may be good to remind PR authors to update the compatibility matrix when relevant (e.g., via the PR checklist), just to make sure it remains up to date.

@K-Mistele (Contributor) commented Oct 3, 2024

> It raised an exception about an unsupported Triton operation or something like that. I can reproduce it if you'd like to see the specific error, although I don't have a screenshot of it right this second. I think this happens with chunked prefill too.
>
> Wow, if you could paste the errors here that would be really nice! Thanks.
>
> Can do! I ran into the issue on a 32 GB Tesla V100, which is a Volta device, not Turing.

BTW @wallashss, I found open PRs with more detailed information on the chunked prefill and APC issues on Volta devices:

For chunked prefill on NVIDIA Tesla V100 (Volta):

@wallashss (Contributor, Author) commented:

Thanks for that, @K-Mistele!

This will be very useful; I'm going to update the table and propose a new one very soon.

@K-Mistele (Contributor) commented:

> Thanks for that, @K-Mistele!
>
> This will be very useful; I'm going to update the table and propose a new one very soon.

No problem! I haven't gotten around to reproducing the APC issue, but IIRC it was similar.

@tjtanaa (Contributor) commented Oct 7, 2024

FYI, as additional info for the compatibility matrix: as of commit cb3b2b9, the AMD multi-step feature is working; however, the combination of multi-step + chunked prefill is only supported on CUDA.

Related issue: #9111 (comment)
Related PR: #9038

@njhill (Member) commented Oct 8, 2024

@wallashss this can be added for LoRA + chunked prefill: #9057

Re @pooyadavoodi's comment above, perhaps you could include an addition to the pull request template in this same PR: https://github.com/vllm-project/vllm/blob/main/.github/PULL_REQUEST_TEMPLATE.md

@wallashss (Contributor, Author) commented:

Hey everyone,

Thank you for your contributions and feedback.

I did a major update to the matrix:

  • Split it into two matrices: feature × feature and feature × hardware. For the hardware matrix, I also split NVIDIA by architecture due to the evidence of unsupported features on Volta (thanks @K-Mistele for the help with that).
  • Tried to shrink the tables a lot to make them fit on the screen. I added some CSS to decrease the font size in order to do that (I hope nobody minds). I abbreviated some names, and to help identify them I added links to each feature's documentation or tooltips that appear on mouse hover.
  • Removed the [C] and [T] labels, because I thought they were hurting the readability of the table. Instead I only kept the ✗ mark, with a link to the issue (if it has one).
  • For the [C] labels (code-check labels), I went to the source code they pointed to and added a comment with a reminder to update the compatibility matrix if the combo becomes valid (see the sketch below). I guess this solves the problem of keeping the reference (permalink) updated and helps contributors/reviewers check in place that there is something more to update. What do you think @njhill?
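(Roughly the pattern, as a sketch; the feature flags and the check below are placeholders, not a specific vLLM file:)

```python
# Roughly the reminder pattern placed next to each [C]-style check.
# The feature flags and the check itself are illustrative placeholders.
def verify_feature_combo(feature_a: bool, feature_b: bool) -> None:
    if feature_a and feature_b:
        # Reminder: please update the compatibility matrix in the docs
        # if this feature combination becomes valid.
        raise ValueError("Feature A is not compatible with feature B.")
```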
Screenshots: (two images of the updated feature × feature and feature × hardware tables)

@njhill added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Oct 11, 2024
@njhill (Member) commented Oct 11, 2024

Thanks @wallashss for all of the hard work on this! Let's get it merged and we can make other adjustments as follow-ons.

@simon-mo merged commit 8baf85e into vllm-project:main on Oct 11, 2024; 72 of 74 checks passed.