
[Doc] Compatibility matrix for mutual exclusive features #8512

Merged: 2 commits, Oct 11, 2024

Conversation

@wallashss (Contributor) commented Sep 16, 2024

Greetings,

We did a study of mutually exclusive features in vLLM and consolidated the results into a compatibility matrix.

We propose adding the compatibility matrix to the documentation pages so users can quickly consult it when planning their implementation or study.

The table follows in markdown for quick checking and to help reviewers.

CC @njhill @maxdebayser

                       
Feature Chunked Prefill APC LoRa Prompt Adapter Speculative decoding CUDA Graphs Encoder/Decoder Logprobs Prompt Logprobs Async Output Multi-step
APC
LoRa ✗ [C]
Prompt Adapter
Speculative decoding ✗ [C] [T] ✗ [C]
CUDA Graphs
Encoder/Decoder ✗ [C] ✗ [C][T] ✗ [C] ✗ [C] ✗ [C][T] ✗ [C][T]
Logprobs
Prompt Logprobs ✗ [C] [T]
Async Output ✗ [C] ✅ [C] ✗ [C] [C]
Multi-step ✗ [C] ✗ [C] ✗ [C] ✗ [C] ✗ [C][T]
NVIDIA
CPU ✗ [C] ✗ [C] ✗ [C][T] ✗ [T] ✗ [C] ✗ [C] ✗ [C] ✗ [T]
AMD ✗ [C] ✗ [T]

[C] = link to where a check is made in the code and the error is reported
[T] = link to an open tracking issue or PR to address the incompatibility
✗ = combination not supported; ✅ = supported
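(For illustration, a "[C]" check is just a configuration-time guard that rejects a known-incompatible combination with a clear error. A minimal sketch, with a hypothetical EngineConfig and flags rather than the actual vLLM API, using LoRA + chunked prefill as the incompatible pair:)

```python
# Minimal sketch of a "[C]" check: a configuration-time guard that
# raises a clear error for a mutually exclusive feature combination.
# EngineConfig and its flags are hypothetical, not the real vLLM API.
from dataclasses import dataclass


@dataclass
class EngineConfig:
    enable_lora: bool = False
    enable_chunked_prefill: bool = False

    def verify(self) -> None:
        if self.enable_lora and self.enable_chunked_prefill:
            raise ValueError(
                "LoRA is not supported with chunked prefill yet.")


try:
    EngineConfig(enable_lora=True, enable_chunked_prefill=True).verify()
except ValueError as err:
    print(err)  # the "[C]" error a user would see
```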

@DarkLight1337 (Member) commented Sep 17, 2024

This should also cut down on the number of issues flagged as bugs when in fact the feature is not supported yet. Thanks for adding this!

Some comments:

@wallashss (Contributor, Author) commented:

Thanks for the feedback and the contribution on the multimodal feature @DarkLight1337 @ywang96 !

> Is it intended that you omitted the first row (chunked prefill)? It kinda bothers me that the matrix is asymmetric.

Yeah, kind of. The first time I added it, the row was empty, so I thought it would be a waste to include it. I've added it again so you can see it. If you think it looks nice, I have no problem keeping it.

> The table is getting quite long. I would split the table into two sections: one for core features and another one for hardware backends.

I discussed that before with a colleague, and we agreed to present it like this in a concise format. What do you think? Does it look that bad? I can discuss it again and review it.

> (By the way, you missed the TPU backend.)

Yeah, I know. There are some other devices that I did not add. Of course, some of them are low-hanging fruit. But I am not currently working with TPU and those other devices, and I don't have an easy setup to test them like I did for the others. It would be nice if someone could contribute this information.

BTW, do you know (or know someone who knows) the status of multimodal support on AMD and CPU? I assume it is supported on NVIDIA by default, but do you know the minimum compute capability for this feature?

Here is the updated table:

Feature Chunked Prefill APC LoRa Prompt Adapter Speculative decoding CUDA Graphs Encoder/Decoder Logprobs Prompt Logprobs Async Output Multi-step Multimodal
Chunked Prefill
APC
LoRa ✗ [C]
Prompt Adapter
Speculative decoding ✗ [C] [T] ✗ [C]
CUDA Graphs
Encoder/Decoder ✗ [C] ✗ [C][T] ✗ [C] ✗ [C] ✗ [C][T] ✗ [C][T]
Logprobs
Prompt Logprobs ✗ [C] [T]
Async Output ✗ [C] ✅ [C] ✗ [C] [C]
Multi-step ✗ [C] ✗ [C] ✗ [C] ✗ [C] ✗ [C][T]
Multimodal ✗ [T] ✗ [T] ✗ [T] ? ✗ [C] ?
NVIDIA
CPU ✗ [C] ✗ [C] ✗ [C][T] ✗ [T] ✗ [C] ✗ [C] ✗ [C] ✗ [T] ?
AMD ✗ [C] ✗ [T] ?

@DarkLight1337 (Member) commented:

> Yeah, kind of. The first time I added it, the row was empty, so I thought it would be a waste to include it. I've added it again so you can see it. If you think it looks nice, I have no problem keeping it.

Looks better now, thanks.

> I discussed that before with a colleague, and we agreed to present it like this in a concise format. What do you think? Does it look that bad? I can discuss it again and review it.

I'm not overly bothered by this so you can keep it as is if that's the plan.

> BTW, do you know (or know someone who knows) the status of multimodal support on AMD and CPU? I assume it is supported on NVIDIA by default, but do you know the minimum compute capability for this feature?

It's supported on both. This feature isn't tied to compute capability; it only depends on whether the model runner implements it.
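(As a minimal illustration of "depends on whether the model runner implements it", with entirely hypothetical class names, not the actual vLLM runners:)

```python
# Hypothetical illustration: feature support keyed off the model runner
# implementation rather than the GPU's compute capability.
class ModelRunner:
    supports_multimodal: bool = False


class MultimodalModelRunner(ModelRunner):
    supports_multimodal = True


def check_multimodal(runner: ModelRunner) -> None:
    if not runner.supports_multimodal:
        raise NotImplementedError(
            "This model runner does not implement multimodal inputs.")


check_multimodal(MultimodalModelRunner())  # passes regardless of GPU arch
```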

@njhill (Member) commented Sep 19, 2024

Thanks a lot for this @wallashss! IMO we should get the first pass of this added to the docs while we work on adding more rows/columns, and make more folks aware of it.

> Is it intended that you omitted the first row (chunked prefill)? It kinda bothers me that the matrix is asymmetric.
>
> Yeah, kind of. The first time I added it, the row was empty, so I thought it would be a waste to include it. I've added it again so you can see it. If you think it looks nice, I have no problem keeping it.

Yeah, it was my suggestion to keep the top-right triangle empty so that we don't have a bunch of duplicate entries (which could also become out of sync if we updated one and not its reflection). But it does make it harder to look at one row/column at a time and see all the features it works with... you have to follow an L shape. Not really sure which is better.

> I discussed that before with a colleague, and we agreed to present it like this in a concise format. What do you think? Does it look that bad? I can discuss it again and review it.
>
> I'm not overly bothered by this so you can keep it as is if that's the plan.

@wallashss assuming I'm the colleague, I'm fine with having it split into a separate table :), but it would still be good for the columns to be the same and lined up (not sure whether that would be possible with markdown).

Some other features we might want to add:

  • Beam search
  • best-of-n generation
  • Guided decoding

We could also shorten some of the other names to make things fit better (and include a legend at the bottom if needed), like "Spec decoding" or even "SD".

@K-Mistele (Contributor) commented Sep 19, 2024

Would it be possible to add a column for prefix caching? I have found it does not work on Volta-architecture CUDA devices (NVIDIA Tesla V100) due to some Triton operation being unsupported on the older device.

Possibly also rows for different CUDA compute capabilities: some features work on Hopper and Ada but not Ampere, or on Ampere and newer but not Volta.
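(A minimal sketch of such a per-architecture gate, assuming a feature that needs at least Ampere, compute capability 8.0; the threshold is illustrative, while torch.cuda.get_device_capability is the actual PyTorch call:)

```python
# Sketch of a compute-capability gate for an architecture-dependent
# feature. The (8, 0) Ampere threshold is an assumed example; the real
# per-feature requirements may differ.
import torch

MIN_CAPABILITY = (8, 0)  # hypothetical minimum (Ampere) for the feature


def check_capability(device: int = 0) -> None:
    major, minor = torch.cuda.get_device_capability(device)
    if (major, minor) < MIN_CAPABILITY:
        raise RuntimeError(
            f"Feature requires compute capability >= "
            f"{MIN_CAPABILITY[0]}.{MIN_CAPABILITY[1]}, but "
            f"{torch.cuda.get_device_name(device)} has {major}.{minor}."
        )
```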

@njhill (Member) commented Sep 19, 2024

@K-Mistele APC is (automatic) prefix caching :)

Agreed, expanding with specific compute capabilities would be good!

@wallashss (Contributor, Author) commented Sep 20, 2024

Thanks for the feedback @K-Mistele.

> I have found it does not work on Volta-architecture CUDA devices (NVIDIA Tesla V100) due to some Triton operation being unsupported on the older device.

How did you figure that out? Does vLLM raise an exception or log a warning with a clear message, or did it just run "fine" and you knew something was wrong? The original idea was to split by NVIDIA architecture, but from my research, at least for those features, I thought it might not be necessary.

@K-Mistele (Contributor) commented:

> Thanks for the feedback @K-Mistele.
>
> > I have found it does not work on Volta-architecture CUDA devices (NVIDIA Tesla V100) due to some Triton operation being unsupported on the older device.
>
> How did you figure that out? Does vLLM raise an exception or log a warning with a clear message, or did it just run "fine" and you knew something was wrong? The original idea was to split by NVIDIA architecture, but from my research, at least for those features, I thought it might not be necessary.

It raised an exception about an unsupported Triton operation or something like that. I can reproduce it if you'd like to see the specific error, although I don't have a screenshot of it right this second. I think this happens with chunked prefill too.

@wallashss (Contributor, Author) commented:

> It raised an exception about an unsupported Triton operation or something like that. I can reproduce it if you'd like to see the specific error, although I don't have a screenshot of it right this second. I think this happens with chunked prefill too.

Wow, if you could paste the errors here that would be really nice! Thanks.

@wallashss (Contributor, Author) commented:

UPDATE:

Tested on the Turing architecture (compute capability 7.5), and both APC and chunked prefill worked fine. @K-Mistele, did you have any chance to reproduce the error on Volta?

@K-Mistele (Contributor) commented:

> It raised an exception about an unsupported Triton operation or something like that. I can reproduce it if you'd like to see the specific error, although I don't have a screenshot of it right this second. I think this happens with chunked prefill too.
>
> Wow, if you could paste the errors here that would be really nice! Thanks.

Can do! I ran into the issue on a 32 GB Tesla V100, which is a Volta device, not Turing.

@pooyadavoodi (Contributor) commented:

It may be good to remind PR authors to update the compatibility matrix when relevant (e.g., via the PR checklist), just to make sure it remains up to date.

@K-Mistele (Contributor) commented Oct 3, 2024

> It raised an exception about an unsupported Triton operation or something like that. I can reproduce it if you'd like to see the specific error, although I don't have a screenshot of it right this second. I think this happens with chunked prefill too.
>
> Wow, if you could paste the errors here that would be really nice! Thanks.
>
> Can do! I ran into the issue on a 32 GB Tesla V100, which is a Volta device, not Turing.

BTW @wallashss, I found open PRs with more detailed information on the chunked prefill and APC issues on Volta devices:

For chunked prefill on NVIDIA Tesla V100 (Volta):

@wallashss (Contributor, Author) commented:

Thanks for that, @K-Mistele!

This will be very useful; I'm going to update the table and propose a new one very soon.

@K-Mistele (Contributor) commented:

> Thanks for that, @K-Mistele!
>
> This will be very useful; I'm going to update the table and propose a new one very soon.

No problem! I haven't gotten around to reproducing the APC issue, but IIRC it was similar.

@tjtanaa (Contributor) commented Oct 7, 2024

FYI, as additional info for the compatibility matrix: as of commit cb3b2b9, the AMD multi-step feature is working; however, the combination of multi-step + chunked prefill is only supported on CUDA.

Related issue: #9111 (comment)
Related PR: #9038

@njhill (Member) commented Oct 8, 2024

@wallashss this can be added for LoRA + chunked prefill: #9057

Re @pooyadavoodi's comment above, perhaps you could include an addition to the pull request template in this same PR: https://github.com/vllm-project/vllm/blob/main/.github/PULL_REQUEST_TEMPLATE.md

@wallashss (Contributor, Author) commented:

Hey everyone,

Thank you for your contributions and feedback.

I did a major update to the matrix:

  • Split it into two matrices: feature × feature and feature × hardware. For the hardware matrix, I also split NVIDIA by architecture due to the evidence of unsupported features on Volta (thanks @K-Mistele for the help with that).
  • Tried to shrink the tables a lot to make them fit on the screen. I added some CSS to decrease the font size in order to do that (I hope nobody minds). I abbreviated some names, and to help identify them I added links to each feature's documentation or tooltips that appear on mouse hover.
  • Removed the [C] and [T] labels, because I thought they were hurting the readability of the table. Instead I only kept the ✗ mark, with a link to the issue (if it has one).
  • For the [C] labels (code-check labels), I went to the source code they pointed to and added a comment with a reminder to update the compatibility matrix if the combo becomes valid (see the sketch below). I guess this solves the problem of keeping the reference (permalink) updated and helps contributors/reviewers check in place that there is something more to update. What do you think @njhill?
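(Roughly the pattern, as a sketch; the feature flags and the check below are placeholders, not a specific vLLM file:)

```python
# Roughly the reminder pattern placed next to each [C]-style check.
# The feature flags and the check itself are illustrative placeholders.
def verify_feature_combo(feature_a: bool, feature_b: bool) -> None:
    if feature_a and feature_b:
        # Reminder: please update the compatibility matrix in the docs
        # if this feature combination becomes valid.
        raise ValueError("Feature A is not compatible with feature B.")
```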
Screenshots: (two images of the updated feature × feature and feature × hardware tables)

@njhill added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Oct 11, 2024
@njhill (Member) commented Oct 11, 2024

Thanks @wallashss for all of the hard work on this! Let's get it merged and we can make other adjustments as follow-ons.

@simon-mo merged commit 8baf85e into vllm-project:main on Oct 11, 2024; 72 of 74 checks passed.