[RFC]: Encoder/decoder models & feature compatibility #7366
Comments
Looks like footnote references (e.g. [^1], [^2], etc.) are not rendered for RFCs on GitHub, so I just edited the RFC to replace all of the footnote references with direct links.
cc @mgoin
FYI, I added a section to the RFC about adding multi-step scheduling + encoder/decoder support.
Sorry for mentioning this so late. For multimodal models, there are actually two ways to apply cross-attention:
I wonder how the current plan could handle the latter case. To keep the API consistent, we should distinguish between the above two cases internally when multimodal data is passed.
Tagging #8811 for future reference
I can start looking into T5 support starting from the custom attention bias 🤞🏻
Thanks @NickLucche - this would be greatly appreciated
Hi @NickLucche, are you looking at any particular kernel for adding the custom bias needed for T5? FYI I am trying to add support to make the encoder-decoder models work with the flash_attn kernel.
@sroy745 - I think it’s probably easiest to do it with the paged attention backend, but doing it in flash_attn would be best.
For us the most important thing is having something functional for T5, so I was also thinking the xFormers option might be preferable/easier for the initial support.
Thanks for the quick feedback! FlashAttention would be super interesting to explore, but will likely require PRs to https://github.com/vllm-project/flash-attention or to the original repo. Even then I am not sure whether a fully materialized bias matrix option would be accepted there, while the fused approach would be too specific to T5 to live there. Personally I would like to start with the "less-efficient"/easier approach first (fully materialized bias) and leave optimization for later, unless there's a more mature implementation we can somehow integrate here.
Why isn't LoRA included as an LTS feature? What considerations are there behind this decision? I've recently been working on supporting LoRA for |
Motivation #
There is significant interest in vLLM supporting encoder/decoder models. Issues #187 and #180, for example, request encoder/decoder model support. As a result, encoder/decoder support was recently introduced to vLLM via the following three PRs:
These three PRs make encoder/decoder model inference possible; however, they leave more to be desired in terms of (1) parity between vLLM's decoder-only & encoder/decoder request processing pipelines with respect to feature support, and (2) the number of encoder/decoder models which are supported.
The ask for the vLLM community is to contribute PRs which help bring vLLM encoder/decoder functionality to a similar level of maturity as that of vLLM's decoder-only functionality.
Proposed changes #
The support matrix below summarizes which encoder/decoder models have already been added & which features are currently compatible with the vLLM encoder/decoder pipeline, versus which features & models will require additional PRs to implement in the long-term:
(Issue #7447)
This RFC gives an overview of those features & models which are not compatible with encoder/decoder currently, but which should be made compatible eventually (i.e. No in the second column, Yes in the third column in the support matrix.)
Note that there are features (automatic prefix caching/sliding window/chunked prefill/LoRA) which are not long-term compatibility goals.
Background #
Before continuing, it will be helpful to review the details of the new vLLM encoder/decoder infrastructure.
It will also be helpful to review this how-to guide for adding new encoder/decoder models & improving encoder/decoder feature compatibility.
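As a quick orientation to the new infrastructure, here is a hedged sketch of an explicit encoder/decoder request using BART as a stand-in model; the exact prompt fields and model choice are illustrative and may differ across vLLM versions:

```python
from vllm import LLM, SamplingParams

# Illustrative only: BART is used as a stand-in encoder/decoder model, and
# the explicit encoder/decoder prompt fields below follow the infrastructure
# described in this RFC; details may differ across vLLM versions.
llm = LLM(model="facebook/bart-large-cnn")

prompt = {
    "encoder_prompt": "vLLM recently gained encoder/decoder support, "
                      "enabling models such as BART.",
    "decoder_prompt": "",  # decoder starts from its decoder-start token
}

sampling_params = SamplingParams(temperature=0.0, max_tokens=30)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
```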
Initial goal #
Members of the vLLM contributor community identify models/features in the support matrix above, for which they will work on writing a PR.
Detailed long-term goals #
Add new models to vLLM #
Please review the how-to guide for adding new models to vLLM
See `tests/models/test_bart.py` for an example of an encoder/decoder model unit test. See `tests/distributed/test_basic_distributed_correctness_enc_dec.py` for an example of an encoder/decoder model test with TP > 1.

Add Whisper model #
Steps to add support for Whisper, a multimodal encoder/decoder speech recognition model:
- Add unit tests under `tests/models/`
Proposal: consider whether or not it makes sense to implement encoder/decoder multimodality, audio support, and Whisper in the same PR; that way, the Whisper model may be used to facilitate an end-to-end test of audio multimodality.
Add T5 model #
Note: T5 depends on custom attention bias being supported by at least one of the attention backends which also supports encoder attention & cross-attention; at time of writing no vLLM attention backend fulfills this requirement. The vLLM XFormers attention backend is the only backend which supports encoder/decoder models but neither it nor any other vLLM attention backend supports custom attention bias. (Custom attention bias is required in order to support T5 relative positional encoding.)
Steps to add support for the T5 model:
- Add unit tests under `tests/models/`
Note: T5 was added to an older version of vLLM in #3117 , which could be a helpful starting-point
Add other encoder/decoder models #
Quantization #
The goal of this workstream is to make sure that quantization + encoder/decoder models is fully-tested, and to fill in any gaps (should they exist) in vLLM's support for quantized encoder/decoder models.
Steps to ensure that vLLM supports encoder/decoder models in combination with all existing vLLM quantization methods:
vLLM encoder/decoder infrastructure should be compatible with most of the existing vLLM quantization methods, because the specialized quantization kernels are only employed for GEMM operations involving the learned weight matrices ($W_q$, $W_k$, etc.), whereas the encoder/decoder work really only modifies how the `Attention(q, k, v, kv_cache)` layer behaves & does not impact the learned weight matrices at all.

It is less clear whether vLLM encoder/decoder infrastructure is compatible with FP8, since a specialized quantized KV cache kernel is employed by the `Attention(q, k, v, kv_cache)` layer when FP8 quantization is employed.
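As a rough illustration of the kind of check this workstream calls for, the sketch below compares greedy outputs for an encoder/decoder model with and without FP8 KV-cache quantization. The model name, prompt fields, and the expectation of near-identical outputs are assumptions for illustration; per the discussion above, this combination is not guaranteed to work today.

```python
from vllm import LLM, SamplingParams

# Hypothetical compatibility check: run the same encoder/decoder prompt with
# and without FP8 KV-cache quantization and compare the greedy outputs.
prompt = {
    "encoder_prompt": "vLLM recently gained encoder/decoder model support.",
    "decoder_prompt": "",
}
params = SamplingParams(temperature=0.0, max_tokens=20)

baseline = LLM(model="facebook/bart-large-cnn").generate(prompt, params)
fp8_kv = LLM(model="facebook/bart-large-cnn",
             kv_cache_dtype="fp8").generate(prompt, params)

# For a well-supported combination we would expect (near-)identical text.
print(baseline[0].outputs[0].text)
print(fp8_kv[0].outputs[0].text)
```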
Support encoder/decoder multimodality #

Technically, vLLM already supports multimodality for models which have an "encoder" and a "decoder", e.g. Llava. However, Llava's decoder does not utilize cross-attention & the model is basically compatible with vLLM's pre-existing decoder-only infrastructure.

But critically, for encoder/decoder models with cross-attention such as Whisper, vLLM does not currently support multimodality of any sort. The processing pipeline does not extract or utilize multimodal data from the input prompt, and the `EncoderDecoderModelRunner` has an assert which fails if the multimodal config is not `None`. Addressing this is what is meant by "supporting encoder/decoder multimodality".

Steps to extend existing vLLM multimodality support to encoder/decoder models:
- Remove the assertion which fails when the multimodal config is not `None` (see `assert_enc_dec_mr_supported_scenario()` in `vllm/worker/utils.py`)

There are a number of multimodal encoder/decoder models which will benefit from this feature. One possibility is to add multimodality support & a multimodal model such as Whisper in the same PR, so that Whisper may be used to facilitate an end-to-end test with multimodality.
Another possibility is to implement multimodality support in its own PR.
Considerations for designing multimodal encoder/decoder prompt formats #
One approach to designing the vLLM multimodal encoder/decoder prompt formats is to consider what we want the user experience to be for high-priority multimodal encoder/decoder models such as Whisper.
Initial proposal for multimodal encoder/decoder prompt formats
It may be helpful to review the singleton prompt formats (`TextPrompt`, `TokensPrompt`) as well as `ExplicitEncoderDecoderPrompt`s, and the `multi_modal_data` field here; also review the vLLM documentation on multimodality.

Generally speaking, in encoder/decoder models based on cross-attention, the non-text input modality is passed to the encoder as input. Conversely, any text prompt is typically passed to the decoder as an input prompt.
The following two encoder/decoder multimodal prompt formats are tentatively proposed:
1. Singleton `TextPrompt` with a `multi_modal_data` field: extract the `multi_modal_data` and pass it to the encoder module. For example, such a `TextPrompt` could be passed to vLLM BART (see the sketch after this list).

2. Singleton `TokensPrompt` with a `multi_modal_data` field: extract the `multi_modal_data` and pass it to the encoder module. For example, such a `TokensPrompt` (carrying pre-tokenized text) could be passed to vLLM BART.
It may also be worth considering whether or how to support `ExplicitEncoderDecoderPrompt`s with multimodality.
s with multimodalityAdd support for encoder attention and cross-attention to additional backends #
At time of writing, XFormers is the only vLLM attention backend which supports encoder attention & cross-attention.
The goal of this workstream would be to extend encoder attention & cross-attention support to additional backends, the highest-priority being flash-attention and flashinfer.
Reviewing encoder attention and cross-attention support in the XFormers backend would be a good starting-point for extending support to other backends.
For context on the requirements for a backend to support encoder and cross-attention, it may help to review the encoder/decoder architecture, the way that attention masks are currently constructed in the XFormers backend, and the recommended architecture for vLLM encoder/decoder models.
A summary of the key changes required for an attention backend to support encoder attention and cross-attention:
- The backend's `AttentionMetadata` subclass must support fields for encoder sequence lengths, encoder sequence token count, cross-attention block tables, and cross-attention slot mapping. XFormers examples: the `AttentionMetadata` subclass' encoder field declarations, the `prefill_metadata()` method, and the `decode_metadata()` method.
- The `forward()` method of the backend implementation must accept an `attn_type` argument of type `AttentionType`, which allows choosing between encoder attention, decoder attention, or encoder/decoder cross-attention. XFormers example.
- The backend implementation must branch on `attn_type` and adjust accordingly in terms of (1) how it utilizes `attn_metadata` when invoking the attention kernels (review the XFormers `forward()` for context), and (2) the choice of causal or non-causal attention, as well as the choice of attention mask shape (XFormers example). A simplified sketch of this branching follows the list.
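To make the last point concrete, here is a simplified, self-contained PyTorch sketch of how a backend's `forward()` might branch on the attention type. The enum and mask handling below are illustrative stand-ins for vLLM's actual `AttentionType`/backend code, not the real implementation:

```python
from enum import Enum, auto

import torch


class AttnType(Enum):
    # Simplified stand-in for vLLM's AttentionType enum.
    ENCODER = auto()          # encoder self-attention (non-causal)
    DECODER = auto()          # decoder self-attention (causal)
    ENCODER_DECODER = auto()  # cross-attention (non-causal, K/V from encoder)


def backend_forward(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                    attn_type: AttnType) -> torch.Tensor:
    """Toy forward() showing how a backend might branch on attn_type.

    q: [heads, q_len, dim]; k, v: [heads, kv_len, dim]. A real backend would
    also select different metadata (encoder seq lens, cross-attention block
    tables / slot mapping, etc.) at this point.
    """
    scores = torch.matmul(q, k.transpose(-1, -2)) * q.shape[-1] ** -0.5
    if attn_type is AttnType.DECODER:
        # Causal mask: queries may not attend to future keys.
        q_len, kv_len = q.shape[1], k.shape[1]
        causal = torch.ones(q_len, kv_len, dtype=torch.bool).tril()
        scores = scores.masked_fill(~causal, float("-inf"))
    # ENCODER and ENCODER_DECODER use non-causal attention; for cross-
    # attention, kv_len (the encoder length) may differ from q_len.
    return torch.softmax(scores, dim=-1) @ v
```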
Initial goals

The `__init__()` method of `EncoderDecoderModelRunner` invokes a method, `EncoderDecoderModelRunner._maybe_force_supported_attention_backend()` (defined here), which (1) attempts to force encoder/decoder models to use the XFormers attention backend, and (2) raises an exception if the user has overridden the attention backend to be anything other than XFormers.

Long-term goals
Support custom attention bias #
Note: T5 takes a dependency on custom attention bias. Custom attention bias is likely complex enough to merit its own PR.
Note: custom bias support was added to `PagedAttention` in an older version of vLLM as part of #3117; given changes in vLLM since then, additional work would be required to integrate this implementation.

Custom attention bias and relative positional encoding
Attention bias refers to adding a matrix $A$ to the scaled dot-product (SDP) attention scores matrix before performing softmax, as shown below:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + A\right)V$$
Here, custom attention bias is understood to mean that the vLLM attention backend allows $A$ to be an arbitrary matrix, provided the tensor dimensions are commensurate with the shape of the SDP attention scores matrix. This is in contrast to the existing vLLM attention backend implementations, which can only accommodate simple block-diagonal causal or non-causal masks which are uniformly either $0$ or $-\infty$ .
There are broadly two possible approaches to custom attention bias, which do not necessarily have to be mutually exclusive: passing a fully materialized bias tensor to the attention kernels, or fusing the bias computation (e.g. T5's relative positional bias) into the kernels themselves.
T5 employs custom attention bias in order to implement relative positional encoding, wherein pairwise positional relationships between tokens are represented by the bias matrix. The HuggingFace Transformers T5 implementation provides an example of how the relative positional encoding matrix is computed.
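To make the materialized-bias idea concrete, here is a minimal PyTorch sketch (not vLLM code) of scaled dot-product attention with an arbitrary additive bias, using a toy distance-decay bias in the spirit of relative positional encoding; the shapes and the bias itself are illustrative assumptions:

```python
import torch


def attention_with_bias(q, k, v, bias):
    """Scaled dot-product attention with an arbitrary additive bias A.

    q, k, v: [num_heads, seq_len, head_dim]
    bias:    [num_heads, seq_len, seq_len] (or broadcastable); added to the
             attention scores before softmax.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale  # [H, L, L]
    probs = torch.softmax(scores + bias, dim=-1)
    return torch.matmul(probs, v)


# Toy bias that depends only on relative position (i - j), the flavor of bias
# T5's relative positional encoding produces (the real T5 bias is bucketed
# and learned per head).
num_heads, seq_len, head_dim = 2, 8, 16
pos = torch.arange(seq_len)
rel = (pos[None, :] - pos[:, None]).float()          # [L, L] relative offsets
bias = (-rel.abs()).expand(num_heads, -1, -1) * 0.1  # decays with distance

q = torch.randn(num_heads, seq_len, head_dim)
k = torch.randn(num_heads, seq_len, head_dim)
v = torch.randn(num_heads, seq_len, head_dim)
out = attention_with_bias(q, k, v, bias)
print(out.shape)  # torch.Size([2, 8, 16])
```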
Existing attention bias support
Currently, no vLLM attention backend fully supports passing in a custom attention bias. This is primarily due to underlying kernel limitations. For example, the xFormers `memory_efficient_attention_forward` kernel is the only NVIDIA-GPU-oriented kernel which permits passing in an arbitrary PyTorch tensor as a materialized attention bias (via the `attn_bias` argument); at time of writing I have not investigated whether custom attention bias is supported by any of the kernels for AMD GPU, CPU, etc. Regardless, vLLM only employs xFormers `memory_efficient_attention_forward` for prefill; to my knowledge, none of the decode-phase kernels employed by vLLM can accept an arbitrary tensor as a custom attention bias, making custom attention bias impossible to apply end-to-end for both prefill and decode under the current vLLM implementation.

In addition to the lack of kernel-level support for custom attention bias, most vLLM backends also prevent passing a custom attention bias matrix to the underlying kernel. The exception is the XFormers backend, which accepts an attention bias via the `XFormersMetadata.attn_bias` attribute (however, the XFormers backend only utilizes `attn_bias` in the prefill phase.)
in the prefill phase.)Proposed methods for supporting custom attention bias
The following two approaches for supporting custom attention bias in vLLM are proposed:

- Pass a fully materialized custom attention bias tensor to the attention kernels via an `AttentionMetadata.attn_bias` field.
- Fuse the bias computation (e.g. T5's relative positional bias) into the attention kernels themselves.

It may make sense to support one or both of these methods.
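As a rough sketch of the first approach, and following the existing XFormers `attn_bias` pattern, a backend's metadata subclass could carry an optional materialized bias that the backend forwards to its kernels when present. The class and function names here are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional

import torch


@dataclass
class MyBackendMetadata:  # hypothetical AttentionMetadata subclass
    # Existing metadata fields (block tables, slot mapping, ...) omitted.
    seq_lens: List[int]
    # Fully materialized custom attention bias (e.g. T5's relative-position
    # bias), shaped to match the attention scores for this batch; None means
    # the backend falls back to its usual causal/non-causal mask.
    attn_bias: Optional[torch.Tensor] = None


def run_prefill(q, k, v, metadata: MyBackendMetadata):
    # Sketch of a backend branching on the presence of a custom bias;
    # q: [H, Lq, D], k/v: [H, Lkv, D].
    if metadata.attn_bias is not None:
        bias = metadata.attn_bias
    else:
        bias = torch.zeros(q.shape[0], q.shape[1], k.shape[1])
    scores = torch.matmul(q, k.transpose(-1, -2)) * q.shape[-1] ** -0.5
    return torch.softmax(scores + bias, dim=-1) @ v
```

The harder part, as noted above, is plumbing the same bias through the decode-phase kernels, which currently cannot accept an arbitrary bias tensor.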
Note that custom attention bias support must be added on a backend-by-backend basis, because of the kernel modifications & backend logic changes required.
Initial goals for introducing custom attention bias support
- Add an `attn_bias` attribute to the attention backend's `AttentionMetadata` subclass
- Pass the `attn_bias` attribute through to both the prefill and decode kernels

Final goals for introducing custom attention bias support
Some links which may be helpful for understanding how causal & non-causal attention masks are currently configured in vLLM:
- Invocation of flash-attention for prefill in the vLLM backend, using the `causal` flag
- Invocation of the xFormers attention kernel for prefill in the vLLM backend, using `BlockDiagonalMask` and `BlockDiagonalCausalMask`
- Invocation of the FlashInfer attention kernel for prefill in the backend, using the `causal` flag
- Invocation of the PagedAttention kernel for decode in the vLLM backend
- Invocation of the FlashInfer kernel for decode in the vLLM backend
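Following up on the mask types referenced in the links above, here is a small standalone xFormers sketch of how block-diagonal (non-causal and causal) masks are built from per-sequence lengths, plus a cross-attention-style mask with differing query and key/value lengths; the sequence lengths and tensor shapes are arbitrary examples:

```python
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import (
    BlockDiagonalCausalMask,
    BlockDiagonalMask,
)

# Two sequences of lengths 3 and 5 packed along the token dimension.
q_seqlens = [3, 5]

# Non-causal block-diagonal mask (encoder self-attention style).
enc_mask = BlockDiagonalMask.from_seqlens(q_seqlens)

# Causal block-diagonal mask (decoder self-attention style).
dec_mask = BlockDiagonalCausalMask.from_seqlens(q_seqlens)

# Cross-attention style: query lengths differ from key/value lengths.
cross_mask = BlockDiagonalMask.from_seqlens(q_seqlens, kv_seqlen=[7, 4])

# Prefill-style invocation (requires a GPU build of xFormers); tensors are
# [batch=1, total_tokens, heads, head_dim] with sequences packed together.
q = torch.randn(1, sum(q_seqlens), 4, 32, dtype=torch.float16, device="cuda")
k = torch.randn(1, sum(q_seqlens), 4, 32, dtype=torch.float16, device="cuda")
v = torch.randn(1, sum(q_seqlens), 4, 32, dtype=torch.float16, device="cuda")
out = memory_efficient_attention(q, k, v, attn_bias=dec_mask)
```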
Support CUDAGraph with encoder/decoder models #
Note: this topic is being tracked by Issue #7447
Steps to support CUDAGraph with encoder/decoder models:
- Remove the corresponding check (see `assert_enc_dec_mr_supported_scenario()` in `vllm/worker/utils.py`)
)Support pipeline-parallelism with encoder/decoder models #
Steps to support pipeline-parallelism with encoder/decoder models:
- Remove the corresponding check (see `assert_enc_dec_mr_supported_scenario()` in `vllm/worker/utils.py`)
)Support multi-step scheduling with encoder/decoder models #
Note: depends on #7000 landing in order to add multi-step scheduling support; it may be helpful to review the multi-step scheduling RFC ( #6854 )
Steps to support multi-step scheduling with encoder/decoder models:
- Add `EncoderDecoderModelRunner` multi-step support
Here it is proposed that these features are low-priority. Adding support for speculative decoding and automatic prefix caching would require a significant amount of effort to scope out and design the implementations.
Note that adding support for either of these features would require removing the assertions which fail when speculative decoding or automatic prefix caching are enabled for an encoder/decoder model (see `assert_enc_dec_mr_supported_scenario()` in `vllm/worker/utils.py`).
)Feedback Period.
Closed.
CC List.
@WoosukKwon
@robertgshaw2-neuralmagic
@mgoin
@tms
@njhill
@sroy745
@ywang96
@DarkLight1337
@js8544
Any Other Things.
No response