
[RFC]: Encoder/decoder models & feature compatibility #7366

afeldman-nm opened this issue Aug 9, 2024 · 12 comments
afeldman-nm commented Aug 9, 2024

Motivation #

There is significant interest in vLLM supporting encoder/decoder models. Issues #187 and #180, for example, request encoder/decoder model support. As a result, encoder/decoder support was recently introduced to vLLM via the following three PRs:

These three PRs make encoder/decoder model inference possible; however, they leave room for improvement in terms of (1) parity between vLLM's decoder-only and encoder/decoder request processing pipelines with respect to feature support, and (2) the number of encoder/decoder models which are supported.

The ask for the vLLM community is to contribute PRs which help bring vLLM encoder/decoder functionality to a similar level of maturity as that of vLLM's decoder-only functionality.

Proposed changes #

The support matrix below summarizes which encoder/decoder models have already been added & which features are currently compatible with the vLLM encoder/decoder pipeline, versus which features & models will require additional PRs to implement in the long-term:

| Model/feature | Model already available / feature already compatible with encoder/decoder? | Model/feature is a long-term compatibility goal? |
|---|---|---|
| Encoder/decoder infrastructure | Yes | Yes |
| BART | Yes | Yes |
| Whisper | No | Yes |
| T5 | No | Yes |
| Other enc/dec models | No | Yes |
| Quantization | Untested | Yes |
| Multimodality | No | Yes |
| Attention backends other than XFormers (esp. flash-attn, flashinfer) | No | Yes |
| Custom attention bias support | No | Yes |
| CUDAGraph | No (Issue #7447) | Yes |
| Pipeline parallelism | No | Yes |
| Speculative decoding | No | Low-priority but nice-to-have; difficult |
| Automatic prefix caching | No | Low-priority; difficult |
| Sliding window | No | No |
| Chunked prefill | No | No |
| LoRA | No | No |

This RFC gives an overview of those features & models which are not compatible with encoder/decoder currently, but which should be made compatible eventually (i.e. No in the second column, Yes in the third column in the support matrix.)

Note that there are features (automatic prefix caching/sliding window/chunked prefill/LoRA) which are not long-term compatibility goals.

Background #

Before continuing, it will be helpful to review the details of the new vLLM encoder/decoder infrastructure.

It will also be helpful to review this how-to guide for adding new encoder/decoder models & improving encoder/decoder feature compatibility.

Initial goal #

The initial goal is for members of the vLLM contributor community to identify models/features in the support matrix above for which they will work on writing a PR.

Detailed long-term goals #

Add new models to vLLM #

Please review the how-to guide for adding new models to vLLM

See tests/models/test_bart.py for an example of an encoder/decoder model unit test. See tests/distributed/test_basic_distributed_correctness_enc_dec.py for an example of an encoder/decoder model test with TP > 1.
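For orientation, below is a minimal standalone sketch (not the actual fixtures used in tests/models/test_bart.py; the model name, prompt, and output comparison are placeholders) of comparing vLLM greedy output against the HuggingFace reference for an encoder/decoder model:

```python
from transformers import AutoTokenizer, BartForConditionalGeneration
from vllm import LLM, SamplingParams

MODEL = "facebook/bart-large-cnn"
PROMPT = "The tower is 324 metres tall, about the same height as an 81-storey building."

# HuggingFace reference output (greedy decoding).
tokenizer = AutoTokenizer.from_pretrained(MODEL)
hf_model = BartForConditionalGeneration.from_pretrained(MODEL)
hf_ids = hf_model.generate(**tokenizer(PROMPT, return_tensors="pt"),
                           max_new_tokens=32, num_beams=1, do_sample=False)
print("HF  :", tokenizer.decode(hf_ids[0], skip_special_tokens=True))

# vLLM output; for encoder/decoder models a singleton text prompt is passed to
# the encoder, and the decoder starts from a default prompt.
llm = LLM(model=MODEL, dtype="float32")
outputs = llm.generate([PROMPT], SamplingParams(temperature=0.0, max_tokens=32))
print("vLLM:", outputs[0].outputs[0].text)

# A real test (as in tests/models/test_bart.py) would compare token IDs and
# logprobs rather than eyeballing the decoded strings.
```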

Add Whisper model #

Steps to add support for Whisper, a multimodal encoder/decoder speech recognition model:

Proposal: consider whether it makes sense to implement encoder/decoder multimodality, audio support, and Whisper in the same PR; that way, the Whisper model may be used to facilitate an end-to-end test of audio multimodality.

Add T5 model #

Note: T5 depends on custom attention bias being supported by at least one attention backend which also supports encoder attention & cross-attention; at time of writing no vLLM attention backend fulfills this requirement. The vLLM XFormers attention backend is the only backend which supports encoder/decoder models, but neither it nor any other vLLM attention backend supports custom attention bias. (Custom attention bias is required in order to support T5 relative positional encoding.)

Steps to add support for the T5 model:

  • Port HuggingFace T5 model to vLLM
    • This includes porting over the method which computes the custom attention bias matrix for T5 relative position encoding (a condensed sketch appears below, after the note)
  • Modify each T5 layer, where appropriate, to support TP > 1
    • The custom attention bias computation must also support TP > 1
  • Add a T5 test to tests/models/

Note: T5 was added to an older version of vLLM in #3117 , which could be a helpful starting-point
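As a reference for the porting step above, here is a condensed sketch of T5-style relative position bias computation, following the structure of the HuggingFace implementation (helper names here are illustrative); in vLLM this bias would ultimately need to be materialized per layer, or fused, and handed to an attention backend that accepts it:

```python
import math
import torch

def relative_position_bucket(relative_position: torch.Tensor,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> torch.Tensor:
    """Map signed relative positions to bucket indices (bidirectional case)."""
    num_buckets //= 2
    buckets = (relative_position > 0).long() * num_buckets
    rel = relative_position.abs()
    max_exact = num_buckets // 2
    is_small = rel < max_exact
    # Larger distances are bucketed logarithmically up to max_distance.
    rel_large = max_exact + (
        torch.log(rel.float() / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).long()
    rel_large = torch.minimum(rel_large, torch.full_like(rel_large, num_buckets - 1))
    return buckets + torch.where(is_small, rel, rel_large)

def t5_attention_bias(q_len: int, kv_len: int,
                      bias_embedding: torch.nn.Embedding) -> torch.Tensor:
    """Materialize the (1, n_heads, q_len, kv_len) additive attention bias."""
    context = torch.arange(q_len)[:, None]
    memory = torch.arange(kv_len)[None, :]
    buckets = relative_position_bucket(memory - context)
    return bias_embedding(buckets).permute(2, 0, 1).unsqueeze(0)
```

For TP > 1, presumably the relative-attention-bias embedding would be sharded across heads so that each rank materializes only the head slice it owns.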

Add other encoder/decoder models #

  • Review open vLLM issues on GitHub and identify other encoder/decoder models which are requested by users

Quantization #

The goal of this workstream is to make sure that quantization + encoder/decoder models is fully-tested, and to fill in any gaps (should they exist) in vLLM's support for quantized encoder/decoder models.

Steps to ensure that vLLM supports encoder/decoder models in combination with all existing vLLM quantization methods:

  • Identify the list of quantization methods which vLLM currently supports with decoder-only models.
  • Add unit tests for encoder/decoder models with all of these quantization methods (see the sketch after this list).
  • Determine which quantization methods are currently incompatible with vLLM encoder/decoder infrastructure.
  • Scope out the effort involved in making these quantization methods compatible & submit a PR making the change.
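As a sketch of what the unit-test step might look like (the parametrization list and model choice are assumptions, not existing vLLM tests; FP8 can be applied dynamically to an unquantized checkpoint, while methods like AWQ/GPTQ would need pre-quantized encoder/decoder checkpoints):

```python
import pytest
from vllm import LLM, SamplingParams

# Start from the quantization methods identified in step 1; "fp8" is used here
# because it does not require a pre-quantized checkpoint.
QUANT_METHODS = ["fp8"]

@pytest.mark.parametrize("quantization", QUANT_METHODS)
def test_enc_dec_quantization_smoke(quantization):
    llm = LLM(model="facebook/bart-large-cnn", quantization=quantization)
    out = llm.generate(["The rain in Spain falls mainly on the"],
                       SamplingParams(temperature=0.0, max_tokens=8))
    # Smoke check only; a full test would compare against an unquantized or
    # HuggingFace reference within a tolerance.
    assert out[0].outputs[0].text
```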

vLLM encoder/decoder infrastructure should be compatible with most of the existing vLLM quantization methods, because the specialized quantization kernels are only employed for GEMM operations involving the learned weight matrices ($W_q$, $W_k$, etc.), whereas the encoder/decoder work only modifies how the Attention(q, k, v, kv_cache) layer behaves and does not touch the learned weight matrices at all.

It is less clear whether vLLM encoder/decoder infrastructure is compatible with FP8, since a specialized quantized KV cache kernel is employed by the Attention(q, k, v, kv_cache) layer when FP8 KV cache quantization is enabled.
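For reference, FP8 KV cache quantization is enabled via the kv_cache_dtype engine argument; whether this code path interacts correctly with the encoder/decoder cross-attention KV cache is exactly what needs to be verified (illustrative snippet, untested for encoder/decoder models):

```python
from vllm import LLM

# Assumption: this may currently fail or misbehave for enc/dec models.
llm = LLM(model="facebook/bart-large-cnn", kv_cache_dtype="fp8")
```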

Support encoder/decoder multimodality #

Technically, vLLM already supports multimodality for models which have an "encoder" and a "decoder", e.g. LLaVA. However, LLaVA's decoder does not utilize cross-attention, so the model is essentially compatible with vLLM's pre-existing decoder-only infrastructure.

Critically, however, for encoder/decoder models with cross-attention such as Whisper, vLLM does not currently support multimodality of any sort. The processing pipeline does not extract or utilize multimodal data from the input prompt, and the EncoderDecoderModelRunner has an assert which fails if the multimodal config is not None. Addressing this is what is meant by "supporting encoder/decoder multimodality".

Steps to extend existing vLLM multimodality support to encoder/decoder models:

  • Review existing vLLM multimodality support in the decoder-only pipeline
  • Scope out a plan for adding encoder/decoder multimodality support.
  • Propose & implement one or more multimodal prompt formats for encoder/decoder models
  • Integrate multimodality support into encoder/decoder processing pipeline
  • Remove the assertion which fails when multimodality is enabled for an encoder/decoder model (see assert_enc_dec_mr_supported_scenario() in vllm/worker/utils.py)
  • Add one or more unit tests with multimodal data

There are a number of multimodal encoder/decoder models which will benefit from this feature. One possibility is to add multimodality support & a multimodal model such as Whisper in the same PR, so that Whisper may be used to facilitate an end-to-end test with multimodality.

Another possibility is to implement multimodality support in its own PR.

Considerations for designing multimodal encoder/decoder prompt formats #

One approach to designing the vLLM multimodal encoder/decoder prompt formats is to consider what we want the user experience to be for high-priority multimodal encoder/decoder models such as Whisper.

Initial proposal for multimodal encoder/decoder prompt formats

It may be helpful to review

Generally speaking, in encoder/decoder models based on cross-attention, the non-text input modality is passed to the encoder as input. Conversely, any text prompt is typically passed to the decoder as an input prompt.

The following two encoder/decoder multimodal prompt formats are tentatively proposed:

  • Singleton TextPrompt with multi_modal_data field

    • vLLM will extract the multi_modal_data and pass it to the encoder module
    • vLLM will extract the prompt text, tokenize it and pass the token-list to the decoder (note that this is the opposite of vLLM behavior for non-multimodal prompts, where the prompt text would be passed to the encoder.)

    For example, passing the TextPrompt below to vLLM BART

    TextPrompt(
        prompt="The rain in Spain falls mainly on the",
        multi_modal_data=<multi modal data structure>,
    )
    

    results in

    Encoder input: <multi modal data structure>
    Decoder prompt: "The rain in Spain falls mainly on the"
    
  • Singleton TokensPrompt with multi_modal_data field

    • vLLM will extract the multi_modal_data and pass it to the encoder module
    • vLLM will extract the token list and pass it unmodified to the decoder (note that this is the opposite of vLLM behavior for non-multimodal prompts, where the prompt tokens would be passed to the encoder.)

    For example, passing the TokensPrompt below to vLLM BART

    TokensPrompt(
        prompt_token_ids=[2, 0, 171, 5, 2],
        multi_modal_data=<multi modal data structure>,
    )
    

    results in

    Encoder input: <multi modal data structure>
    Decoder prompt: [2,0,171,5,2]
    

It may also be worth considering whether or how to support

  • ExplicitEncoderDecoderPrompts with multimodality (a hypothetical sketch follows this list)
  • An input prompt format which encapsulates only multimodal encoder inputs, with no associated decoder text/tokens prompt (this would result in the decoder being passed a "default" or empty prompt.)
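For the first of these options, a hypothetical sketch (this format is not currently supported; the model name is a placeholder) of attaching multimodal data to an ExplicitEncoderDecoderPrompt, with the multimodal data going to the encoder and the text/tokens prompt going to the decoder:

```python
from vllm import LLM, SamplingParams
from vllm.inputs import ExplicitEncoderDecoderPrompt, TextPrompt, TokensPrompt

prompt = ExplicitEncoderDecoderPrompt(
    encoder_prompt=TextPrompt(
        prompt="",  # no encoder text; the encoder consumes the multimodal data
        multi_modal_data=<multi modal data structure>,
    ),
    decoder_prompt=TokensPrompt(prompt_token_ids=[2, 0, 171, 5, 2]),
)

llm = LLM(model="<multimodal enc/dec model>")
outputs = llm.generate([prompt], SamplingParams(max_tokens=32))
```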

Add support for encoder attention and cross-attention to additional backends #

At time of writing, XFormers is the only vLLM attention backend which supports encoder attention & cross-attention.

The goal of this workstream would be to extend encoder attention & cross-attention support to additional backends, the highest-priority being flash-attention and flashinfer.

Reviewing encoder attention and cross-attention support in the XFormers backend would be a good starting-point for extending support to other backends.

For context on the requirements for a backend to support encoder and cross-attention, it may help to review the encoder/decoder architecture, the way that attention masks are currently constructed in the XFormers backend, and the recommended architecture for vLLM encoder/decoder models.
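As a schematic reference (plain PyTorch, not the vLLM backend interface or kernels), the sketch below illustrates the three attention variants such a backend must distinguish:

```python
import math
import torch

def sdpa(q, k, v, causal: bool):
    # q: (q_len, d); k, v: (kv_len, d)
    scores = q @ k.T / math.sqrt(q.shape[-1])
    if causal:
        mask = torch.ones(scores.shape).tril().bool()
        scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

d = 64
enc_x = torch.randn(7, d)  # encoder hidden states
dec_x = torch.randn(5, d)  # decoder hidden states

# 1. Encoder self-attention: bidirectional (non-causal); runs only at prefill.
enc_out = sdpa(enc_x, enc_x, enc_x, causal=False)

# 2. Decoder self-attention: causal; uses the usual (paged) decoder KV cache.
dec_out = sdpa(dec_x, dec_x, dec_x, causal=True)

# 3. Cross-attention: queries from the decoder, K/V from the encoder output;
#    non-causal, with a cross-attention KV cache that is filled at prefill
#    and only read during decode.
cross_out = sdpa(dec_out, enc_out, enc_out, causal=False)
```

A backend therefore needs separate attention-mask construction and KV-cache handling for each of these cases, which is what the XFormers backend currently implements.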

A summary of the key changes required for an attention backend to support encoder attention and cross-attention:

Initial goals

  • Identify the changes required to add encoder attention & cross-attention support to flash-attention and flashinfer
  • PR the required changes
    • Remove/modify any asserts which fail if the vLLM attention backend is not XFormers
      • Currently, the __init__() method of EncoderDecoderModelRunner invokes a method EncoderDecoderModelRunner._maybe_force_supported_attention_backend() defined here which (1) attempts to force encoder/decoder models to use XFormers attention backend, and (2) raises an exception if the user has overridden the attention backend to be anything other than XFormers.

Long-term goals

  • All vLLM attention backends support encoder attention and cross-attention

Support custom attention bias #

Note: T5 takes a dependency on custom attention bias. Custom attention bias is likely complex enough to merit its own PR.

Note: custom bias support was added to PagedAttention in an older version of vLLM as part of #3117 ; given changes in vLLM since then, additional work would be required to integrate this implementation.

Custom attention bias and relative positional encoding

Attention bias refers to adding a matrix $A$ to the scaled dot-product (SDP) attention scores matrix before performing softmax, as shown below:

$$ attn(Q,K,V,A) = softmax(\frac{Q K^T + A}{\sqrt{d}})V $$

Here, custom attention bias is understood to mean that the vLLM attention backend allows $A$ to be an arbitrary matrix, provided the tensor dimensions are commensurate with the shape of the SDP attention scores matrix. This is in contrast to the existing vLLM attention backend implementations, which can only accommodate simple block-diagonal causal or non-causal masks whose entries are uniformly either $0$ or $-\infty$.
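As a concrete reference, a naive, fully materialized implementation of the formula above (useful as numerical ground truth in backend unit tests):

```python
import math
import torch

def attention_with_bias(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        bias: torch.Tensor) -> torch.Tensor:
    # q: (..., q_len, d); k, v: (..., kv_len, d)
    # bias: arbitrary floats (including -inf), broadcastable to (..., q_len, kv_len)
    scores = (q @ k.transpose(-2, -1) + bias) / math.sqrt(q.shape[-1])
    return scores.softmax(dim=-1) @ v
```

Note that torch.nn.functional.scaled_dot_product_attention already accepts a floating-point attn_mask which is added to the attention scores and can serve the same role in a reference path (though it applies the bias after the $1/\sqrt{d}$ scaling rather than before, as in the formula above).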

There are broadly two possible approaches to custom attention bias, which are not necessarily mutually exclusive:

  • $A$ is a fully-materialized attention bias matrix passed to the attention backend
  • $A$ is computed on-the-fly by the attention kernel, using an element-wise formula for the attention bias which is fused with the $Q K^T$ and $softmax$ computations

T5 employs custom attention bias in order to implement relative positional encoding, wherein pairwise positional relationships between tokens are represented by the bias matrix. The HuggingFace Transformers T5 implementation provides an example of how the relative positional encoding matrix is computed.

Existing attention bias support

Currently, no vLLM attention backend fully supports passing in a custom attention bias, primarily due to underlying kernel limitations. For example, the xFormers memory_efficient_attention_forward kernel is the only NVIDIA-GPU-oriented kernel which permits passing an arbitrary PyTorch tensor as a materialized attention bias (via the attn_bias argument); at time of writing I have not investigated whether custom attention bias is supported by any of the kernels for AMD GPU, CPU, etc. Regardless, vLLM only employs xFormers memory_efficient_attention_forward for prefill; to my knowledge, none of the decode-phase kernels employed by vLLM can accept an arbitrary tensor as a custom attention bias, making custom attention bias impossible to apply end-to-end for both prefill and decode under the current vLLM implementation.

In addition to the lack of kernel-level support for custom attention bias, most vLLM backends also prevent passing a custom attention bias matrix to the underlying kernel. The exception is the XFormers backend, which accepts an attention bias via the XFormersMetadata.attn_bias attribute (however, the XFormers backend only utilizes attn_bias in the prefill phase).
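To make the current state concrete, a minimal sketch of how a materialized bias can reach the xFormers prefill kernel today (requires a CUDA GPU and fp16/bf16 inputs; exact shape and alignment constraints on attn_bias depend on the xFormers version):

```python
import torch
import xformers.ops as xops

B, M, H, K = 1, 128, 8, 64  # batch, seq len, heads, head dim
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# Arbitrary additive bias, e.g. a T5 relative-position bias for this layer.
attn_bias = torch.randn(B, H, M, M, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention_forward(q, k, v, attn_bias=attn_bias)
```

No equivalent path exists for the decode-phase kernels, which is the gap described above.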

Proposed methods for supporting custom attention bias

The following two approaches for supporting custom attention bias in vLLM are proposed:

  • Fully-materialized bias matrix: Modify vLLM attention backends to accept an arbitrary PyTorch tensor, passed into the backend via the AttentionMetadata.attn_bias field.
  • On-the-fly/fused bias matrix computation: Enable an efficient workflow whereby vLLM developers can tweak an attention kernel to compute the custom attention bias on the fly
    • For example: rather than computing the T5 relative position encoding bias matrix once, the attention kernel can fuse the element-wise bias formula with the $Q K^T$ and $softmax$ computations, so that the attention bias matrix is never fully materialized.
    • FlexAttention enables fused custom attention bias computations in a FlashAttention-style kernel, using torch.compile (see the sketch after this list).
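Below is a sketch of the fused approach using FlexAttention (assuming PyTorch >= 2.5 and a CUDA device; the bucketing is deliberately simplified relative to T5's logarithmic scheme):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

H, L, D, NUM_BUCKETS = 8, 128, 64, 32
bias_table = torch.randn(NUM_BUCKETS, H, device="cuda")  # per-head bucket bias

def relative_bias(score, b, h, q_idx, kv_idx):
    # Simplified bucketing: clamp the relative distance instead of the full
    # logarithmic T5 bucketing, purely for illustration.
    bucket = torch.clamp(kv_idx - q_idx + NUM_BUCKETS // 2, 0, NUM_BUCKETS - 1)
    return score + bias_table[bucket, h]

q = torch.randn(1, H, L, D, device="cuda")
k = torch.randn(1, H, L, D, device="cuda")
v = torch.randn(1, H, L, D, device="cuda")

# torch.compile fuses the score_mod into a FlashAttention-style kernel, so the
# bias matrix is never materialized.
compiled_flex_attention = torch.compile(flex_attention)
out = compiled_flex_attention(q, k, v, score_mod=relative_bias)
```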


It may make sense to support one or both of these methods.

Note that custom attention bias support must be added on a backend-by-backend basis, because of the kernel modifications & backend logic changes required.

Initial goals for introducing custom attention bias support

  1. Focus on a particular vLLM attention backend
  2. Scope out the effort involved in introducing custom attention bias support to this backend
  3. Some steps which will likely be involved in introducing custom attention bias support:
     • Augment the attention backend's kernels to accept custom attention bias; for example, the PagedAttention kernel (for the XFormers backend), the flash-attention kernel (for the flash-attn backend), or the FlashInfer kernels (for the FlashInfer backend)
     • (Except for XFormers) add an attn_bias attribute to the attention backend's AttentionMetadata subclass
     • Ensure that the attention backend passes the attn_bias attribute to both the prefill and decode kernels
  4. Add at least two custom attention bias unit tests (for prefill & decode respectively)

Final goals for introducing custom attention bias support

  • All vLLM attention backends should support custom attention bias, with unit tests

Some links which may be helpful for understanding how causal & non-causal attention masks are currently configured in vLLM:

Support CUDAGraph with encoder/decoder models #

Note: this topic is being tracked by Issue #7447

Steps to support CUDAGraph with encoder/decoder models:

  • Scope out the effort required to support CUDAGraph with encoder/decoder models
  • Write a PR for CUDAGraph + encoder/decoder
    • Remove the assertion which fails when CUDAGraph is enabled for an encoder/decoder model (see assert_enc_dec_mr_supported_scenario() in vllm/worker/utils.py)

Support pipeline-parallelism with encoder/decoder models #

Steps to support pipeline-parallelism with encoder/decoder models:

  • Scope out the effort required to support pipeline-parallelism with encoder/decoder models
  • Write a PR for pipeline-parallelism + encoder/decoder
    • Remove the assertion which fails when pipeline-parallelism is enabled for an encoder/decoder model (see assert_enc_dec_mr_supported_scenario() in vllm/worker/utils.py)

Support multi-step scheduling with encoder/decoder models #

Note: depends on #7000 landing in order to add multi-step scheduling support; it may be helpful to review the multi-step scheduling RFC ( #6854 )

Steps to support multi-step scheduling with encoder/decoder models:

  • Scope out the effort required to support multi-step scheduling
    • EncoderDecoderModelRunner multi-step support
  • Write a PR for multi-step scheduling + encoder/decoder
  • Write at least one test of an encoder/decoder model with multi-step scheduling

Low-priority high-effort tasks #

  • Speculative decoding
  • Automatic prefix caching

Here it is proposed that these features are low-priority. Adding support for speculative decoding and automatic prefix caching would require a significant amount of effort to scope out and design the implementations.

Note that adding support for either of these features would require removing the assertions which fail when speculative decoding or automatic prefix caching are enabled for an encoder/decoder model (see assert_enc_dec_mr_supported_scenario() in vllm/worker/utils.py)

Feedback Period.

Closed.

CC List.

@WoosukKwon
@robertgshaw2-neuralmagic
@mgoin
@tms
@njhill
@sroy745
@ywang96
@DarkLight1337
@js8544

Any Other Things.

No response

@afeldman-nm
Contributor Author

Looks like footnote references (i.e. [^1], [^2], etc.) are not rendered for RFCs on GitHub, so I just edited the RFC to replace all of the footnote references with direct links.

@robertgshaw2-neuralmagic
Collaborator

cc @mgoin

@afeldman-nm
Contributor Author

FYI, added a section to the RFC about adding multi-step scheduling + encoder/decoder support.

@robertgshaw2-neuralmagic

@DarkLight1337
Member

DarkLight1337 commented Sep 17, 2024

Sorry for mentioning this so late. For multimodal models, there are actually two ways to apply cross-attention:

  • Cross-attention between text and multimodal features (e.g. Llama 3.1 multimodal, Whisper)
  • Cross-attention between text features only (i.e. multimodal encoder with an encoder-decoder language model) (e.g. [New Model]: Florence-2 #5934, BLIP-2 w/ FLAN-T5)

I wonder how the current plan could handle the latter case. To keep the API consistent, we should distinguish between the above two cases internally when multimodal data is passed.

@DarkLight1337
Member

Tagging #8811 for future reference

@NickLucche
Contributor

I can start looking into T5 support starting from the custom attention bias 🤞🏻

@robertgshaw2-neuralmagic
Collaborator

I can start looking into T5 support starting from the custom attention bias 🤞🏻

Thanks @NickLucche - this would be greatly appreciated

@sroy745
Collaborator

sroy745 commented Oct 22, 2024


I can start looking into T5 support starting from the custom attention bias 🤞🏻

Thanks @NickLucche - this would be greatly appreciated

Hi @NickLucche are you looking at any particular kernel for adding the custom bias needed for T5? FYI I am trying to add support to make the encoder-decoder models work with the flash_attn kernel.

@robertgshaw2-neuralmagic
Collaborator


I can start looking into T5 support starting from the custom attention bias 🤞🏻

Thanks @NickLucche - this would be greatly appreciated

Hi @NickLucche are you looking at any particular kernel for adding the custom bias needed for T5? FYI I am trying to add support to make the encoder-decoder models work with the flash_attn kernel.

@sroy745 - I think it's probably easiest to do it with the paged attention backend, but doing it in flash_attn would be best

@njhill
Member

njhill commented Oct 22, 2024

For us the most important thing is having something functional for T5, so I was also thinking the xformers option might be preferable/easier for the initial support.

@NickLucche
Contributor

NickLucche commented Oct 25, 2024

Thanks for the quick feedback!
@sroy745 @robertgshaw2-neuralmagic @njhill so xformers does indeed appear to be the easiest, but it has to be paired with a PagedAttention modification too; AFAIK xformers is not used in decode because it has no concept of KV paging/block tables.

FlashAttention would be super interesting to explore, but will likely require PRs to https://github.com/vllm-project/flash-attention or to the original repo. Even then I am not sure if a fully materialized bias matrix option would be accepted there, while the fused approach would be too specific to T5 to live there.

Personally I would like to start with the "less-efficient"/easier approach first (fully materialized bias) and leave optimization for later, unless there's a more mature implementation we can somehow integrate here.

@jeejeelee
Collaborator

Why isn't LoRA included as a long-term-support feature? What considerations are there behind this decision? I've recently been working on supporting LoRA for mllama, and currently my main challenge is dealing with cross-attention.
