
[Question] About MultiHeadAttention's inputs shape. #3831

Closed

matrix97317 opened this issue Apr 28, 2024 · 7 comments
Assignees
Labels
triaged Issue has been triaged by maintainers

Comments

@matrix97317

Hi, I have a question about MHAv2. MHAv2 uses [S, B, 3*E, 1, 1] as its input shape. Must S be the same for Q, K, and V? I think K and V must have the same S, but Q need not.

@matrix97317
Author

matrix97317 commented Apr 29, 2024

@lix19937

lix19937 commented May 2, 2024

Do you mean bertQKVToContextPlugin's fused_multihead_attention_v2?

The input tensor contains all 3 matrices Q, K, V:

- This input tensor is computed by multiplying a tensor of size [S, B, E] with the weights W_qkv of size [E, 3 * E].
- The weight matrix W_qkv is NOT just the vertical concatenation of the individual matrices, W_tmp = [W_q', W_k', W_v']'. Instead, start with W_tmp, reshape it into [E, 3, N, H] (where N * H = E, N is the number of heads, and H is the head size), transpose it into [E, N, 3, H], and reshape it back to [E, 3 * E]. The interpretation is to lay out the k-th heads of Q, K, and V next to each other, instead of first all N heads of Q, then all N heads of K, then all N heads of V.

ref https://github.com/NVIDIA/TensorRT/blob/release/8.5/plugin/bertQKVToContextPlugin/README.md
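A minimal sketch of that weight re-layout (not from the thread; it assumes right-multiplying per-projection weights of shape [E, E], i.e. out = x @ W):

```python
import torch

E = 768          # hidden size
N = 12           # number of heads
H = E // N       # head size

# Hypothetical per-projection weights, laid out [in_features, out_features]
# so a (S, B, E) activation is right-multiplied: out = x @ W.
W_q = torch.randn(E, E)
W_k = torch.randn(E, E)
W_v = torch.randn(E, E)

# Naive concatenation: all of Q, then all of K, then all of V.
W_tmp = torch.cat([W_q, W_k, W_v], dim=1)        # [E, 3*E]

# Re-layout as the plugin expects: interleave per head,
# [E, 3*E] -> [E, 3, N, H] -> [E, N, 3, H] -> [E, 3*E]
W_qkv = (W_tmp.view(E, 3, N, H)
              .transpose(1, 2)
              .contiguous()
              .view(E, 3 * E))
```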


S : seq_len
B : batch_size
E : hidden size

```python
## x, y, z are the input embeddings; each has shape (S, B, E)
## self.Wq, self.Wk, self.Wv are separate E -> E projections (e.g. nn.Linear(E, E))
Q = self.Wq(x)
K = self.Wk(y)
V = self.Wv(z)
qkv = torch.cat([Q, K, V], dim=2)   # (S, B, 3*E)
qkv = qkv.view(x.size(0), x.size(1), 3, self.num_heads, self.size_per_head)

## last_qkv is the plugin's single input, shape (S, B, 3*E, 1, 1)
last_qkv = qkv.transpose(2, 3).contiguous().view(x.size(0), x.size(1), 3*self.hidden_size, 1, 1)
```

The weights of self.Wq, self.Wk, and self.Wv can be different.
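For reference, a self-contained version of the snippet above with hypothetical sizes (S = 128, B = 8, E = 768, 12 heads), just to check the final shape the plugin expects:

```python
import torch
import torch.nn as nn

S, B, E, num_heads = 128, 8, 768, 12      # hypothetical sizes
size_per_head = E // num_heads

Wq, Wk, Wv = nn.Linear(E, E), nn.Linear(E, E), nn.Linear(E, E)
x = y = z = torch.randn(S, B, E)          # self-attention: same tensor for all three

Q, K, V = Wq(x), Wk(y), Wv(z)
qkv = torch.cat([Q, K, V], dim=2)                          # (S, B, 3*E)
qkv = qkv.view(S, B, 3, num_heads, size_per_head)
last_qkv = qkv.transpose(2, 3).contiguous().view(S, B, 3 * E, 1, 1)

print(last_qkv.shape)   # torch.Size([128, 8, 2304, 1, 1])
```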

From the torch.nn.MultiheadAttention point of view:

```python
forward(query,
        key,
        value,
        key_padding_mask=None,
        need_weights=True,
        attn_mask=None,
        average_attn_weights=True,
        is_causal=False)
# https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
```

If Q, K, V are the query, key, and value embeddings (corresponding to x, y, z above), they can be the same or different.
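For example, a minimal sketch (sizes are hypothetical) where the query sequence is shorter than the key/value sequence:

```python
import torch
import torch.nn as nn

E, num_heads = 768, 12
S_q, S_kv, B = 16, 128, 8            # query vs. key/value sequence lengths differ

mha = nn.MultiheadAttention(embed_dim=E, num_heads=num_heads)   # default layout is (S, B, E)

query = torch.randn(S_q, B, E)
key   = torch.randn(S_kv, B, E)
value = torch.randn(S_kv, B, E)      # key and value must share a sequence length

out, attn_weights = mha(query, key, value)
print(out.shape)            # torch.Size([16, 8, 768])  -- follows the query length
print(attn_weights.shape)   # torch.Size([8, 16, 128])  -- averaged over heads by default
```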

@ttyio
Collaborator

ttyio commented May 2, 2024

Yes, MHA uses sequenceTo for K and V and sequenceFrom for Q, so you are right. In demoBERT, since Q, K, and V have the same sequence length, we simplified the problem: the plugin only supports the same sequence length, with Q, K, and V horizontally merged into a single buffer as the input.

@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label May 3, 2024
@matrix97317
Author

@lix19937 @ttyio So the current bertQKVToContextPlugin requires Q's seqlen == K's seqlen == V's seqlen? If so, my problem is solved. Are you considering supporting Q's seqlen != K's seqlen?

@lix19937

lix19937 commented May 3, 2024

> the current bertQKVToContextPlugin requires Q's seqlen == K's seqlen == V's seqlen?

yes

@ttyio
Collaborator

ttyio commented May 3, 2024

@matrix97317 Could you try importing the ONNX directly and letting TRT do the MHA fusion? The native MHA fusion in TRT supports different sequence lengths. See https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#mha-fusion
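A minimal sketch of that path (the module, sizes, and file name are hypothetical; whether TRT actually fuses the pattern into an MHA kernel depends on the TRT version and the exported graph):

```python
import torch
import torch.nn as nn

# Cross-attention with different query / key-value sequence lengths,
# exported to ONNX so TensorRT can attempt its native MHA fusion.
class CrossAttention(nn.Module):
    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, q, kv):
        out, _ = self.mha(q, kv, kv, need_weights=False)
        return out

model = CrossAttention().eval()
q  = torch.randn(8, 16, 768)     # (B, S_q, E)
kv = torch.randn(8, 128, 768)    # (B, S_kv, E)

torch.onnx.export(model, (q, kv), "cross_attention.onnx",
                  input_names=["q", "kv"], output_names=["out"],
                  opset_version=17)

# Then build an engine and let TRT fuse the attention, e.g.:
#   trtexec --onnx=cross_attention.onnx --fp16
```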

@ttyio
Collaborator

ttyio commented Jul 2, 2024

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks all!

@ttyio ttyio closed this as completed Jul 2, 2024