
Question about calculation of Q and transpose(K). #10

Open
jaes77 opened this issue Apr 20, 2023 · 0 comments

Comments


jaes77 commented Apr 20, 2023

Thanks for your effort in building this great project.

In standard attention, the input to the softmax is matmul(Q, K_T), with shape (batch, num_heads, q_len, k_len).
The attention mask is triangular (its full shape is q_len x k_len), so matmul(Q, K_T) is masked with that attention mask before the softmax.
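
For reference, here is a minimal sketch of how I understand the standard (un-chunked) case; the shapes, the causal mask, and the -1e10 fill value are just placeholders I picked for illustration, not code from this repo:

import jax
import jax.numpy as jnp

batch, num_heads, q_len, k_len, d = 2, 4, 8, 8, 16
q = jnp.ones((batch, num_heads, q_len, d))
k = jnp.ones((batch, num_heads, k_len, d))

# Full score matrix: (batch, num_heads, q_len, k_len)
scores = jnp.einsum('b h i d, b h j d -> b h i j', q, k) / jnp.sqrt(d)

# Triangular (causal) mask of shape (q_len, k_len)
causal_mask = jnp.tril(jnp.ones((q_len, k_len), dtype=bool))

# Mask out disallowed positions before the softmax
scores = jnp.where(causal_mask, scores, -1e10)
attn = jax.nn.softmax(scores, axis=-1)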

However, comparing with the original attention algorithm flow, I don't understand how matmul(q_chunk, transpose(k_chunk)) in the code lines below ends up producing the correctly masked input to the softmax.

attn_weights = einsum('i ... d, j ... d -> i ... j', q_scaled, k_chunk)
key_mask_chunk = rearrange(key_mask_chunk, 'j b -> 1 b 1 j')
attn_weights = jnp.where(key_mask_chunk, attn_weights, MASK_VALUE)
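
To make my confusion concrete, here is my own small sketch (not code from this repo) of what I think one chunk computes, with the '...' dims written out explicitly as 'b h' and without the scaling of q. Under that assumption, the chunked einsum is just one (q_chunk, k_chunk) block of the full matmul(Q, K_T), and key_mask_chunk only masks whole key positions rather than applying the triangular pattern:

import jax.numpy as jnp
from einops import rearrange

q_len, k_len, batch, heads, d = 8, 8, 2, 4, 16
chunk = 4

# Sequence-first layout, matching the 'i ... d, j ... d -> i ... j' pattern above
q = jnp.ones((q_len, batch, heads, d))
k = jnp.ones((k_len, batch, heads, d))
key_mask = jnp.ones((k_len, batch), dtype=bool)  # True = keep this key position

# Full (un-chunked) scores: (q_len, batch, heads, k_len)
full = jnp.einsum('i b h d, j b h d -> i b h j', q, k)

# One block computed the chunked way
q_chunk, k_chunk = q[:chunk], k[chunk:2 * chunk]
key_mask_chunk = key_mask[chunk:2 * chunk]

attn_weights = jnp.einsum('i b h d, j b h d -> i b h j', q_chunk, k_chunk)
key_mask_chunk = rearrange(key_mask_chunk, 'j b -> 1 b 1 j')
attn_weights = jnp.where(key_mask_chunk, attn_weights, -1e10)

# Where the key mask is True, this block equals the corresponding block of the full score matrix
assert jnp.allclose(attn_weights, full[:chunk, :, :, chunk:2 * chunk])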

Can you explain this in detail?
