Any plan to support paged attention? #660

First of all, thank you for the great work!
Is there any plan to support paged KV cache in non-contiguous memory? For instance, in flash_attn_with_kvcache?

Comments
It's not the highest priority at the moment. Does the implementation from vLLM not work well?
Their scope is slightly different from flash_attn_with_kvcache: vLLM's kernel only supports decoding (one token per batch), I suppose. In many scenarios, such as speculative decoding, flash_attn_with_kvcache is preferable because it can compute multiple tokens per batch in parallel.
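For illustration, a minimal sketch of that multi-token decode call with the contiguous (non-paged) cache layout; the shapes follow my reading of the flash_attn_with_kvcache docstring, and all sizes below are made up:

```python
# Sketch only: multi-token decode with flash_attn_with_kvcache and a contiguous
# [B, L, H, D] cache. All sizes are illustrative.
import torch
from flash_attn import flash_attn_with_kvcache

batch, max_seqlen, nheads, headdim = 2, 4096, 8, 128
device, dtype = "cuda", torch.float16

k_cache = torch.zeros(batch, max_seqlen, nheads, headdim, device=device, dtype=dtype)
v_cache = torch.zeros_like(k_cache)
cache_seqlens = torch.tensor([120, 340], dtype=torch.int32, device=device)  # tokens already cached

# Speculative decoding: score 4 draft tokens per sequence in a single kernel call
q = torch.randn(batch, 4, nheads, headdim, device=device, dtype=dtype)
k_new = torch.randn(batch, 4, nheads, headdim, device=device, dtype=dtype)
v_new = torch.randn_like(k_new)

# k_new/v_new are appended into the cache at cache_seqlens, then attention is computed
out = flash_attn_with_kvcache(
    q, k_cache, v_cache, k=k_new, v=v_new,
    cache_seqlens=cache_seqlens, causal=True,
)
# out: (batch, 4, nheads, headdim)
```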
Is vLLM planning to implement a version that can support more than 1 token?
Does it make sense to have paged KV cache as a standalone function without all the cache management kernels (in vLLM)? How would one use paged KV cache without a cache manager to copy / update the pages?
I have no information on that, but I can ask them in the vLLM repo. See vllm-project/vllm#1598 for reference.
Yes, it is cache-manager dependent, as FlashAttention and vLLM use different KV cache formats ([B, L, H, D] vs. [n_blocks, H, D//x, block_size, x]). But I think it should be fine: the biggest obstacle on my side is that I cannot find a set of kernels that supports both paged prefill and paged decode. The cache manager is not a big issue for me because it can be implemented in ~100 lines of Python code. As a user, as long as I have the kernels, I would gladly implement a cache manager myself that fits the kernel format.
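As a rough illustration of the kind of cache manager meant here (a toy sketch, not code from flash-attn or vLLM; all names are hypothetical), one can keep a shared pool of fixed-size blocks plus a per-sequence block table:

```python
# Toy page-table cache manager (hypothetical; not taken from flash-attn or vLLM):
# a shared pool of fixed-size blocks plus a per-sequence block table.
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block indices

    def allocate(self, seq_id: int, current_len: int, num_new_tokens: int):
        """Ensure seq_id owns enough blocks for current_len + num_new_tokens tokens."""
        needed = -(-(current_len + num_new_tokens) // self.block_size)  # ceil division
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) < needed:
            if not self.free_blocks:
                raise RuntimeError("KV cache pool exhausted; evict or preempt a sequence")
            table.append(self.free_blocks.pop())
        return table

    def free(self, seq_id: int):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Before each kernel launch, allocate would be called per sequence and the returned lists packed into whatever block-table layout the paged kernel expects.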
I'm not sure I understand what paged prefill means; can you say more?
@donglinz did you ever find a solution? I noticed that you closed vllm-project/vllm#1598.
For the prefill, no cache will be used. I just replaced xformers with FA, since xformers does not support MQA/GQA, and found that the attention calculation (softmax(Q @ K^T * softmax_scale) @ V) latency is reduced by more than 2x. More details can be found in vllm-project/vllm#1880. For the decode stage, we should either rewrite the paged attention kernel in vLLM or modify the FlashAttention kernel to support paged KV cache. I have not evaluated the workload yet.
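A minimal sketch of that prefill path with GQA (more query heads than KV heads), assuming the standard flash_attn_func interface; all shapes are illustrative:

```python
# Sketch of a prefill call with GQA (32 query heads sharing 8 KV heads),
# which flash_attn_func handles directly; shapes are illustrative only.
import math
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads_q, nheads_kv, headdim = 2, 1024, 32, 8, 128
device, dtype = "cuda", torch.float16

q = torch.randn(batch, seqlen, nheads_q, headdim, device=device, dtype=dtype)
k = torch.randn(batch, seqlen, nheads_kv, headdim, device=device, dtype=dtype)
v = torch.randn_like(k)

# Computes softmax(Q @ K^T * softmax_scale) @ V with a causal mask;
# softmax_scale defaults to 1 / sqrt(headdim) if omitted
out = flash_attn_func(q, k, v, softmax_scale=1.0 / math.sqrt(headdim), causal=True)
# out: (batch, seqlen, nheads_q, headdim)
```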
flash-attn now supports paged KV cache as of v2.5.0. You'd need to implement your own cache manager.
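A hedged sketch of how the paged path might be called, based on my reading of the v2.5 flash_attn_with_kvcache docstring; the pool and page sizes are arbitrary, and the required page size may differ between versions:

```python
# Sketch only: paged KV cache via block_table in flash_attn_with_kvcache (flash-attn >= 2.5.0).
# Pool and page sizes are arbitrary; the page-size constraint (e.g. multiple of 256)
# may differ between flash-attn versions.
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim = 2, 8, 128
page_size = 256                 # tokens per physical block
num_blocks = 64                 # total blocks in the shared pool
max_blocks_per_seq = 8

device, dtype = "cuda", torch.float16
# Paged pool: (num_blocks, page_size, nheads_k, headdim) instead of (batch, seqlen, ...)
k_cache = torch.zeros(num_blocks, page_size, nheads, headdim, device=device, dtype=dtype)
v_cache = torch.zeros_like(k_cache)

# block_table[i, j] = physical block holding logical page j of sequence i (int32)
block_table = torch.arange(batch * max_blocks_per_seq, dtype=torch.int32, device=device)
block_table = block_table.reshape(batch, max_blocks_per_seq)

cache_seqlens = torch.tensor([500, 700], dtype=torch.int32, device=device)

# Multiple query tokens per sequence still work in the paged path
q = torch.randn(batch, 4, nheads, headdim, device=device, dtype=dtype)
k_new = torch.randn(batch, 4, nheads, headdim, device=device, dtype=dtype)
v_new = torch.randn_like(k_new)

out = flash_attn_with_kvcache(
    q, k_cache, v_cache, k=k_new, v=v_new,
    cache_seqlens=cache_seqlens, block_table=block_table, causal=True,
)
# out: (batch, 4, nheads, headdim)
```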
For those who are interested, here's a simple cache manager: https://github.com/tspeterkim/paged-attention-minimal/ |