Support paged kv cache for single batch chat module #1651

Merged: 5 commits into mlc-ai:main on Feb 2, 2024

Conversation

@cyx-6 (Contributor) commented on Jan 23, 2024

This PR adds support for the paged KV cache to the single-batch chat module.
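
For readers new to the idea, the sketch below is a minimal, self-contained illustration of what "paged" means for a KV cache: K/V entries live in fixed-size pages drawn from a shared pool, and each sequence keeps a page table mapping logical token positions to pages. It is a toy model only; the names (`ToyPagedKVCache`, `PAGE_SIZE`) are invented and this is not the actual TVM/MLC-LLM implementation.

```python
# Toy model of a paged KV cache (illustration only, not the mlc-llm/TVM code).
PAGE_SIZE = 16  # tokens per page (illustrative)

class ToyPagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # shared pool of physical pages
        self.page_table = {}                      # seq_id -> list of page ids
        self.seq_len = {}                         # seq_id -> tokens stored so far

    def add_sequence(self, seq_id: int) -> None:
        self.page_table[seq_id] = []
        self.seq_len[seq_id] = 0

    def append(self, seq_id: int, num_tokens: int) -> None:
        """Reserve room for num_tokens new K/V entries of one sequence."""
        for _ in range(num_tokens):
            if self.seq_len[seq_id] % PAGE_SIZE == 0:
                # current page is full (or the sequence has none yet): take a new one
                self.page_table[seq_id].append(self.free_pages.pop())
            self.seq_len[seq_id] += 1

    def location(self, seq_id: int, pos: int) -> tuple:
        """Map a logical token position to (page id, offset within page)."""
        return self.page_table[seq_id][pos // PAGE_SIZE], pos % PAGE_SIZE

cache = ToyPagedKVCache(num_pages=8)
cache.add_sequence(0)
cache.append(0, 20)           # prefill 20 tokens -> occupies 2 pages
print(cache.location(0, 17))  # (second allocated page, offset 1)
```

Even though the chat module here handles a single sequence, the paged layout lets it share one KV cache interface with the batched path (the "unite the interface" goal mentioned later in this thread).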

@tqchen (Contributor) commented on Jan 24, 2024

cc @MasterJH5574.

Would be great to have two things:

  • benchmark the speed difference before/after on CUDA to see regressions (a minimal timing sketch follows below),
  • broadly test Vulkan, Metal, and ROCm to see if we can observe regressions.
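
A minimal timing harness along the following lines is enough for the before/after comparison; `decode_one_token` is a hypothetical callback standing in for one decode step of the chat module, not an mlc-llm API.

```python
import time

def measure_decode_tokens_per_sec(decode_one_token, num_tokens: int = 256,
                                  warmup: int = 16) -> float:
    """Time num_tokens decode steps and report throughput in tokens/s."""
    for _ in range(warmup):              # warm up kernels and caches first
        decode_one_token()
    start = time.perf_counter()
    for _ in range(num_tokens):
        decode_one_token()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Run once on the commit before this PR and once after, on each backend
# (CUDA, Vulkan, Metal, ROCm), and compare the reported tokens/s.
```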

@MasterJH5574 (Member) left a comment

Thanks @cyx-6! Overall looks good. Since #1650 was just merged, could you please rebase this branch onto main? I think there are some changes regarding:

  • the interface of create_kv_cache_func_,
  • using attention_with_fused_qkv in the llama attention (see the sketch after this list).
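
On the second point, "fused QKV" means the projection emits one tensor holding Q, K, and V side by side, and the cache's attention op splits it internally. The sketch below is illustrative only; the shapes and the real `attention_with_fused_qkv` signature in mlc-llm may differ.

```python
import numpy as np

# Illustrative llama-7b-like shapes (assumptions, not taken from the PR).
seq_len, num_q_heads, num_kv_heads, head_dim = 8, 32, 32, 128
fused_width = (num_q_heads + 2 * num_kv_heads) * head_dim

# One projection produces Q, K, and V packed along the feature axis.
qkv = np.random.rand(seq_len, fused_width).astype("float16")

# What the fused-QKV attention op does internally before appending K/V
# into the paged cache and computing attention:
q, k, v = np.split(
    qkv,
    [num_q_heads * head_dim, (num_q_heads + num_kv_heads) * head_dim],
    axis=1,
)
q = q.reshape(seq_len, num_q_heads, head_dim)
k = k.reshape(seq_len, num_kv_heads, head_dim)
v = v.reshape(seq_len, num_kv_heads, head_dim)
```

Handing the fused tensor to the cache in one call avoids splitting Q/K/V in the model code and keeps the KV append and attention behind a single cache call.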

cyx-6 added 2 commits on January 27, 2024.
@MasterJH5574 merged commit 370dcbf into mlc-ai:main on Feb 2, 2024.
1 check passed
@tqchen (Contributor) commented on Feb 2, 2024

cc @CharlieFRuan on the WebLLM runtime update

@CharlieFRuan (Contributor) commented on Feb 3, 2024

This PR breaks llama on ROCm (single GPU):

Compiling and running llama-2-7b-q4f16_1 runs into:

  File "/home/cfruan/tvm-unity/src/runtime/relax_vm/paged_kv_cache.cc", line 725
TVMError: Check failed: total_seq_length == qkv_data->shape[0] (68 vs. 38) :

Edit: the above is fixed by #1710 and is not ROCm-specific, but ROCm still observes an accuracy issue.

Compiling and running llama-2-7b-q4f32_1 runs into Segmentation fault (core dumped) with no extra info. (Edit: this is because the TIR attention kernel does not support fp32 yet.)

Confirmed that no issues are observed when rebased onto #1627.

Update: all issues are fixed now.
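
For context on the first error: the failing check asserts that the fused QKV tensor handed to the cache has exactly one row per token being processed, summed over all sequences in the call. The snippet below is a Python paraphrase of that invariant, not the actual C++ code in paged_kv_cache.cc; the tensor width is illustrative.

```python
def check_append_shapes(seq_lengths, qkv_shape) -> None:
    """Mirror of the shape invariant: rows of fused QKV == total tokens appended."""
    total_seq_length = sum(seq_lengths)
    assert total_seq_length == qkv_shape[0], (
        "Check failed: total_seq_length == qkv_data->shape[0] "
        f"({total_seq_length} vs. {qkv_shape[0]})"
    )

try:
    # A mismatch like the one reported above: 68 tokens claimed, 38 rows supplied.
    check_append_shapes([68], (38, 10240))
except AssertionError as err:
    print(err)
```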

CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request on Feb 13, 2024
PagedKVCache was introduced in MLC-LLM a while back to unify the
KVCache interface. This PR makes WebLLM compatible with the new
PagedKVCache interface, encapsulating it so that WebLLM users
will not notice any difference.

This PR is equivalent to the changes to `llm_chat.cc` in
mlc-ai/mlc-llm#1651, and should address issues
like mlc-ai/mlc-llm#1628.

There are still existing model compilation issues regarding
`workgroup_size` (since WebGPU, unlike most other backends, supports
only 256 threads per workgroup). We will address this issue more
elegantly soon; for now, compiling llama-based models requires manually
changing kernel sizes as shown in [this
branch](https://github.com/CharlieFRuan/mlc-llm/tree/local-workgroupSize-webLLM-kvCache).

This PR is also largely dependent on
apache/tvm#16554.
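
The `workgroup_size` constraint mentioned in the commit message above is easy to check for a candidate launch configuration. The helper below is purely illustrative and not part of web-llm or TVM; it only encodes the 256-thread limit referenced in the note.

```python
import math

WEBGPU_MAX_THREADS_PER_WORKGROUP = 256  # limit referenced in the commit message

def fits_webgpu_workgroup(block_dims) -> bool:
    """True if a thread-block shape stays within WebGPU's workgroup limit."""
    return math.prod(block_dims) <= WEBGPU_MAX_THREADS_PER_WORKGROUP

print(fits_webgpu_workgroup((16, 16, 1)))  # True:  256 threads
print(fits_webgpu_workgroup((32, 32, 1)))  # False: 1024 threads, must be re-tiled
```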