Support paged kv cache for single batch chat module #1651

Merged: 5 commits into mlc-ai:main on Feb 2, 2024

Conversation

@cyx-6 (Contributor) commented on Jan 23, 2024

This PR adds support for the paged KV cache to the single-batch chat module.
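
For readers new to the idea, the sketch below is a minimal, self-contained illustration of what "paged" means for a KV cache: K/V entries live in fixed-size pages drawn from a shared pool, and each sequence keeps a page table mapping logical token positions to pages. It is a toy model only; the names (`ToyPagedKVCache`, `PAGE_SIZE`) are invented and this is not the actual TVM/MLC-LLM implementation.

```python
# Toy model of a paged KV cache (illustration only, not the mlc-llm/TVM code).
PAGE_SIZE = 16  # tokens per page (illustrative)

class ToyPagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # shared pool of physical pages
        self.page_table = {}                      # seq_id -> list of page ids
        self.seq_len = {}                         # seq_id -> tokens stored so far

    def add_sequence(self, seq_id: int) -> None:
        self.page_table[seq_id] = []
        self.seq_len[seq_id] = 0

    def append(self, seq_id: int, num_tokens: int) -> None:
        """Reserve room for num_tokens new K/V entries of one sequence."""
        for _ in range(num_tokens):
            if self.seq_len[seq_id] % PAGE_SIZE == 0:
                # current page is full (or the sequence has none yet): take a new one
                self.page_table[seq_id].append(self.free_pages.pop())
            self.seq_len[seq_id] += 1

    def location(self, seq_id: int, pos: int) -> tuple:
        """Map a logical token position to (page id, offset within page)."""
        return self.page_table[seq_id][pos // PAGE_SIZE], pos % PAGE_SIZE

cache = ToyPagedKVCache(num_pages=8)
cache.add_sequence(0)
cache.append(0, 20)           # prefill 20 tokens -> occupies 2 pages
print(cache.location(0, 17))  # (second allocated page, offset 1)
```

Even though the chat module here handles a single sequence, the paged layout lets it share one KV cache interface with the batched path (the "unite the interface" goal mentioned later in this thread).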

@tqchen (Contributor) commented on Jan 24, 2024

cc @MasterJH5574.

Would be great to have two things:

  • benchmark the speed difference before/after on CUDA to see regressions (a minimal timing sketch follows below),
  • broadly test Vulkan, Metal, and ROCm to see if we can observe regressions.
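
A minimal timing harness along the following lines is enough for the before/after comparison; `decode_one_token` is a hypothetical callback standing in for one decode step of the chat module, not an mlc-llm API.

```python
import time

def measure_decode_tokens_per_sec(decode_one_token, num_tokens: int = 256,
                                  warmup: int = 16) -> float:
    """Time num_tokens decode steps and report throughput in tokens/s."""
    for _ in range(warmup):              # warm up kernels and caches first
        decode_one_token()
    start = time.perf_counter()
    for _ in range(num_tokens):
        decode_one_token()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Run once on the commit before this PR and once after, on each backend
# (CUDA, Vulkan, Metal, ROCm), and compare the reported tokens/s.
```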

@MasterJH5574 (Member) left a comment

Thanks @cyx-6! Overall looks good. Since #1650 was just merged, could you please rebase this branch onto main? I think there are some changes regarding:

  • the interface of create_kv_cache_func_,
  • using attention_with_fused_qkv in the llama attention (see the sketch after this list).
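
On the second point, "fused QKV" means the projection emits one tensor holding Q, K, and V side by side, and the cache's attention op splits it internally. The sketch below is illustrative only; the shapes and the real `attention_with_fused_qkv` signature in mlc-llm may differ.

```python
import numpy as np

# Illustrative llama-7b-like shapes (assumptions, not taken from the PR).
seq_len, num_q_heads, num_kv_heads, head_dim = 8, 32, 32, 128
fused_width = (num_q_heads + 2 * num_kv_heads) * head_dim

# One projection produces Q, K, and V packed along the feature axis.
qkv = np.random.rand(seq_len, fused_width).astype("float16")

# What the fused-QKV attention op does internally before appending K/V
# into the paged cache and computing attention:
q, k, v = np.split(
    qkv,
    [num_q_heads * head_dim, (num_q_heads + num_kv_heads) * head_dim],
    axis=1,
)
q = q.reshape(seq_len, num_q_heads, head_dim)
k = k.reshape(seq_len, num_kv_heads, head_dim)
v = v.reshape(seq_len, num_kv_heads, head_dim)
```

Handing the fused tensor to the cache in one call avoids splitting Q/K/V in the model code and keeps the KV append and attention behind a single cache call.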

cyx-6 added 2 commits on January 27, 2024.
@MasterJH5574 merged commit 370dcbf into mlc-ai:main on Feb 2, 2024.
1 check passed
@tqchen (Contributor) commented on Feb 2, 2024

cc @CharlieFRuan on the WebLLM runtime update

@CharlieFRuan (Contributor) commented on Feb 3, 2024

This PR breaks llama on ROCm (single GPU):

Compiling and running llama-2-7b-q4f16_1 runs into:

  File "/home/cfruan/tvm-unity/src/runtime/relax_vm/paged_kv_cache.cc", line 725
TVMError: Check failed: total_seq_length == qkv_data->shape[0] (68 vs. 38) :

Edit: the above is fixed by #1710 and is not ROCm-specific, but ROCm still observes an accuracy issue.

Compiling and running llama-2-7b-q4f32_1 runs into Segmentation fault (core dumped) with no extra info. (Edit: this is because the TIR attention kernel does not support fp32 yet.)

Confirmed that no issues are observed when rebased onto #1627.

Update: all issues are fixed now.
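
For context on the first error: the failing check asserts that the fused QKV tensor handed to the cache has exactly one row per token being processed, summed over all sequences in the call. The snippet below is a Python paraphrase of that invariant, not the actual C++ code in paged_kv_cache.cc; the tensor width is illustrative.

```python
def check_append_shapes(seq_lengths, qkv_shape) -> None:
    """Mirror of the shape invariant: rows of fused QKV == total tokens appended."""
    total_seq_length = sum(seq_lengths)
    assert total_seq_length == qkv_shape[0], (
        "Check failed: total_seq_length == qkv_data->shape[0] "
        f"({total_seq_length} vs. {qkv_shape[0]})"
    )

try:
    # A mismatch like the one reported above: 68 tokens claimed, 38 rows supplied.
    check_append_shapes([68], (38, 10240))
except AssertionError as err:
    print(err)
```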

CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request on Feb 13, 2024
PagedKVCache was introduced in MLC-LLM a while back to unify the
KVCache interface. This PR makes WebLLM compatible with the new
PagedKVCache interface, encapsulating it so that WebLLM users
will not notice any difference.

This PR is equivalent to the changes to `llm_chat.cc` in
mlc-ai/mlc-llm#1651, and should address issues
like mlc-ai/mlc-llm#1628.

There are still existing model compilation issues regarding
`workgroup_size` (since WebGPU, unlike most other backends, supports
only 256 threads per workgroup). We will address this issue more
elegantly soon; for now, compiling llama-based models requires manually
changing kernel sizes as shown in [this
branch](https://github.com/CharlieFRuan/mlc-llm/tree/local-workgroupSize-webLLM-kvCache).

This PR is also largely dependent on
apache/tvm#16554.
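
The `workgroup_size` constraint mentioned in the commit message above is easy to check for a candidate launch configuration. The helper below is purely illustrative and not part of web-llm or TVM; it only encodes the 256-thread limit referenced in the note.

```python
import math

WEBGPU_MAX_THREADS_PER_WORKGROUP = 256  # limit referenced in the commit message

def fits_webgpu_workgroup(block_dims) -> bool:
    """True if a thread-block shape stays within WebGPU's workgroup limit."""
    return math.prod(block_dims) <= WEBGPU_MAX_THREADS_PER_WORKGROUP

print(fits_webgpu_workgroup((16, 16, 1)))  # True:  256 threads
print(fits_webgpu_workgroup((32, 32, 1)))  # False: 1024 threads, must be re-tiled
```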