Support paged kv cache for single batch chat module #1651
Conversation
cc @MasterJH5574. It would be great to have two things.
This PR adds support for the paged KV cache in the single-batch chat module.
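For context, here is a minimal sketch of how a single-batch chat module can drive a paged KV cache: one sequence is registered up front, and every prefill/decode step is bracketed by begin/end-forward calls on that sequence. This is Python pseudocode with hypothetical names (`create_paged_kv_cache`, `add_sequence`, `begin_forward`, `end_forward`, `popn`); the actual change in this PR lives in C++ in `llm_chat.cc`, and none of the identifiers below are taken from it.

```python
# Illustrative sketch only; all method names are hypothetical stand-ins for the
# PagedKVCache interface used by the chat module (the real code is C++).

class SingleBatchChat:
    """Drives a paged KV cache with exactly one sequence (seq_id = 0)."""

    SEQ_ID = 0

    def __init__(self, model, max_total_seq_len: int, page_size: int = 16):
        self.model = model
        # One cache shared by all layers; capacity is expressed in pages.
        self.kv_cache = model.create_paged_kv_cache(
            max_num_sequence=1,
            max_total_seq_len=max_total_seq_len,
            page_size=page_size,
        )
        self.kv_cache.add_sequence(self.SEQ_ID)

    def prefill(self, token_ids):
        # Reserve room for all prompt tokens, run one forward pass, then commit.
        self.kv_cache.begin_forward(self.SEQ_ID, append_length=len(token_ids))
        logits = self.model.forward(token_ids, self.kv_cache)
        self.kv_cache.end_forward()
        return logits

    def decode(self, last_token_id):
        # Single-token step: append exactly one new entry to the sequence.
        self.kv_cache.begin_forward(self.SEQ_ID, append_length=1)
        logits = self.model.forward([last_token_id], self.kv_cache)
        self.kv_cache.end_forward()
        return logits

    def rollback(self, num_tokens: int):
        # Drop the most recent entries, e.g. when rolling back the context.
        self.kv_cache.popn(self.SEQ_ID, num_tokens)
```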
Force-pushed from a74c545 to 6a2cb33.
cc @CharlieFRuan on the WebLLM runtime update.
This PR breaks llama on ROCm (single GPU) when compiling and running.
Edit: the above is fixed by #1710 and was not specific to ROCm, but ROCm still observes an accuracy issue when compiling and running. Confirmed that no issue is observed if rebased onto #1627. Update: all issues are fixed now.
PagedKVCache was introduced in MLC-LLM a while back to unify the interface for the KV cache. This PR makes WebLLM compatible with the new PagedKVCache interface, encapsulating it so that WebLLM users will not notice any difference. It is equivalent to the changes to `llm_chat.cc` in mlc-ai/mlc-llm#1651 and should address issues like mlc-ai/mlc-llm#1628. There are still model compilation issues regarding `workgroup_size`, since WebGPU, unlike most other backends, supports only 256 threads per workgroup. We will address this more elegantly soon; for now, compiling llama-based models requires manually changing kernel sizes as shown in [this branch](https://github.com/CharlieFRuan/mlc-llm/tree/local-workgroupSize-webLLM-kvCache). This PR also depends heavily on apache/tvm#16554.
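To make the `workgroup_size` constraint concrete: WebGPU caps threads per workgroup at 256 (the spec's default `maxComputeInvocationsPerWorkgroup`), so a kernel tuned for, say, 1024 threads has to be reshaped before it can run under WebLLM. The helper below is a toy illustration of that rescaling, not the actual MLC-LLM/TVM change referenced above.

```python
# Toy illustration of fitting a workgroup under WebGPU's thread cap.
# Not the actual kernel-size change from the linked branch.

WEBGPU_MAX_THREADS = 256  # default maxComputeInvocationsPerWorkgroup in the WebGPU spec

def clamp_workgroup(dims, cap=WEBGPU_MAX_THREADS):
    """Shrink workgroup dims (largest axis first) until x * y * z <= cap."""
    dims = list(dims)
    while dims[0] * dims[1] * dims[2] > cap:
        # Halve the largest dimension; assumes power-of-two sizes.
        i = max(range(3), key=lambda k: dims[k])
        if dims[i] == 1:
            raise ValueError("cannot fit workgroup under the thread cap")
        dims[i] //= 2
    return tuple(dims)

# Example: a kernel written for (32, 32, 1) = 1024 threads becomes (16, 16, 1) = 256.
print(clamp_workgroup((32, 32, 1)))
```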