
[WebGPU] Update all wasms to reflect PagedKVCache and ApplyPresenceAndRequencyPenalty #90

Merged (1 commit) on Feb 15, 2024

Conversation

CharlieFRuan
Contributor

There are two main changes included in this PR's WASMs:

  • To support the new GenerationConfig, we introduced ApplyPresenceAndFrequencyPenalty() in tvmjs, which required rebuilding all WASMs (see the sketch after this list).
    • This change is not breaking: the new WASMs still work even if the tvmjs in the runtime is not up to date.
  • This PR updates the Llama variants (both Llama and TinyLlama) to use PagedKVCache, a recent change in mlc-llm: Support paged kv cache for single batch chat module mlc-llm#1651.
    • This change is breaking: users must update their WebLLM npm, since older npm versions cannot handle these WASMs.
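For reference, these penalties follow the OpenAI-style definition: a flat presence term plus a count-scaled frequency term is subtracted from the logits of every previously generated token. Below is a minimal illustrative sketch of that logic, not the actual tvmjs implementation:

```typescript
// Illustrative sketch of OpenAI-style presence/frequency penalties;
// not the actual tvmjs ApplyPresenceAndFrequencyPenalty implementation.
function applyPenaltySketch(
  logits: Float32Array,
  tokenCounts: Map<number, number>, // tokenId -> occurrences in the output so far
  presencePenalty: number,
  frequencyPenalty: number,
): void {
  for (const [tokenId, count] of tokenCounts) {
    // Presence penalty is flat per seen token; frequency penalty scales with count.
    logits[tokenId] -= presencePenalty + count * frequencyPenalty;
  }
}
```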

@CharlieFRuan CharlieFRuan marked this pull request as draft February 15, 2024 06:45
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request Feb 15, 2024
This PR adds `GenerationConfig`, which allows per-generation configs.
See `get-started.ts` for an example of its usage:

```typescript
// Excerpt from get-started.ts; `chat` and `generateProgressCallback` are
// defined earlier in that example.
const genConfig: webllm.GenerationConfig = {
  presence_penalty: 0.5,
  frequency_penalty: 0.5,
  max_gen_len: 20,
  // stop: ["is", "Canada"]  // for demonstration purposes
};

const prompt0 = "What is the capital of Canada?";
const reply0 = await chat.generate(prompt0, generateProgressCallback, 1, genConfig);
```

In addition to the existing fields in `mlc-chat-config.json`, we also
support the OpenAI-like fields `frequency_penalty`, `presence_penalty`, and
`stop` to prepare for the upcoming OpenAI-like APIs.
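For illustration, the shape implied by this description looks roughly like the following; the exact field list is partly an assumption, not the library's authoritative definition:

```typescript
// Rough sketch of GenerationConfig as described in this PR (field list is
// partly assumed; see the actual web-llm source for the real interface).
interface GenerationConfig {
  // Fields mirroring mlc-chat-config.json:
  temperature?: number;
  repetition_penalty?: number;
  top_p?: number; // assumed, by analogy with mlc-chat-config.json
  max_gen_len?: number;
  // OpenAI-like fields added by this PR:
  presence_penalty?: number;
  frequency_penalty?: number;
  stop?: string[];
}
```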

This PR also sets up unit tests; use `npm test` to run them. However,
more work is needed to support end-to-end testing (e.g., accessing
WebGPU in a test environment).
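As a sketch of what a WebGPU-free unit test can look like (a hypothetical test, not taken from this PR's suite), per-generation overrides can be checked with plain object merging:

```typescript
// Hypothetical Jest test: checks that per-generation fields override
// chat-config defaults. Uses plain object spread, not web-llm internals.
import { describe, expect, test } from "@jest/globals";

describe("GenerationConfig", () => {
  test("per-generation fields override chat config defaults", () => {
    const chatDefaults = { temperature: 0.7, presence_penalty: 0.0 };
    const genConfig = { presence_penalty: 0.5 };
    const merged = { ...chatDefaults, ...genConfig };
    expect(merged.presence_penalty).toBe(0.5);
    expect(merged.temperature).toBe(0.7);
  });
});
```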

All prebuilt WASMs are updated correspondingly in
mlc-ai/binary-mlc-llm-libs#90, since we introduced a
new API in tvmjs's `runtime.ts` via
apache/tvm#16504.

Note that the update of Llama WASMs is breaking in the sense that users
will have to update their WebLLM npm.
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request Feb 15, 2024
The new version includes two main changes:
1. We now support models compiled with `PagedKVCache` (only
Llama variants for now); a conceptual sketch follows this list
- This is breaking in the sense that using the updated Llama
WASMs requires updating the WebLLM npm
- WASMs updated here:
mlc-ai/binary-mlc-llm-libs#90
  - For more, see #293
2. We now support `GenerationConfig`, allowing each generation to be
configured individually (e.g., repetition penalty, temperature)
- All WASMs needed to be recompiled since we added the new function
`ApplyPresenceAndFrequencyPenalty()` to tvmjs
  - For more, see #298
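As referenced in item 1, here is a conceptual sketch of the bookkeeping behind a paged KV cache: the cache is split into fixed-size pages, and each sequence maps to a list of pages rather than one contiguous buffer. This is illustrative only; the real `PagedKVCache` is implemented in the TVM runtime and differs in detail:

```typescript
// Conceptual sketch of paged KV-cache bookkeeping (illustrative only;
// the real PagedKVCache is implemented in the TVM runtime).
const PAGE_SIZE = 16; // tokens per page (assumed block size)

class PagedKVCacheSketch {
  private freePages: number[] = [];
  private pageTable = new Map<number, number[]>(); // seqId -> page ids

  constructor(totalPages: number) {
    for (let i = 0; i < totalPages; i++) this.freePages.push(i);
  }

  // Reserve a slot for the token at position tokenIndex of a sequence,
  // allocating a fresh page whenever a page boundary is crossed.
  appendToken(seqId: number, tokenIndex: number): { page: number; slot: number } {
    const pages = this.pageTable.get(seqId) ?? [];
    if (tokenIndex % PAGE_SIZE === 0) {
      const page = this.freePages.pop();
      if (page === undefined) throw new Error("KV cache out of pages");
      pages.push(page);
      this.pageTable.set(seqId, pages);
    }
    return { page: pages[Math.floor(tokenIndex / PAGE_SIZE)], slot: tokenIndex % PAGE_SIZE };
  }

  // Release all pages of a finished sequence back to the free pool.
  freeSequence(seqId: number): void {
    this.freePages.push(...(this.pageTable.get(seqId) ?? []));
    this.pageTable.delete(seqId);
  }
}
```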

Other changes include: 
- #285
@CharlieFRuan CharlieFRuan marked this pull request as ready for review February 15, 2024 07:04
@CharlieFRuan CharlieFRuan merged commit b985c7d into mlc-ai:main Feb 15, 2024