
[WebGPU] Update all wasms to reflect PagedKVCache and ApplyPresenceAndRequencyPenalty #90

Merged (1 commit) on Feb 15, 2024

Conversation

CharlieFRuan
Contributor

There are two main changes included in this PR's WASMs:

  • To support the new GenerationConfig, we introduced ApplyPresenceAndFrequencyPenalty() in tvmjs, which required rebuilding all WASMs (see the sketch after this list).
    • This change is not breaking: the new WASMs still work even if the tvmjs in the runtime is not up to date.
  • This PR updates the Llama variants (both Llama and TinyLlama) to use PagedKVCache, a recent change in mlc-llm: Support paged kv cache for single batch chat module mlc-llm#1651.
    • This change is breaking: users must update their WebLLM npm, since older npm versions cannot handle these WASMs.
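For reference, these penalties follow the OpenAI-style definition: a flat presence term plus a count-scaled frequency term is subtracted from the logits of every previously generated token. Below is a minimal illustrative sketch of that logic, not the actual tvmjs implementation:

```typescript
// Illustrative sketch of OpenAI-style presence/frequency penalties;
// not the actual tvmjs ApplyPresenceAndFrequencyPenalty implementation.
function applyPenaltySketch(
  logits: Float32Array,
  tokenCounts: Map<number, number>, // tokenId -> occurrences in the output so far
  presencePenalty: number,
  frequencyPenalty: number,
): void {
  for (const [tokenId, count] of tokenCounts) {
    // Presence penalty is flat per seen token; frequency penalty scales with count.
    logits[tokenId] -= presencePenalty + count * frequencyPenalty;
  }
}
```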

@CharlieFRuan CharlieFRuan marked this pull request as draft February 15, 2024 06:45
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request Feb 15, 2024
This PR adds `GenerationConfig`, which allows per-generation configs.
See `get-started.ts` for an example of its usage:

```typescript
// Excerpt from get-started.ts; `chat` and `generateProgressCallback` are
// defined earlier in that example.
const genConfig: webllm.GenerationConfig = {
  presence_penalty: 0.5,
  frequency_penalty: 0.5,
  max_gen_len: 20,
  // stop: ["is", "Canada"]  // for demonstration purposes
};

const prompt0 = "What is the capital of Canada?";
const reply0 = await chat.generate(prompt0, generateProgressCallback, 1, genConfig);
```

In addition to the existing fields in `mlc-chat-config.json`, we also
support the OpenAI-like fields `frequency_penalty`, `presence_penalty`, and
`stop` to prepare for the upcoming OpenAI-like APIs.
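For illustration, the shape implied by this description looks roughly like the following; the exact field list is partly an assumption, not the library's authoritative definition:

```typescript
// Rough sketch of GenerationConfig as described in this PR (field list is
// partly assumed; see the actual web-llm source for the real interface).
interface GenerationConfig {
  // Fields mirroring mlc-chat-config.json:
  temperature?: number;
  repetition_penalty?: number;
  top_p?: number; // assumed, by analogy with mlc-chat-config.json
  max_gen_len?: number;
  // OpenAI-like fields added by this PR:
  presence_penalty?: number;
  frequency_penalty?: number;
  stop?: string[];
}
```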

This PR also sets up unit tests; use `npm test` to run them. However,
more work is needed to support end-to-end testing (e.g., accessing
WebGPU in a test environment).
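As a sketch of what a WebGPU-free unit test can look like (a hypothetical test, not taken from this PR's suite), per-generation overrides can be checked with plain object merging:

```typescript
// Hypothetical Jest test: checks that per-generation fields override
// chat-config defaults. Uses plain object spread, not web-llm internals.
import { describe, expect, test } from "@jest/globals";

describe("GenerationConfig", () => {
  test("per-generation fields override chat config defaults", () => {
    const chatDefaults = { temperature: 0.7, presence_penalty: 0.0 };
    const genConfig = { presence_penalty: 0.5 };
    const merged = { ...chatDefaults, ...genConfig };
    expect(merged.presence_penalty).toBe(0.5);
    expect(merged.temperature).toBe(0.7);
  });
});
```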

All prebuilt WASMs are updated correspondingly in
mlc-ai/binary-mlc-llm-libs#90, since we introduced a
new API in tvmjs's `runtime.ts` via
apache/tvm#16504.

Note that the update of Llama WASMs is breaking in the sense that users
will have to update their WebLLM npm.
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request Feb 15, 2024
The new version includes two main changes:
1. We now support models compiled with `PagedKVCache` (only
Llama variants for now); a conceptual sketch follows this list
- This is breaking in the sense that using the updated Llama
WASMs requires updating the WebLLM npm
- WASMs updated here:
mlc-ai/binary-mlc-llm-libs#90
  - For more, see #293
2. We now support `GenerationConfig`, allowing each generation to be
configured individually (e.g., repetition penalty, temperature)
- All WASMs needed to be recompiled since we added the new function
`ApplyPresenceAndFrequencyPenalty()` to tvmjs
  - For more, see #298
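As referenced in item 1, here is a conceptual sketch of the bookkeeping behind a paged KV cache: the cache is split into fixed-size pages, and each sequence maps to a list of pages rather than one contiguous buffer. This is illustrative only; the real `PagedKVCache` is implemented in the TVM runtime and differs in detail:

```typescript
// Conceptual sketch of paged KV-cache bookkeeping (illustrative only;
// the real PagedKVCache is implemented in the TVM runtime).
const PAGE_SIZE = 16; // tokens per page (assumed block size)

class PagedKVCacheSketch {
  private freePages: number[] = [];
  private pageTable = new Map<number, number[]>(); // seqId -> page ids

  constructor(totalPages: number) {
    for (let i = 0; i < totalPages; i++) this.freePages.push(i);
  }

  // Reserve a slot for the token at position tokenIndex of a sequence,
  // allocating a fresh page whenever a page boundary is crossed.
  appendToken(seqId: number, tokenIndex: number): { page: number; slot: number } {
    const pages = this.pageTable.get(seqId) ?? [];
    if (tokenIndex % PAGE_SIZE === 0) {
      const page = this.freePages.pop();
      if (page === undefined) throw new Error("KV cache out of pages");
      pages.push(page);
      this.pageTable.set(seqId, pages);
    }
    return { page: pages[Math.floor(tokenIndex / PAGE_SIZE)], slot: tokenIndex % PAGE_SIZE };
  }

  // Release all pages of a finished sequence back to the free pool.
  freeSequence(seqId: number): void {
    this.freePages.push(...(this.pageTable.get(seqId) ?? []));
    this.pageTable.delete(seqId);
  }
}
```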

Other changes include: 
- #285
@CharlieFRuan CharlieFRuan marked this pull request as ready for review February 15, 2024 07:04
@CharlieFRuan CharlieFRuan merged commit b985c7d into mlc-ai:main Feb 15, 2024