Commit

link xgrammar
mmoskal committed Dec 23, 2024
1 parent ce184ed commit 0bcfac5
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -15,7 +15,7 @@ The sampling can be constrained by the [Low-Level Guidance library](https://gith

There is no significant startup cost for any realistic grammar size: the impact on time to first token (TTFT) is not measurable. The overhead on generation speed (median time between tokens (TBT)) is typically 1-3% (and comes mostly from applying masking kernels on the GPU). The mask computation takes on the order of 100 µs of single-core CPU time per token per sequence in the batch. Thus, with 16 cores and a TBT of around 10 ms, batch sizes of up to 1600 are not CPU-bound. Typically, the unconstrained TBT is higher at such batch sizes, and more cores are available, so batch size is not a problem in production.
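
The batch-size figure above follows from simple arithmetic, sketched below. The function name and parameters are hypothetical and only illustrate the calculation; they are not part of the library. The sketch assumes mask computation parallelizes perfectly across cores and ignores any other CPU work.

```python
def max_batch_before_cpu_bound(cores: int, tbt_ms: float, mask_us: float) -> int:
    """Largest batch whose mask computation fits within one TBT window.

    Hypothetical helper: assumes perfect parallelism across cores and
    that mask computation is the only CPU work per token.
    """
    # Total core-microseconds available during one time-between-tokens window.
    cpu_budget_us = cores * tbt_ms * 1000.0
    # Each sequence in the batch needs mask_us of single-core time per token.
    return int(cpu_budget_us // mask_us)


# Figures quoted in the text: 16 cores, TBT of about 10 ms, about 100 us per mask.
print(max_batch_before_cpu_bound(cores=16, tbt_ms=10, mask_us=100))  # 1600
```

With a longer TBT or more cores the budget grows proportionally, which is why the text argues that batch size is not a practical limit.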

-This approach differs from [Outlines](https://github.com/dottxt-ai/outlines) (which pre-computes masks, resulting in a startup cost and limits on schema complexity) and is more similar in spirit to [llama.cpp grammars](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md), though it is much faster due to the use of a custom lexer with [derivative-based regexes](https://github.com/microsoft/derivre), an Earley parser, and a [highly optimized](https://github.com/microsoft/llguidance/blob/main/docs/toktrie.md) token prefix tree.
+This approach differs from [Outlines](https://github.com/dottxt-ai/outlines) and [XGrammar](https://github.com/mlc-ai/xgrammar) (which both pre-compute masks, resulting in a startup cost and limits on schema complexity) and is more similar in spirit to [llama.cpp grammars](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md), though it is much faster due to the use of a custom lexer with [derivative-based regexes](https://github.com/microsoft/derivre), an Earley parser, and a [highly optimized](https://github.com/guidance-ai/llguidance/blob/main/docs/optimizations.md) token prefix tree.

## Requirements
