-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
grammars: 1.5x faster inference w/ complex grammars (vector reserves / reuses) #6609
Conversation
Similar to the integration tests, Reading through this PR, I'm amazed that such a simple change provides such a dramatic speedup. I still just have a hard time believing it's as effective as it is. :) |
Done, thanks!
Yeah I tried half a dozen similar rewrites and only these lucky two struck a chord :-D (let's hope for much more dramatic speedups w/ upcoming changes #4218 (comment)) |
llama.cpp
Outdated
for (auto it = code_points.begin(), end = code_points.end() - 1; it != end; ++it) { | ||
grammar->stacks = llama_grammar_accept(grammar->rules, grammar->stacks, *it); | ||
llama_grammar_accept(grammar->rules, grammar->stacks, *it, tmp_new_stacks); | ||
tmp_new_stacks.swap(grammar->stacks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this better than saying grammar->stacks = tmp_new_stacks;
? Because new_stacks
is .clear()
'd on 11921, it seems like we don't need to save its value here, and we could save a small step (?).
Mainly though, the recursive nature of the swap here was making my eyes cross when trying to follow exactly what this change was doing and how the contents of grammar->stacks
and tmp_new_stacks
were ping-ponging back and forth in this loop, so getting rid of the .swap()
might make it a bit easier to read as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I tried making this change (into a local grammar-speedup4
branch), and it didn't significantly improve things, but it wasn't slower, and I think the code is a bit more readable:
Benchmark 1: ./main \
-mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
--grammar-file json_numbers.grammar \
-p "List of 20 integers starting from 0" \
--seed 12344 (branch = grammar-speedup4)
Time (mean ± σ): 12.586 s ± 0.698 s [User: 8.488 s, System: 1.799 s]
Range (min … max): 12.012 s … 13.726 s 5 runs
Benchmark 2: ./main \
-mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
--grammar-file json_numbers.grammar \
-p "List of 20 integers starting from 0" \
--seed 12344 (branch = grammar-speedup3)
Time (mean ± σ): 12.904 s ± 0.854 s [User: 8.583 s, System: 1.954 s]
Range (min … max): 11.846 s … 13.963 s 5 runs
Summary
./main \
-mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
--grammar-file json_numbers.grammar \
-p "List of 20 integers starting from 0" \
--seed 12344 (branch = grammar-speedup4) ran
1.03 ± 0.09 times faster than ./main \
-mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
--grammar-file json_numbers.grammar \
-p "List of 20 integers starting from 0" \
--seed 12344 (branch = grammar-speedup3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, turns out my eyes-crossing swap wasn't even making things faster, removed it / looks simpler thanks.
FWIW, I've independently confirmed the (rather dramatic) speedup results of 1.71x on my system (wow!):
Macbook Pro, Apple M1 Pro, 32 GB of RAM, and about a billion open Firefox tabs. Really awesome work, @ochafik ! |
Other than my minor suggestion re: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks good to me!
…eserves / reuses) (ggerganov#6609)" This reverts commit cbaadc9.
…/ reuses) (ggerganov#6609) * grammars: reserve rejects & next candidates * grammars: reuse new_stacks * grammars: fix missing sig change in llama.h * grammars: fix test (api changed) * grammars: update gbnf-validator.cpp * grammars: simpler syntax (no swap)
See discussion in #4218 (comment)
Here's simple code to repro 1.6x faster inference speedup (on Metal) w/ a nested repetition-heavy grammar from #6555 (for JSON schema
{"items": {"type": "number"}, "maxItems": 100}
):Show results
cc/ @HanClinto