
Support batched generation for llama.cpp #102

Closed
the-crypt-keeper opened this issue Oct 11, 2023 · 2 comments
Labels: enhancement (New feature or request)

Comments

@the-crypt-keeper (Owner) commented Oct 11, 2023

Batch generation has landed: ggerganov/llama.cpp#3228

This should make our test suite ~10x faster on GGUF models.
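For context, the gain would come from decoding all of the interview prompts in one batch instead of looping over them one at a time, so each forward pass is amortized across every sequence. A minimal sketch of the idea, using llama-cpp-python for the sequential path we have today and a hypothetical `generate_batched` helper (made up for illustration, it does not exist) for the batched path:

```python
# Sketch only: `generate_batched` is a hypothetical stand-in for whatever
# eventually wraps llama.cpp's batch decoding; it is not an existing function.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=2048)

prompts = [f"Write a Python function for test case {i}." for i in range(8)]

# Today: one completion at a time, so 8 prompts cost roughly 8 full generations.
sequential = [
    llm(p, max_tokens=256)["choices"][0]["text"]
    for p in prompts
]

# Goal: submit all prompts together so each decode step advances all 8
# sequences in a single forward pass (what ggerganov/llama.cpp#3228 enables).
# batched = generate_batched(llm, prompts, max_tokens=256)
```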

the-crypt-keeper added the enhancement label on Oct 11, 2023
@the-crypt-keeper (Owner, Author) commented Oct 13, 2023

It looks like the example parallel.cpp has three problems for our use case (a sketch of the interface we'd need instead follows below):

  1. The input template is hard-coded.

  2. The reverse prompt is hard-coded (maybe less of an issue).

  3. A system prompt is evaluated separately from the other prompts, and this cannot be disabled.
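A rough sketch, in Python with made-up names, of what a batched backend would need to accept for the test suite; nothing here is existing code, it just inverts the three points above:

```python
# Hypothetical interface sketch: prompts arrive already templated and must be
# passed verbatim, stop strings are per-request, and there is no shared system
# prompt. All names below are invented for illustration.
from dataclasses import dataclass, field


@dataclass
class BatchRequest:
    prompt: str                                     # already templated upstream
    stop: list[str] = field(default_factory=list)   # per-request reverse prompt(s)
    max_tokens: int = 512


def run_batch(requests: list[BatchRequest]) -> list[str]:
    """Decode all requests in one batch.

    Crucially, each sequence's context is exactly its own `prompt` and nothing
    more: no shared system prompt is evaluated up front.
    """
    raise NotImplementedError("placeholder: would call the llama.cpp batch API")
```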

the-crypt-keeper changed the title from "Support batched completions for llama.cpp" to "Support batched generation for llama.cpp" on Nov 3, 2023
@the-crypt-keeper (Owner, Author) commented

It turns out that the performance of llama.cpp batched generation on CUDA devices is pretty awful: the top speed-up is ~2x on an A6000. Not going to bother for so little gain.
