This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Session caching CLI #41

Merged
philpax merged 7 commits into rustformers:main from session-caching-cli on Mar 26, 2023

Conversation

@philpax (Collaborator) commented Mar 18, 2023

Fixes #38.

I've had to make a few slightly controversial changes here:

  • InferenceSession now stores all tokens that have been processed, not just the last N tokens
  • inference_with_prompt now plays back all tokens that have been processed (see the sketch after this list)
  • I've simplified the loop in inference_with_prompt (I realised that the while condition was unnecessary since we can just return the error anyway)
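
Rough sketch of what the first two points mean in practice. This is illustrative only, not the actual llama-rs API: `TokenId`, `Model::evaluate`, and the struct fields are stand-ins, and the real `InferenceSession` also carries the ggml memory tensors.

```rust
// Sketch only: the session keeps every token it has processed, and
// inference_with_prompt replays that history before feeding the new prompt,
// so a session reloaded from disk resumes exactly where it left off.

type TokenId = u32;

#[derive(Default)]
struct InferenceSession {
    /// Every token processed so far, not just the last n_ctx tokens.
    tokens: Vec<TokenId>,
}

struct Model;

impl Model {
    fn evaluate(&self, _session: &mut InferenceSession, _token: TokenId) {
        // A real implementation runs the transformer forward pass here,
        // updating the key/value memory stored in the session.
    }
}

fn inference_with_prompt(model: &Model, session: &mut InferenceSession, prompt: &[TokenId]) {
    // Replay the tokens the session has already seen (a no-op for a fresh one).
    let history = session.tokens.clone();
    for &t in &history {
        model.evaluate(session, t);
    }
    // Then feed the new prompt, recording it so the next save includes it.
    for &t in prompt {
        model.evaluate(session, t);
        session.tokens.push(t);
    }
}

fn main() {
    let model = Model;

    // First run: fresh session.
    let mut session = InferenceSession::default();
    inference_with_prompt(&model, &mut session, &[1, 2, 3]);

    // Second run: pretend this session was just loaded from disk; the
    // replay loop restores the model state before the new prompt is fed.
    inference_with_prompt(&model, &mut session, &[4, 5]);
    assert_eq!(session.tokens, vec![1, 2, 3, 4, 5]);
}
```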

On the plus side, it basically works as you'd expect:

> cargo run --release -- --model-path ggml-alpaca-7b-q4.bin -t 8 -p "This is why I love open-source software: " -n 16 --persist-session test.llama
# [...]
This is why I love open-source software: 58 people in one room, all working together to make a project better.
[2023-03-18T16:51:23Z INFO  llama_cli] Successfully wrote session to "test.llama"

> cargo run --release -- --model-path ggml-alpaca-7b-q4.bin -t 8 -p " Those 58 people can change the world, together, and here's how: " -n 16 --persist-session test.llama
# [...]
[2023-03-18T16:52:01Z INFO  llama_cli] Loaded inference session from "test.llama"
This is why I love open-source software: 58 people in one room, all working together to make a project better. Those 58 people can change the world, together, and here's how: 1) they are not doing it for money (they could be);
[2023-03-18T16:52:07Z INFO  llama_cli] Successfully wrote session to "test.llama"

@setzer22 (Collaborator) left a comment


Looks good overall! Just have a couple of comments that I feel are worth discussing

llama-cli/src/cli_args.rs (review thread, outdated, resolved)
llama-rs/src/lib.rs (review thread, resolved)
@KerfuffleV2 (Contributor) commented Mar 25, 2023

I've been messing around with adding compression to this. Weirdly, the natural approach of just wrapping the reader/writer in compression functions is unbearably slow (including loading/decompressing). However, compressing to a buffer directly is basically instant.

Something pretty weird is going on.
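
For reference, here's roughly the two shapes I mean, with flate2 standing in for the actual compressor and `serialize_session` standing in for the real snapshot writer (which issues many small writes rather than one big one); names and paths are illustrative, not llama-rs code:

```rust
use std::fs::{self, File};
use std::io::{self, Write};

use flate2::{write::GzEncoder, Compression};

/// Stand-in for the real snapshot writer: lots of small writes
/// (header fields, token ids, memory tensors) instead of one big one.
fn serialize_session<W: Write>(out: &mut W, session_bytes: &[u8]) -> io::Result<()> {
    for chunk in session_bytes.chunks(64) {
        out.write_all(chunk)?;
    }
    Ok(())
}

/// Approach 1: wrap the file in a compressing writer and serialize into it,
/// so every small write goes through the compressor and down to the file.
fn save_streaming(session_bytes: &[u8], path: &str) -> io::Result<()> {
    let mut encoder = GzEncoder::new(File::create(path)?, Compression::default());
    serialize_session(&mut encoder, session_bytes)?;
    encoder.finish()?;
    Ok(())
}

/// Approach 2: serialize the whole session into memory first, then compress
/// that single buffer and write it to disk in one shot.
fn save_buffered(session_bytes: &[u8], path: &str) -> io::Result<()> {
    let mut plain = Vec::new();
    serialize_session(&mut plain, session_bytes)?;

    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(&plain)?;
    fs::write(path, encoder.finish()?)
}

fn main() -> io::Result<()> {
    // Placeholder bytes; a real serialized session is tens of megabytes.
    let session_bytes = vec![0u8; 4 * 1024 * 1024];
    save_streaming(&session_bytes, "session-streaming.gz")?;
    save_buffered(&session_bytes, "session-buffered.gz")?;
    Ok(())
}
```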

@KerfuffleV2 (Contributor) commented:

@philpax

Pullception: https://github.com/philpax/llama-rs/pull/1

Commenting here in case you want to keep discussion in one place.

philpax requested a review from setzer22 on March 25, 2023 at 16:19
philpax mentioned this pull request on Mar 25, 2023
@setzer22 (Collaborator) left a comment

Looks good 👍 I think this one's ready to merge now

philpax merged commit 08b875c into rustformers:main on Mar 26, 2023
philpax deleted the session-caching-cli branch on March 26, 2023 at 12:07
Development

Successfully merging this pull request may close these issues: Replace prompt caching with session caching in the CLI.
3 participants