
Replace prompt caching with session caching in the CLI #38

Closed
philpax opened this issue Mar 18, 2023 · 3 comments · Fixed by #41
Labels
issue:enhancement New feature or request

Comments

@philpax
Collaborator

philpax commented Mar 18, 2023

(Will take this soon)

At present, we have --cache-prompt and --restore-prompt, but these are a little ambiguous in how they work. The former will exit immediately after saving the prompt, and the latter can actually be used with any prompt (not just the one that was saved to disk).

To better communicate what they do and to make them more general, I propose replacing them with --load-session, --save-session and --persist-session (which is an alias for loading and saving to the same path).

  • --load-session is identical to --restore-prompt in that it loads a saved inference snapshot, but it better communicates what it's doing.
  • --save-session will save the results of inference to disk, similar to --cache-prompt, but it will also include whatever was inferred, allowing you to continue on from a response. --cache-prompt PATH is equivalent to --save-session PATH -n 0. (This could be documented, or another flag could be added... but it almost feels like another "mode" to me. Should figure out how we want to do that for Add basic alpaca REPL mode #29, too.)
  • --persist-session loads a session from an existing path (if it exists) and saves to the path afterwards.

This would allow you to have ongoing conversations over an extended period of time:

llama-cli --persist-session conversation.llama -p "How do I make bread?"
...
llama-cli --persist-session conversation.llama -p "How long should I let the dough rest at room temperature?"
...
llama-cli --persist-session conversation.llama -p "Can I keep the dough in the fridge?"
@KerfuffleV2
Contributor

KerfuffleV2 commented Mar 21, 2023

In case it's helpful, I've been doing a bit of experimenting and came up with some information about the session state:

My first idea was that possibly a lot of the items in memory_k and memory_v wouldn't get changed (which would allow saving only the changed parts). However, that's probably not a worthwhile approach:

Using this prompt:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

With context length 512 (so each memory array is 268,435,456 bytes), this results in 15,674,302 changed bytes in memory_k and 15,644,933 in memory_v (compared to the initial state right after the model is loaded).

Using a chunk of the Wikipedia article on foxes:

Foxes are small to medium-sized, omnivorous mammals belonging to several genera of the family Canidae. They have a flattened skull, upright, triangular ears, a pointed, slightly upturned snout, and a long bushy tail (or brush).

Twelve species belong to the monophyletic "true foxes" group of genus Vulpes. Approximately another 25 current or extinct species are always or sometimes called foxes; these foxes are either part of the paraphyletic group of the South American foxes, or of the outlying group, which consists of the bat-eared fox, gray fox, and island fox.[1]

Foxes live on every continent except Antarctica. The most common and widespread species of fox is the red fox (Vulpes vulpes) with about 47 recognized subspecies.[2] The global distribution of foxes, together with their widespread reputation for cunning, has contributed to their prominence in popular culture and folklore in many societies around the world. The hunting of foxes with packs of hounds, long an established pursuit in Europe, especially in the British Isles, was exported by European settlers to various parts of the New World.

This results in 153,621,119 changed bytes in memory_k and 153,300,565 in memory_v.
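To make "changed" concrete: the numbers above are just a byte-level diff of each memory tensor against a snapshot taken right after the model was loaded, along these lines (illustrative sketch, not the exact code I ran):

/// Count how many bytes of a memory tensor differ from the snapshot taken
/// right after the model was loaded.
fn count_changed_bytes(initial: &[u8], after_feed: &[u8]) -> usize {
    assert_eq!(initial.len(), after_feed.len());
    initial
        .iter()
        .zip(after_feed.iter())
        .filter(|(a, b)| a != b)
        .count()
}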


Then I tried to compress the two saved states (saved normally via the command-line flag) with zstd. P1 is the first prompt, P2 is the longer one. Low compression used the command zstd --verbose --keep -T8 -1 --fast=100 --progress; max compression used the command zstd --verbose --keep -T8 --ultra -22 --progress. The input file was ~512 MiB.

edit: Added P3, the same data as P2 duplicated to reach ~511 tokens, and added a tokens column to the table.

prompt  tokens  compression  output %  output size
P1      29      low          5.96%     30.5 MiB
P1      29      high         5.45%     27.9 MiB
P2      295     low          57.36%    294 MiB
P2      295     high         52.96%    271 MiB
P3      511     low          99.02%    507 MiB
P3      511     high         91.07%    466 MiB

Based on this, compressing the data really seems worth it. Short prompts/sessions compress amazingly well, and even longer ones compress quite well. Also, the fastest, lowest possible compression makes only a small difference compared to the maximum settings.

Just running the data through the zstd crate at a low setting would make a huge difference in usability. If someone sets their context length to 2048 (about the maximum reasonable) and uses the 7B model, the session data for a prompt of a couple of words will be half the size of the entire model, which seems crazy. Ref: https://crates.io/crates/zstd
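For what it's worth, the crate-level API makes this a couple of lines. A rough sketch, assuming zstd's encode_all/decode_all; level 1 here is an assumption standing in for the "low" setting above:

use std::io;

/// Compress the raw session bytes with a fast zstd level before writing to disk.
fn compress_session(raw: &[u8]) -> io::Result<Vec<u8>> {
    zstd::encode_all(raw, 1)
}

/// Decompress a previously saved session back into its raw bytes.
fn decompress_session(compressed: &[u8]) -> io::Result<Vec<u8>> {
    zstd::decode_all(compressed)
}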

@philpax
Collaborator Author

philpax commented Mar 21, 2023

Mm, @setzer22 experimented with compressing the cache but decided to 'ship' without it. With compression ratios that high, though, I'm inclined to say we should look at it again!

@KerfuffleV2
Contributor

KerfuffleV2 commented Mar 21, 2023

@philpax I edited the table in the previous post to add a test that nearly fills the context limit (512), using a prompt of 511 tokens. In this case there was barely any compression, and even ultra was only able to reduce it by about 8-9%.

So the achievable compression seems directly related to what percentage of the context has been consumed. Someone saving a session when they only have a few tokens of context left seems like an edge case that's probably not worth worrying about. Also, zstd compression in fast mode should add only a small performance overhead; it might not even make a noticeable difference, especially when saving to a normal spinning disk.

If you really wanted, you could special-case it to save a flag in the file and disable compression if >90% of the context was used, but personally I don't think it would be worth the trouble.

Saving a fairly short prompt or context state is likely to be the common use case. For example, something like the Alpaca prompt format.

Random detail: The 511-token version resulted in 265,442,338 changed bytes in memory_k and 264,878,694 in memory_v (of 268,435,456). Not too surprising given the previous results.

edit: Also tested P3 with a context length of 2048. The initial file is 2048 MiB; compressed with zstd's fastest setting as above, it is 508 MiB. So the compressed size seems extremely predictable if you know the number of tokens in the context and the context size.
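Back-of-the-envelope, assuming compressed size ≈ (tokens used / context length) × raw file size: P1 gives 29/512 ≈ 5.7% versus the 5.96% measured, P2 gives 295/512 ≈ 57.6% versus 57.36%, and P3 at context 2048 gives 511/2048 ≈ 25% versus 508 MiB / 2048 MiB ≈ 24.8%.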
