
Replace prompt caching with session caching in the CLI #38

Closed
philpax opened this issue Mar 18, 2023 · 3 comments · Fixed by #41
Labels
issue:enhancement New feature or request

Comments

@philpax
Collaborator

philpax commented Mar 18, 2023

(Will take this soon)

At present, we have --cache-prompt and --restore-prompt, but these are a little ambiguous in how they work. The former will exit immediately after saving the prompt, and the latter can actually be used with any prompt (not just the one that was saved to disk).

To better communicate what they do and to make them more general, I propose replacing them with --load-session, --save-session and --persist-session (which is an alias for loading and saving to the same path).

  • --load-session is identical to --restore-prompt in that it loads a saved inference snapshot, but it better communicates what it's doing.
  • --save-session will save the results of inference to disk, similar to --cache-prompt, but it will also include whatever was inferred, allowing you to continue on from a response. --cache-prompt PATH is equivalent to --save-session PATH -n 0. (This could be documented, or another flag could be added... but it almost feels like another "mode" to me. Should figure out how we want to do that for Add basic alpaca REPL mode #29, too.)
  • --persist-session loads a session from an existing path (if it exists) and saves to the path afterwards.

This would allow you to have ongoing conversations over an extended period of time:

llama-cli --persist-session conversation.llama -p "How do I make bread?"
...
llama-cli --persist-session conversation.llama -p "How long should I let the dough rest at room temperature?"
...
llama-cli --persist-session conversation.llama -p "Can I keep the dough in the fridge?"
@KerfuffleV2
Contributor

KerfuffleV2 commented Mar 21, 2023

In case it's helpful, I've been doing a bit of experimenting and came up with some information about the session state:

My first idea was that possibly a lot of the items in memory_k and memory_v wouldn't get changed (which would allow saving only the changed parts). However, that's probably not a worthwhile approach:

Using this prompt:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

With context length 512 (so each memory array is 268,435,456 bytes), this results in 15,674,302 changed bytes in memory_k and 15,644,933 in memory_v (compared to the initial state right after the model is loaded).

Using a chunk of the Wikipedia article on foxes:

Foxes are small to medium-sized, omnivorous mammals belonging to several genera of the family Canidae. They have a flattened skull, upright, triangular ears, a pointed, slightly upturned snout, and a long bushy tail (or brush).

Twelve species belong to the monophyletic "true foxes" group of genus Vulpes. Approximately another 25 current or extinct species are always or sometimes called foxes; these foxes are either part of the paraphyletic group of the South American foxes, or of the outlying group, which consists of the bat-eared fox, gray fox, and island fox.[1]

Foxes live on every continent except Antarctica. The most common and widespread species of fox is the red fox (Vulpes vulpes) with about 47 recognized subspecies.[2] The global distribution of foxes, together with their widespread reputation for cunning, has contributed to their prominence in popular culture and folklore in many societies around the world. The hunting of foxes with packs of hounds, long an established pursuit in Europe, especially in the British Isles, was exported by European settlers to various parts of the New World.

This results in 153,621,119 changed bytes in memory_k and 153,300,565 in memory_v.
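To make "changed" concrete: the numbers above are just a byte-level diff of each memory tensor against a snapshot taken right after the model was loaded, along these lines (illustrative sketch, not the exact code I ran):

/// Count how many bytes of a memory tensor differ from the snapshot taken
/// right after the model was loaded.
fn count_changed_bytes(initial: &[u8], after_feed: &[u8]) -> usize {
    assert_eq!(initial.len(), after_feed.len());
    initial
        .iter()
        .zip(after_feed.iter())
        .filter(|(a, b)| a != b)
        .count()
}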


Then I tried to compress the two saved states (saved normally via the command-line flag) with zstd. P1 is the first prompt, P2 is the longer one. Low compression used the command zstd --verbose --keep -T8 -1 --fast=100 --progress; max compression used the command zstd --verbose --keep -T8 --ultra -22 --progress. The input file was ~512 MiB.

edit: Added P3, the same data as P2 duplicated to reach ~511 tokens, and added a tokens column to the table.

prompt  tokens  compression  output %  output size
P1      29      low          5.96%     30.5 MiB
P1      29      high         5.45%     27.9 MiB
P2      295     low          57.36%    294 MiB
P2      295     high         52.96%    271 MiB
P3      511     low          99.02%    507 MiB
P3      511     high         91.07%    466 MiB

Based on this, compressing the data really seems worth it. Short prompts/sessions compress amazingly well, and even longer ones compress quite well. Also, the fastest, lowest possible compression makes only a small difference compared to the maximum settings.

Just running the data through the zstd crate at a low setting would make a huge difference in usability. If someone sets their context length to 2048 (about the maximum reasonable) and uses the 7B model, the session data for a prompt of a couple of words will be half the size of the entire model, which seems crazy. Ref: https://crates.io/crates/zstd
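For what it's worth, the crate-level API makes this a couple of lines. A rough sketch, assuming zstd's encode_all/decode_all; level 1 here is an assumption standing in for the "low" setting above:

use std::io;

/// Compress the raw session bytes with a fast zstd level before writing to disk.
fn compress_session(raw: &[u8]) -> io::Result<Vec<u8>> {
    zstd::encode_all(raw, 1)
}

/// Decompress a previously saved session back into its raw bytes.
fn decompress_session(compressed: &[u8]) -> io::Result<Vec<u8>> {
    zstd::decode_all(compressed)
}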

@philpax
Collaborator Author

philpax commented Mar 21, 2023

Mm, @setzer22 experimented with compressing the cache but decided to 'ship' without it. With compression ratios that high, though, I'm inclined to say we should look at it again!

@KerfuffleV2
Contributor

KerfuffleV2 commented Mar 21, 2023

@philpax I edited the table in the previous post to add a test that nearly fills the context limit (512), using a prompt of 511 tokens. In this case there was barely any compression, and even ultra was only able to reduce it by about 8-9%.

So the achievable compression seems directly related to what percentage of the context has been consumed. Someone saving a session when they only have a few tokens of context left seems like an edge case that's probably not worth worrying about. Also, zstd compression in fast mode should add only a small performance overhead; it might not even make a noticeable difference, especially when saving to a normal spinning disk.

If you really wanted, you could special-case it to save a flag in the file and disable compression if >90% of the context was used, but personally I don't think it would be worth the trouble.

Saving a fairly short prompt or context state is likely to be the common use case. For example, something like the Alpaca prompt format.

Random detail: The 511-token version resulted in 265,442,338 changed bytes in memory_k and 264,878,694 in memory_v (of 268,435,456). Not too surprising given the previous results.

edit: Also tested P3 with a context length of 2048. The initial file is 2048 MiB; compressed with zstd's fastest setting as above, it is 508 MiB. So the compressed size seems extremely predictable if you know the number of tokens in the context and the context size.
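Back-of-the-envelope, assuming compressed size ≈ (tokens used / context length) × raw file size: P1 gives 29/512 ≈ 5.7% versus the 5.96% measured, P2 gives 295/512 ≈ 57.6% versus 57.36%, and P3 at context 2048 gives 511/2048 ≈ 25% versus 508 MiB / 2048 MiB ≈ 24.8%.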
