Replace prompt caching with session caching in the CLI #38
If it's helpful information, I've been doing a bit of experimenting and came up with some information about the session state. My first idea was that possibly a lot of the items in the memory arrays go unused for short prompts. With a context length of 512 (so the memory arrays are 268,435,456 bytes), a short prompt results in 15,674,302 bytes in use. Using a chunk of the Wikipedia article on foxes results in 153,621,119 bytes in use.

Then I tried to compress the two saved states (saved normally via the command-line flag) with zstd. P1 is the first prompt, P2 is the longer one. Low compression used zstd's fastest setting, ultra the maximum.

edit: Added P3, the same data as P2 duplicated out to ~511 tokens, and added a tokens column to the table.
Based on this, compressing the data really seems worth it. Short prompts/sessions compress amazingly well, and even longer ones compress quite well. Also, the fastest, lowest-possible compression setting only gives up a little compared to the maximum settings. Just running the data through the zstd crate at a low setting would make a huge difference in usability. If someone sets their context length to 2048 (about the maximum reasonable) and uses the 7B model, the session data for a prompt of a couple of words will be half the size of the entire model, which seems crazy. Ref: https://crates.io/crates/zstd
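In case it's useful, wiring it in could be as simple as something like this (sketch only, assuming the serialized session is already available as a byte slice; the function names and `session_bytes` are made up for illustration):

```rust
use std::io;

// Sketch: compress/decompress an already-serialized session snapshot with zstd
// at its fastest setting (level 1). A &[u8] implements Read, so the crate's
// convenience helpers can consume it directly.
fn compress_session(session_bytes: &[u8]) -> io::Result<Vec<u8>> {
    zstd::encode_all(session_bytes, 1)
}

fn decompress_session(compressed: &[u8]) -> io::Result<Vec<u8>> {
    zstd::decode_all(compressed)
}
```

For big sessions it would probably make more sense to stream through `zstd::stream::Encoder` wrapped around the output file rather than building a second in-memory buffer, but the ratios should be the same either way.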
Mm, @setzer22 experimented with compressing the cache but decided to 'ship' without it. With compression ratios that high, though, I'm inclined to say we should look at it again!
@philpax I edited the table in the previous post to add a test with the context limit (512) almost exceeded, using a prompt of 511 tokens. In this case there was barely any compression, and even ultra was only able to reduce it by about 8-9%. So the possible compression seems very directly related to what percentage of the context has been consumed.

Someone saving context when they only have a few tokens left seems like an edge case that's probably not worth worrying about. Also, zstd compression in fast mode should be a pretty small performance difference; it might not even be noticeable, especially if saving to a normal spinny storage medium. If you really wanted, you could special-case it to save a flag in the file and disable compression if >90% of the context was used, but personally I don't think it would be worth the trouble. Saving a fairly short prompt or context state is likely to be the common use case, for example something like the Alpaca prompt format.

Random detail: the 511-token version resulted in 265,442,338 bytes in use.

edit: Also tested P3 with a context length of 2048. The initial file is 2048 MiB; compressed with zstd's fastest compression as above, it is 508 MiB. So the ratio seems extremely predictable if you know the number of tokens in the context and the context size.
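To make the "predictable ratio" point concrete, here's a back-of-the-envelope estimate derived purely from the numbers quoted above (an extrapolation, not measured code):

```rust
// Assumption from the figures in this thread: the compressed session is roughly
// the "in use" slice of the memory arrays, i.e. the total memory size scaled by
// how much of the context has been consumed.
fn estimated_compressed_bytes(memory_bytes: u64, context_length: u64, tokens_used: u64) -> u64 {
    memory_bytes / context_length * tokens_used
}

fn main() {
    // P3: 511 tokens in a 2048-token context with ~2048 MiB of memory arrays.
    let est = estimated_compressed_bytes(2048 * 1024 * 1024, 2048, 511);
    println!("~{} MiB", est / (1024 * 1024)); // ~511 MiB vs. the observed ~508 MiB
}
```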
(Will take this soon)
At present, we have `--cache-prompt` and `--restore-prompt`, but these are a little ambiguous in how they work. The former will exit immediately after saving the prompt, and the latter can actually be used with any prompt (not just the one that was saved to disk).

To better communicate what they do and to make them more general, I propose replacing them with `--load-session`, `--save-session` and `--persist-session` (which is an alias for loading and saving to the same path).

`--load-session` is identical to `--restore-prompt` in that it loads a saved inference snapshot, but it better communicates what it's doing.

`--save-session` will save the results of inference to disk, similar to `--cache-prompt`, but it will also include whatever was inferred, allowing you to continue on from a response. `--cache-prompt PATH` is equivalent to `--save-session PATH -n 0`. (This could be documented, or another flag could be added... but it almost feels like another "mode" to me. Should figure out how we want to do that for Add basic alpaca REPL mode #29, too.)

`--persist-session` loads a session from an existing path (if it exists) and saves to the same path afterwards.

This would allow you to have ongoing conversations over an extended period of time:
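Something along these lines, with the binary name and prompt flag standing in for whatever the CLI actually uses:

```sh
# First run: creates the session file and saves the prompt plus the response.
llm-cli --persist-session ~/sessions/chat.session -p "Hi! Tell me about yourself."

# Later run: loads the saved session, continues from it, and writes it back.
llm-cli --persist-session ~/sessions/chat.session -p "What did I just ask you?"
```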