Conversation
Looks good overall! Just have a couple of comments that I feel are worth discussing
I've been messing around with adding compression to this. Weirdly, the natural approach of just wrapping the reader/writer in compression functions is unbearably slow (including loading/decompressing). However, compressing to a buffer directly is basically instant. Something pretty weird is going on.
Pullception: https://github.com/philpax/llama-rs/pull/1 Commenting here in case you want to keep discussion in one place.
Add zstd compression support for session loading/saving.
Looks good 👍 I think this one's ready to merge now
Fixes #38.
I've had to make a few slightly controversial changes here:

- `InferenceSession` now stores all tokens that have been processed, not just the last N tokens
- `inference_with_prompt` now plays back all tokens that have been processed
- `inference_with_prompt` (I realised that the `while` condition was unnecessary since we can just return the error anyway)

On the plus side, it basically works as you'd expect: