[User] Implement Streaming LLM - Make the inference more efficient #3440
Comments
Yup, I've already read the paper and the good news is that after #3228 this technique is trivial to support in |
They also demonstrate how memory efficient this technique is (without it, it runs OOM very early). Does that basically mean infinite context with steady and low VRAM usage? That would be revolutionary and render vectorDBs obsolete. |
This isn't infinite context the way you're reading it. It uses a novel modification to attention so that you can generate "infinite" output |
If somebody can confirm whether I'm understanding the paper right, I'd be grateful. They are proposing a solution for infinite text length, not infinite context length, right? Their observation is that the consistency of the "internal state" depends heavily on the first tokens, so naively keeping the initial tokens and applying a sliding context window to the rest lets the LLM keep its sanity intact. And the first tokens are "important" because appending tokens from one side makes the first token "exist" for N iterations, the second token for N-1 iterations, and so on, so the first tokens are "seen" by all subsequent iterations but not the other way around? |
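A minimal sketch of the eviction policy described in the comment above, assuming a plain Python list stands in for the KV cache. The name `streaming_cache` and the parameters `n_sink` and `window` are made up for illustration and are not taken from the paper's code or from llama.cpp.

```python
def streaming_cache(tokens, n_sink=4, window=8):
    """Return the tokens still cached after feeding `tokens` one at a time.

    Illustrative only: the first `n_sink` tokens (the "attention sinks")
    are never evicted, and the rest behave like a sliding window.
    """
    cache = []
    for tok in tokens:
        cache.append(tok)
        if len(cache) > n_sink + window:
            # Drop the oldest non-sink entry; the sinks at the front stay.
            del cache[n_sink]
    return cache


# Feed 20 tokens through a cache that holds 4 sinks + 8 recent tokens.
kept = streaming_cache(list(range(20)), n_sink=4, window=8)
print(kept)
```

The cache size stays fixed at `n_sink + window` no matter how many tokens are fed in, which is why memory use is flat. Everything between the sinks and the recent window is gone.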
Yes, it took me a while to understand and their newly posted FAQ clarifies it. I have edited the title. I believe it is a good technique to be part of this project. |
True. This one however... https://github.com/tomaarsen/attention_sinks is potentially relevant. It's about the input sequence length, so the context this time. Look at that steady memory usage! |
The key point is the input sequence length can be very long but the context the model is considering stays constant. So you could feed it a book, have it write a book worth of content but it won't "remember" or take into account what was in the sequence 4,096 tokens ago or whatever. |
The VRAM usage is impressive. |
I think we need Assuming that the llama.cpp shifting of the k-cache matches the behavior described in the paper, where the keys are rotated according to their position in the cache rather than the text (which I believe they are but not certain) |
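To make the "position in the cache rather than the text" point concrete, here is a small sketch. The helper `rope_positions` is hypothetical. It shows only the index bookkeeping, not an actual rotary embedding.

```python
def rope_positions(kept_text_positions):
    """Positions fed to the rotary embedding under the assumed scheme:
    the slot index inside the cache, not the original index in the text."""
    return list(range(len(kept_text_positions)))


# Tokens still cached, identified by their original text index:
# 4 sink tokens plus the 8 most recent tokens.
kept = [0, 1, 2, 3] + list(range(12, 20))
pos = rope_positions(kept)

text_gap = kept[-1] - kept[0]   # distance measured in the text
cache_gap = pos[-1] - pos[0]    # distance the model actually sees
print(pos, text_gap, cache_gap)
```

Under this assumption the largest relative distance the model sees is bounded by the cache size, however long the text grows.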
That's correct, although I don't think one really needs to shift each new token. The benefit would be marginal. |
I think the biggest impact this would give is to remove the expensive reevaluation that is currently being done but the effect is the same as This is the RoPE handling part: modify_llama.py. I need time to understand it... |
There is no longer expensive re-evaluation in |
Just wondering, does this give us the option to choose where the sliding window begins? e.g. I have a prompt template as seen here:
Could I anchor this portion:
And have the kv cache shift only over the chat portion? Or am I just misunderstanding things? |
Yes - count the tokens in that portion and set |
And is n_keep configurable during inference time? One of the features I was planning on, was integrating an ensemble LLM which can modify the prompt template at specific points during inference. E.g. to change the current task, or change the system prompt to align with the current problem that is being worked on in the response, and then resume inference for example. So the number of tokens in that window may change. |
You can modify it - the API is very flexible. Though to achieve your goal, it would take more than just updating |
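A rough sketch of what anchoring a prompt prefix with `n_keep` could look like. The function `shift_context` and the rule of discarding half of the non-anchored tokens when the context fills up are assumptions for illustration, not a copy of the project's code.

```python
def shift_context(tokens, n_keep, n_ctx):
    """Assumed policy: when the context is full, keep the first `n_keep`
    tokens and drop the oldest half of the remaining ones."""
    if len(tokens) < n_ctx:
        return tokens  # still room, nothing to discard
    n_left = len(tokens) - n_keep
    n_discard = n_left // 2
    # The anchored prefix survives; the window slides over the rest.
    return tokens[:n_keep] + tokens[n_keep + n_discard:]


# 4 anchored "system prompt" tokens followed by 12 chat tokens, context full.
shifted = shift_context(list(range(16)), n_keep=4, n_ctx=16)
print(shifted)
```

If the anchored portion changes length during inference, `n_keep` has to be recomputed from the new token count before the next shift.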
This issue was closed because it has been inactive for 14 days since being marked as stale. |
@ggerganov Is it possible to use StreamingLLM / Windowed Attention with Attention Sinks method as defined in paper via main, currently? If so - how? Something like |
The discussion above is still valid, you can use the Line 1386 in 4399f13
The llama.cpp/examples/main/main.cpp Line 554 in 4399f13
As discussed earlier, it is likely required to set
You might be missing that even though the PPL is low, it does not mean that the entire information from the processed context is "visible". The model will still "see" just the last |
Prerequisites
The context length limit is an issue on all LLMs. The following repository and associated paper demonstrate that keeping the 4 initial tokens enables an infinite context length on most common LLMs without sacrificing performance or efficiency.
Code : https://github.com/mit-han-lab/streaming-llm
The paper, referenced inside the repo, demonstrates the attention-sink effect of LLMs and how to take advantage of it.
Current Behavior
There is a limit on context length defined mostly by pre-training. Other approaches like RoPE or sliding window have their pros and cons, but none of them can get to a higher context length than this approach.