[User] Implement Streaming LLM - Make the inference more efficient #3440

Closed
errorsandwarnings opened this issue Oct 2, 2023 · 19 comments

@errorsandwarnings

Prerequisites

Context length limit is an issue on all LLMs. The following repository and its associated paper demonstrate that keeping the 4 initial tokens enables an infinite context length on most common LLMs without sacrificing performance or efficiency.

Code : https://github.com/mit-han-lab/streaming-llm

Paper reference inside the repo which demonstrates the attention-sink effect of LLMs and how to take advantage of it.

Current Behavior

There is a limit on context length, defined mostly by pre-training. Other approaches like RoPE or sliding window have their pros and cons, and none of them can reach a higher context length than this approach.

@ggerganov
Owner

Yup, I've already read the paper and the good news is that after #3228 this technique is trivial to support in llama.cpp.
It's technically already implemented by setting n_keep == 4 in main.

@Dampfinchen

Dampfinchen commented Oct 2, 2023

They also demonstrate how memory efficient this technique is (without it, it runs OOM very early). Does that basically mean infinite context with steady and low VRAM usage? That would be revolutionary and render vectorDBs obsolete.

@nathanodle

They also demonstrate how memory efficient this technique is (without it, it runs OOM very early). Does that basically mean infinite context with steady and low VRAM usage? That would be revolutionary and render vectorDBs obsolete.

This isn't infinite context the way you're reading it. It uses a novel modification to attention so that you can generate "infinite" output

@staviq
Contributor

staviq commented Oct 2, 2023

If somebody can confirm if I'm understanding that paper right, I'd be grateful:

They are proposing a solution for infinite text length, not infinite context length, right? Their observation is that the consistency of the "internal state" depends heavily on the first tokens, so naively keeping the initial tokens and applying a sliding context window to the rest lets the LLM keep its sanity intact. And the first tokens are "important" because tokens are appended on one side only: the first token "exists" for N iterations, the second for N-1 iterations, and so on, so the first tokens are "seen" by all subsequent iterations but not the other way around?

@errorsandwarnings errorsandwarnings changed the title [User] Implement Steaming LLM - Let's remove limit on context length [User] Implement Steaming LLM - Make the inference more efficient Oct 2, 2023
@errorsandwarnings
Author

They also demonstrate how memory efficient this technique is (without it, it runs OOM very early). Does that basically mean infinite context with steady and low VRAM usage? That would be revolutionary and render vectorDBs obsolete.

This isn't infinite context the way you're reading it. It uses a novel modification to attention so that you can generate "infinite" output

Yes, it took me a while to understand and their newly posted FAQ clarifies it. I have edited the title. I believe it is a good technique to be part of this project.

@Dampfinchen

Dampfinchen commented Oct 3, 2023

True. This one however...

https://github.com/tomaarsen/attention_sinks

is potentially relevant. It's about the input sequence length, so the context this time. Look at that steady memory usage!
[Image: memory usage plot]

@KerfuffleV2
Collaborator

It's about the input sequence length, so the context this time. Look at that steady memory usage!

The key point is the input sequence length can be very long but the context the model is considering stays constant. So you could feed it a book, have it write a book worth of content but it won't "remember" or take into account what was in the sequence 4,096 tokens ago or whatever.
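
To make that concrete, here is a rough sketch of which tokens are still in the cache after a long stream. It is a toy model, not llama.cpp code: the function name `visible_tokens` and its parameters are made up for illustration, and it assumes the steady state where the first `n_sink` tokens are pinned and the remaining slots always hold the most recent tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of llama.cpp).
// Returns the text positions of the tokens still held in the cache after
// `total` tokens have been streamed, assuming the first `n_sink` tokens are
// always kept and the other slots hold the most recent tokens.
// `n_ctx` is the total cache capacity (sink tokens + sliding window).
std::vector<std::size_t> visible_tokens(std::size_t total, std::size_t n_sink, std::size_t n_ctx) {
    assert(n_sink <= n_ctx);
    std::vector<std::size_t> kept;
    if (total <= n_ctx) {
        // Nothing has been evicted yet: every token is still visible.
        for (std::size_t i = 0; i < total; ++i) kept.push_back(i);
        return kept;
    }
    // The pinned "sink" tokens at the start of the text.
    for (std::size_t i = 0; i < n_sink; ++i) kept.push_back(i);
    // The most recent tokens fill the rest of the cache.
    const std::size_t window = n_ctx - n_sink;
    for (std::size_t i = total - window; i < total; ++i) kept.push_back(i);
    return kept;
}
```

With `visible_tokens(100, 4, 16)` the result is tokens 0 to 3 plus tokens 88 to 99. Everything in between has been dropped, which is why the model cannot "remember" it however long the input was.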

@errorsandwarnings
Author

The VRAM usage is impressive.

@phillip-kravtsov
Contributor

I think we need n_keep=4 but also n_discard=1 here to correctly implement StreamingLLM: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L504.

This assumes that llama.cpp's shifting of the K cache matches the behavior described in the paper, where the keys are rotated according to their position in the cache rather than in the text (I believe it does, but I'm not certain).

@ggerganov
Owner

I think we need n_keep=4 but also n_discard=1 here to correctly implement StreamingLLM:

That's correct, although I don't think one really needs to shift each new token. The benefit would be marginal.

@SlyEcho SlyEcho changed the title [User] Implement Steaming LLM - Make the inference more efficient [User] Implement Streaming LLM - Make the inference more efficient Oct 4, 2023
@SlyEcho
Collaborator

SlyEcho commented Oct 4, 2023

I think the biggest benefit would be removing the expensive re-evaluation that is currently being done, but the effect is the same as --n-keep 4.

This is the RoPE handling part: modify_llama.py. I need time to understand it...

@ggerganov
Owner

ggerganov commented Oct 4, 2023

There is no longer any expensive re-evaluation in main and server since #3228 was merged. Instead of re-evaluating, we are now "shifting" the KV cache, which is a relatively cheap operation and in some sense is equivalent to the approach proposed in the paper. I would say we even have an advantage, because we have the option to set n_discard == 8, for example, which would make the RoPE recalculation happen once every 8 tokens instead of on each token as is done in StreamingLLM.
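
A small simulation can illustrate the n_discard trade-off. This is only a sketch under simplifying assumptions: tokens are appended one at a time, a shift happens only when the cache is full, and `count_shifts` is a made-up name that performs no llama.cpp calls.

```cpp
#include <cassert>

// Hypothetical toy model (not llama.cpp code).
// Counts how many cache shifts happen while `n_tokens` tokens are appended
// one at a time to a cache that holds at most `n_ctx` entries. When the cache
// is full, `n_discard` of the oldest non-kept entries are dropped in one shift.
int count_shifts(int n_tokens, int n_ctx, int n_keep, int n_discard) {
    assert(n_discard >= 1 && n_keep + n_discard <= n_ctx);
    int n_past = 0;  // entries currently in the cache
    int shifts = 0;
    for (int i = 0; i < n_tokens; ++i) {
        if (n_past == n_ctx) {  // no free slot left: shift before adding
            n_past -= n_discard;
            ++shifts;
        }
        ++n_past;
    }
    return shifts;
}
```

For 1000 tokens with `n_ctx = 100` and `n_keep = 4`, this model gives 900 shifts with `n_discard = 1` and 113 shifts with `n_discard = 8`. The price of a larger `n_discard` is that the window is briefly a few tokens shorter right after each shift.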

@Tostino

Tostino commented Oct 5, 2023

Just wondering, does this give us the option to choose where the sliding window begins? e.g. I have a prompt template as seen here:

<#meta#>
- Date: 2023-10-05
- Task: chat
<#system#>
You are a conversational AI having a turn based chat with a user.
<#chat#>
<#user#>
Message 1
<#bot#>
Response 1
<#user#>
Message 2
<#user_context#>
Some Context
<#bot#>
Response 2
<#user#>
Message 3
<#bot#>
Response 3

Could I anchor this portion:

<#meta#>
- Date: 2023-10-05
- Task: chat
<#system#>
You are a conversational AI having a turn based chat with a user.
<#chat#>

And have the kv cache shift only over the chat portion?

Or am I just misunderstanding things?

@ggerganov
Owner

Yes - count the tokens in that portion and set n_keep equal to that number
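
One thing worth checking when doing this: tokenizing the anchored portion on its own may not always give exactly the tokens that appear at the start of the full prompt, since subword tokenizers can merge across the boundary. A minimal sketch of that check, assuming both texts have already been tokenized into integer ids by whatever tokenizer you use (the helper name `n_keep_for_prefix` is hypothetical, not a llama.cpp function):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not part of llama.cpp).
// Returns the number of leading tokens to pin (a candidate n_keep) if
// `prompt` really begins with the tokens of `prefix`, or -1 if it does not.
int n_keep_for_prefix(const std::vector<int>& prompt, const std::vector<int>& prefix) {
    if (prefix.size() > prompt.size()) return -1;
    // The pinned region only makes sense if the prompt starts with the prefix.
    if (!std::equal(prefix.begin(), prefix.end(), prompt.begin())) return -1;
    return static_cast<int>(prefix.size());
}
```

If the function returns -1, the count of the prefix tokens should not be used as n_keep directly, because the boundary would fall in the wrong place.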

@Tostino

Tostino commented Oct 5, 2023

And is n_keep configurable at inference time? One of the features I was planning was integrating an ensemble LLM that can modify the prompt template at specific points during inference, e.g. to change the current task, or to change the system prompt to align with the problem currently being worked on in the response, and then resume inference. So the number of tokens in that window may change.

@ggerganov
Owner

You can modify it - the API is very flexible. Though to achieve your goal, it would take more than just updating n_keep. One way is to have separate context for each prompt that you evaluate once at the start. Or you can have one context and different sequence ids for the different prompts.

Contributor

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024
@MoonRide303

@ggerganov Is it currently possible to use the StreamingLLM / Windowed Attention with Attention Sinks method, as defined in the paper, via main? If so, how? Will something like --keep 4 or --keep 8 be enough? I don't see --n_keep or --n-discard in the main options list. The results in the paper looked really impressive: basically infinite chats with low PPL (they've tested it up to 4 million tokens), without extra computational / memory costs. Shouldn't it be the default setting for compatible models? Or am I missing something, and there are some downsides not mentioned in the paper?

@ggerganov
Owner

The discussion above is still valid, you can use the --keep argument to control the sink size:

printf(" --keep N number of tokens to keep from the initial prompt (default: %d, -1 = all)\n", params.n_keep);

The n_discard is currently not exposed for modification through the CLI, but it is easy to adjust in the code:

const int n_discard = n_left/2;

As discussed earlier, it is likely required to set n_discard == 1 in order to match the implementation from the paper.
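
For readers who want to see what the shift does to the cache, here is a toy model of it. It is a sketch, not the code from main: `Cell` and `shift_cache` are invented names, no llama.cpp API is called, and it assumes `n_left` is the number of cached tokens beyond the first `n_keep`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One cached entry: which text token it came from, and the position it
// currently occupies in the cache (the position that RoPE would use).
struct Cell {
    int token_index;
    int pos;
};

// Hypothetical toy model (not llama.cpp code).
// Drops `n_discard` entries right after the first `n_keep`, then moves the
// positions of everything behind them down by `n_discard`.
// Passing n_discard < 0 uses the n_left/2 default quoted above.
void shift_cache(std::vector<Cell>& cache, int n_keep, int n_discard = -1) {
    const int n_past = static_cast<int>(cache.size());
    const int n_left = n_past - n_keep;  // assumption: tokens beyond the kept ones
    if (n_discard < 0) n_discard = n_left / 2;
    assert(n_keep >= 0 && n_discard >= 0 && n_discard <= n_left);
    // Remove the oldest non-kept entries.
    cache.erase(cache.begin() + n_keep, cache.begin() + n_keep + n_discard);
    // Close the gap: later entries slide to lower positions.
    for (std::size_t i = static_cast<std::size_t>(n_keep); i < cache.size(); ++i) {
        cache[i].pos -= n_discard;
    }
}
```

Starting from 12 cached tokens with `n_keep = 4`, the default drops tokens 4 to 7 and token 8 moves to position 4. Calling it with `n_discard = 1` instead drops a single token per shift, which is closer to the per-token behavior described for the paper.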

Results in paper looked really impressive .. Or I am missing something, and there are some downsides not mentioned in the paper?

You might be missing that even though the PPL is low, it does not mean that all the information from the processed context is "visible". The model will still "see" just the last n_ctx tokens. More info:


10 participants