Implement customizable RoPE #2054

Merged
merged 10 commits into from
Jul 15, 2023

@jxy
Contributor

jxy commented Jun 30, 2023

The original RoPE has pre-defined parameters

theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]

Our customizable RoPE, ggml_rope_custom_inplace, uses

theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]

with defaults that match the original:

scale = 1.0
base = 10000
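
As a rough illustration of the formula above, here is a small Python sketch (not the ggml C code). The function name `rope_thetas` is made up for this example, and it uses a 0-based index k where the description uses i−1. In RoPE the rotation angle at position p is p * theta_i, so a scale below 1 acts like compressing the positions.

```python
def rope_thetas(d, base=10000.0, scale=1.0):
    """Return theta_i = scale * base^(-2(i-1)/d) for i = 1..d/2 (hypothetical helper)."""
    if d % 2 != 0:
        raise ValueError("d must be even")
    # k runs over 0..d/2-1, i.e. k = i - 1 in the notation of the description
    return [scale * base ** (-2.0 * k / d) for k in range(d // 2)]


original = rope_thetas(128)                         # defaults: scale 1.0, base 10000
custom = rope_thetas(128, base=80000.0, scale=0.5)  # values from the "for the bold" example

print("original first/last:", original[0], original[-1])
print("custom   first/last:", custom[0], custom[-1])
```

With the defaults the list is the original RoPE schedule; changing `scale` multiplies every entry, while changing `base` stretches the low-frequency end.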

The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameters.

Recent research shows that changing these two parameters extends the context limit with minimal loss.

  1. Extending Context to 8K kaiokendev https://kaiokendev.github.io/til#extending-context-to-8k

  2. Extending Context Window of Large Language Models via Positional Interpolation Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian https://arxiv.org/abs/2306.15595

  3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/user/bloc97 https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5

@1980Dragon

Just came across another RoPE adjustment method on Reddit. Thought it might be helpful, so here's the link!
Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning

@maddes8cht
Contributor

This still means I will get better perplexity performance when using this PR with @TheBloke's SuperHOT model variants, I guess?

@FNsi
Contributor

FNsi commented Jun 30, 2023

Somehow exciting?

I use my merged 13b model (wizard vicuña + starcoder + superhot 16k)

With that 16k command.

  1. 5.3710, 2. 5.7513

Looks reasonable.

And what if I want to test 32k or higher? How should I set both parameters? Any ideas?

@jxy
Contributor Author

jxy commented Jun 30, 2023

The --rope-freq-scale is the same scale used in "superhot/rope interpolation". The superhot 8k lora corresponds to --rope-freq-scale 0.25 -c 8192, which is a factor of 4 increase. Similarly, the superhot 16k lora and longchat 16k correspond to --rope-freq-scale 0.125 -c 16384, for a factor of 8.
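
The relation between the scale and the context size in the paragraph above can be sketched in a few lines of Python. This assumes a training context of 2048 tokens (the LLaMA models discussed in this thread); the helper name `freq_scale_for` is hypothetical.

```python
def freq_scale_for(target_ctx, train_ctx=2048):
    """Linear interpolation: squeeze target_ctx positions into train_ctx positions.

    train_ctx=2048 is an assumption about the base model, not something the flag knows.
    """
    if target_ctx < train_ctx:
        raise ValueError("target context is smaller than the training context")
    return train_ctx / target_ctx


for ctx in (8192, 16384):
    print(f"-c {ctx} --rope-freq-scale {freq_scale_for(ctx)}")
```

For a model trained with a different context length, pass its value as `train_ctx`.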

The --rope-freq-base simplifies the "NTK-Aware Scaled RoPE". The base number here corresponds to 10000*alpha**(64/63) using the alpha introduced in the reddit post. I'm not aware of any direct translation of how context length corresponds to the base or alpha. My limited testing with 13B models shows a rough quadratic correspondence, C = -0.0652*b*b + 0.862*b + 0.203, for C the factor of context length increase, and b the factor of base increase, roughly:

base     effective ctx factor   effective ctx
20000    1.66                   3400
26000    2                      4096
40000    2.6                    5300
57200    3.0                    6144
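
For anyone who wants to play with the numbers, the empirical fit and the alpha-to-base conversion quoted above can be written down as a short Python sketch. The coefficients come straight from the comment; the function names are made up here, and the fit is only as good as the limited 13B testing behind it.

```python
import math

# Coefficients of the rough quadratic fit quoted above: C = A*b*b + B*b + C0
A, B, C0 = -0.0652, 0.862, 0.203


def base_from_alpha(alpha):
    """Base corresponding to the NTK-aware alpha: 10000 * alpha**(64/63)."""
    return 10000.0 * alpha ** (64.0 / 63.0)


def ctx_factor(b):
    """Context-length factor C predicted by the fit for a base factor b."""
    return A * b * b + B * b + C0


def base_factor_for(c):
    """Invert the fit on its rising branch; returns None if c is above the fit's peak."""
    disc = B * B - 4.0 * A * (C0 - c)
    if disc < 0:
        return None
    return (-B + math.sqrt(disc)) / (2.0 * A)


for base in (20000, 26000, 40000, 57200):
    b = base / 10000.0
    print(base, round(ctx_factor(b), 2), round(2048 * ctx_factor(b)))
```

Note that the parabola itself tops out at a factor of about 3.05 near b ≈ 6.6, so the fit cannot be used to predict anything beyond roughly 3x; `base_factor_for` returns None in that case.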

I found base>60000 didn't feel good, though I've no hard numbers to back this up.

Empirically, without fine-tuning, you could try

  • -c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000
  • -c 6144 --rope-freq-scale 0.86 --rope-freq-base 40000
  • -c 8192 --rope-freq-scale 0.75 --rope-freq-base 57200

With superhot 16k or longchat 13b, perhaps you could try (KV cache alone requires 25GB!!)

  • -c 32768 --rope-freq-scale 0.125 --rope-freq-base 26000
  • or dial up base more?
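
The 25GB figure for the KV cache can be checked with a quick back-of-the-envelope calculation. This sketch assumes an f16 cache (2 bytes per element) and the LLaMA 13B shape (40 layers, embedding width 5120); the helper name is made up for illustration.

```python
def kv_cache_bytes(n_ctx, n_layer, n_embd, bytes_per_elem=2):
    """Size of the K and V caches: 2 tensors * layers * positions * width * element size."""
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem


# Assumed 13B shape: 40 layers, 5120-wide embeddings, f16 cache entries
size = kv_cache_bytes(n_ctx=32768, n_layer=40, n_embd=5120)
print(size, "bytes =", size / 2**30, "GiB")
```

The size grows linearly with `-c`, so halving the context halves the cache.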

@jxy
Contributor Author

jxy commented Jun 30, 2023

I used some numbers posted by @JohannesGaessler, and made changes in scratch0 size in this PR. I can rebase this PR on their PR #2056 if needed.

@JohannesGaessler
Collaborator

I think the numbers that I determined for the VRAM scratch buffer will probably work for the RAM scratch buffer but I would still advise you to be cautious since the two types of scratch buffer work differently (the VRAM scratch buffer has tighter limits).

@trap20

trap20 commented Jun 30, 2023

I tried the -c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000 configuration with the wizardlm-33b-v1.0-uncensored.ggmlv3.q5_K_M.bin model and got ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 490733568, available 482344960). Even with 0 layers offloaded.

guanaco-65B.ggmlv3.q4_K_M.bin works with those settings.

@jxy
Contributor Author

jxy commented Jun 30, 2023

I tried the -c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000 configuration with the wizardlm-33b-v1.0-uncensored.ggmlv3.q5_K_M.bin model and got ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 490733568, available 482344960).

I can't reproduce. Is CUDA build different? I can't test CUDA build. The number 482344960 seems to be computed from

{ MODEL_30B, ((size_t) n_ctx / 10ull + 256ull) * MB }

with n_ctx = 2048. Do you use main or other method/executable to run it?
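
To make the arithmetic explicit, here is the same formula in a few lines of Python (the function name `scratch0_30b` is only for this sketch; the integer division mirrors the `size_t` division in the C expression):

```python
MB = 1024 * 1024


def scratch0_30b(n_ctx):
    """Mirror of ((size_t) n_ctx / 10ull + 256ull) * MB for MODEL_30B."""
    return (n_ctx // 10 + 256) * MB


print(scratch0_30b(2048))  # the "available" value from the error report
print(scratch0_30b(4096))  # what -c 4096 would give if n_ctx were picked up
```

With n_ctx = 2048 this reproduces the reported 482344960 bytes, while n_ctx = 4096 gives more than the 490733568 bytes that were needed, which is why it looks like the executable did not see the larger context.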

@time-less-ness

time-less-ness commented Jun 30, 2023

I am trying this out, and it is working fine for me so far, though I've only tried:

model: gpt4-alpaca-lora-30b.ggmlv3.q4_0.bin
context: -c 6144 --rope-freq-scale 0.86 --rope-freq-base 40000

I'll keep trying other models and context sizes and see how it goes. Not sure if other fixes are included, but this seems to make inference way faster (via fewer long pauses to do CPU-related tasks) on my machine as well. Possibly just due to not having to recompute context as I hit the 2048-token mark so often?

EDIT: Also working just fine for me:

model: gpt4-alpaca-lora-30b.ggmlv3.q4_0.bin 
context: -c 8192 --rope-freq-scale 0.75 --rope-freq-base 57200

@trap20

trap20 commented Jun 30, 2023

I can't reproduce. Is CUDA build different? [...] Do you use main or other method/executable to run it?

Yes, it's a CUDA build for a 1080ti and I really should have used main for reporting, but didn't. I'll try main tomorrow.

@jxy
Contributor Author

jxy commented Jul 1, 2023

I can't reproduce. Is CUDA build different? [...] Do you use main or other method/executable to run it?

Yes, it's a CUDA build for a 1080ti and I really should have used main for reporting, but didn't. I'll try main tomorrow.

If you could give me a stack trace of when MEM_REQ_SCRATCH0 is called, I could try to figure out what is wrong with the CUDA build. Otherwise, I'll see if I can get a system somewhere with cuda.

@trap20

trap20 commented Jul 1, 2023

Can't reproduce the error today, no idea what I did exactly to trigger it...

@ggerganov
Owner

This looks great, but similar to #1967 - let's wait for a while before merging.
There are new RoPE scaling techniques popping up by the hour, each one better than the other. No reason to commit to something just yet.

@FNsi
Contributor

FNsi commented Jul 2, 2023

  • or dial up base more?

Tested with my merged 13b vicuña model (wizardvicuña + starcoder Lora + gpt4tools + 16k superhot).

16k, with perplexity
(chunks decrease to 20 at 16k)

base      scale   perplexity [1]
70000     0.4     5.5564

57200     0.5     6.7699
68000     0.5     5.3758
70000     0.5     5.3508
75000     0.5     5.3529
76000     0.5     5.3532
78000     0.5     5.3573
80000     0.5     5.3710
84000     0.5     5.4351
100000    0.5     5.6484
120000    0.5     5.7999

The number of chunks decreases as the context grows; might that be the reason for some of the perplexity problems? Obviously not here, though.


At 20k the chunks decrease to 16.

20k
base      scale   perplexity [1]
68000     0.4     5.7306
70000     0.4     5.7083
72000     0.4     5.7550
11000     0.4     6.2897
150000    0.4     6.6441
100000    0.5     5.7545
110000    0.5     5.7393
120000    0.5     5.8566

32k

I believe 13b MEM_REQ_EVAL is not enough to test🤷

@FNsi
Contributor

FNsi commented Jul 2, 2023

Running perplexity on OpenLLaMA 3b with -c 16384, scale 0.5, base 90000:

Not enough space in the context's memory pool (needed 543758112, available 536870912)

13b with -c 32768, scale 0.25, base 120000:
Segmentation fault (needed 108054424, available 1073741824)

@SlyEcho
Collaborator

SlyEcho commented Jul 2, 2023

KV cache alone requires 25GB!!

Could we quantize the KV cache?

@FNsi
Contributor

FNsi commented Jul 2, 2023

KV cache alone requires 25GB!!

Could we quantize the KV cache?

Another solution #1955


Btw I just saw
falcon v split

@Green-Sky
Collaborator

Green-Sky commented Jul 2, 2023

KV cache alone requires 25GB!!

Could we quantize the KV cache?

I think this was tried and gave bad results. It should already be in f16.
But I don't remember if we tried 8-bit quantization...

Edit: do we use flash attention for the forward pass?

@ardfork
Contributor

ardfork commented Jul 2, 2023

For some reason, the server example outputs some random Unicode characters when using --rope-freq-scale 0.25 -c 8192, but --rope-freq-scale 0.25 -c 4096 works correctly and --rope-freq-scale 0.25 -c 8192 works on the CLI.

@jxy
Contributor Author

jxy commented Jul 3, 2023

server gives me 413 when the json data is large. We need help from those who contributed server code.

@SlyEcho
Collaborator

SlyEcho commented Jul 3, 2023

server gives me 413 when the json data is large. We need help from those who contributed server code.

I believe CPPHTTPLIB_RECV_BUFSIZ needs to be increased, right now it is 4K.

@digiwombat
Contributor

Yeah SlyEcho is right based on what I saw in the lib, setting
#define CPPHTTPLIB_RECV_BUFSIZ size_t(<SOME NUMBER HERE>)
before the httplib.h import should be the correct way to increase it, I believe.

@jxy
Contributor Author

jxy commented Jul 3, 2023

Running perplexity on OpenLLaMA 3b with -c 16384, scale 0.5, base 90000:

Not enough space in the context's memory pool (needed 543758112, available 536870912)

The only thing that I know of allocating 512 MB (536870912) is from MEM_REQ_EVAL, which this PR didn't change. Maybe try changing the line

{ MODEL_3B, 512ull * MB },

to something like

{ MODEL_3B, 600ull * MB },

and see if it helps?
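
A quick check of the numbers from the error message, as a Python sketch (the helper `min_pool_mb` is hypothetical, and this only restates the reported sizes; it does not model what the evaluation actually allocates):

```python
import math

MB = 1024 * 1024


def min_pool_mb(needed_bytes):
    """Smallest whole number of MB that covers needed_bytes."""
    return math.ceil(needed_bytes / MB)


needed = 543758112     # "needed" from the error report
available = 536870912  # "available" from the error report

print(available // MB, "MB available")
print(min_pool_mb(needed), "MB needed at minimum")
```

So the reported shortfall is only a few MB, and the suggested 600 MB leaves some headroom for this particular run.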

@jxy
Contributor Author

jxy commented Jul 4, 2023

I believe CPPHTTPLIB_RECV_BUFSIZ needs to be increased, right now it is 4K.

It looks like a simple read buffer to me, and it's separate from the overall size limit.

@digiwombat
Contributor

It looks like a simple read buffer to me, and it's separate from the overall size limit.

Server::set_payload_max_length(uint64_t length) might be what we're after then.

svr.set_payload_max_length(1024 * 1024 * 1); would set it to 1MB (left the 1 in for example purposes)

@jxy
Contributor Author

jxy commented Jul 4, 2023

It looks like a simple read buffer to me, and it's separate from the overall size limit.

Server::set_payload_max_length(uint64_t length) might be what we're after then.

svr.set_payload_max_length(1024 * 1024 * 1); would set it to 1MB (left the 1 in for example purposes)

the default is actually

#define CPPHTTPLIB_PAYLOAD_MAX_LENGTH ((std::numeric_limits<size_t>::max)())

@ggerganov
Owner

What is the latest state of this approach - is it worth merging and supporting?

@time-less-ness

I've been using this on a Mac M1 Max since the PR was raised and it's working fine for me. I've been hoping it will get merged so I can go back to compiling from master again. Really enjoying having 8k context.

@SlyEcho
Collaborator

SlyEcho commented Jul 14, 2023

Let's merge and maybe then improve later.

@@ -15759,7 +15759,7 @@ static void ggml_compute_backward(struct ggml_context * ctx, struct ggml_tensor
             {
                 if (src0->grad) {
                     assert(src1->type == GGML_TYPE_I32);
-                    assert(ggml_nelements(src1) == 4);
+                    assert(ggml_nelements(src1) == 3);
Collaborator

Shouldn't this be 6? Based on the code immediately after it should be at least 4, I think, not 3.

I went through the code and I also can't see why it's 3 when the lines just below it clearly take 4 elements; it looks like it is designed to fail the assertion.

Owner

Should be fixed in 513f861

@LostRuins
Collaborator

Is there any reason why the following lines are unmodified and still use the hardcoded 10000.0 and 1.0 rope frequency and scale?

https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L2955
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L12418
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L12517

@Green-Sky
Collaborator

Is there any reason why the following lines are unmodified and still use the hardcoded 10000.0 and 1.0 rope frequency and scale?

for the ggml.c lines, those appear to be the rope backwards passes, confusingly named forward_rope_back

@jxy
Contributor Author

jxy commented Jul 18, 2023

I left the backward code untouched because I wasn't sure how I could correctly modify it and test it.

I'm also not sure about cuda bits.

@SlyEcho
Collaborator

SlyEcho commented Jul 18, 2023

The CUDA part is broken right now, it should be fixed.

@bilal-aamer

How do I implement this with RoPE and without it with current LLMs?

@abc-nix

abc-nix commented Jan 21, 2024

How do I implement this with RoPE and without it with current LLMs?

You can read a bit more about RoPE use in llama.cpp in the llama.cpp/examples/main/README.md

Though I would recommend you try out the new Self-Extend support added in commit #4815 which I think is better, as you don't need to retrain the model to get better results.

@bilal-aamer

How do I implement this with RoPE and without it with current LLMs?

You can read a bit more about RoPE use in llama.cpp in the llama.cpp/examples/main/README.md

Though I would recommend you try out the new Self-Extend support added in commit #4815 which I think is better, as you don't need to retrain the model to get better results.

Thanks @abc-nix!

What about the implementation of customized RoPE?

@abc-nix

abc-nix commented Jan 21, 2024

Sorry, @bilal-aamer, I am not sure what you are trying to ask here.

This PR adds customized RoPE support. Later, YaRN RoPE scaling was added in PR #2268 and some other fixes were added after that.

main's help has this to say about the options and parameters that make use of RoPE/YaRN:

  --rope-scaling {none,linear,yarn}
                        RoPE frequency scaling method, defaults to linear unless specified by the model
  --rope-scale N        RoPE context scaling factor, expands context by a factor of N
  --rope-freq-base N    RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
  --rope-freq-scale N   RoPE frequency scaling factor, expands context by a factor of 1/N
  --yarn-orig-ctx N     YaRN: original context size of model (default: 0 = model training context size)
  --yarn-ext-factor N   YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation)
  --yarn-attn-factor N  YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
  --yarn-beta-slow N    YaRN: high correction dim or alpha (default: 1.0)
  --yarn-beta-fast N    YaRN: low correction dim or beta (default: 32.0)

I am not sure what you are trying to achieve or what exactly you are asking. Hopefully someone else isn't as obtuse as me and can help you out.

@madiarabis

madiarabis commented Mar 29, 2024

Is there any documentation on how to implement this or an example? I am kind of new in the field and I am fine tuning code llama 2 and I want to increase the context length. But between all these posts I am sort of confused how to implement it actually.

This is my implementation:
accelerate launch --config_file "./fsdp_config.yaml" fsdp_acc2.py
--rope_scaling 0.25

this is the error I am getting:
RuntimeError: The size of tensor a (16384) must match the size of tensor b (16385) at non-singleton dimension
