
[Oversight] Ideal RoPE for CodeLlama 2 based models differs vastly from Llama 2 #3090

Closed
SabinStargem opened this issue Sep 9, 2023 · 7 comments

@SabinStargem

I did not discover this myself. A KoboldCPP user posted that auto-rope for Code Llama was incorrect. In case this also applies to llama.cpp, I wanted to draw attention to the issue. Here is a quote of their findings.

Nexesenex

CodeLlama 2 models are loaded with an automatic rope base frequency similar to Llama 2 when the rope is not specified on the command line at launch.
But the base RoPE frequency CL2 was trained with is 1000000, not 10000.

I couldn't find or figure out the formula to calculate a proper rope base frequency for CL2 according to context length (if you have some ideas..); I'm lame at algebra. But from empirical perplexity tests, the best base rope frequency seems to revolve around 100000 if the rope scale is left at 1, up to a context of 12288.

I observed that the variance between 10000, 100000 and 1000000 is a curve with 0.2 perplexity amplitude at 512 ctx and 0.02 perplexity around 12288, with 100000 having the lowest perplexity.

I could run more tests on a 7b model with a proper command/script that logs, on llama.cpp, the perplexities found with different rope base frequency/scale configs up to 32768 or even higher, as some developers on the ggerganov reddit seem to use, but I didn't find the script (and I'm on Windows).

Once Johannes Gaessler's PR about the KV cache quantized in q8_0 is accepted, we can probably test up to 100,000 ctx on 7b with a single 24GB graphics card.
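
For reference, the base only enters standard RoPE through the per-dimension rotation frequencies, theta_i = base^(-2i/d). A minimal plain-Python sketch (not code from either project) shows what raising the base from 10000 to 1000000 does to the slowest rotation, i.e. the longest position "wavelength" the model can distinguish:

import math

def rope_inv_freqs(base: float, head_dim: int = 128):
    # Standard RoPE: one rotation frequency per pair of head dimensions.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

for base in (10000.0, 100000.0, 1000000.0):
    slowest = rope_inv_freqs(base)[-1]
    # Wavelength of the slowest-rotating pair, in token positions.
    print(f"base={base:>9.0f}  longest wavelength ~ {2 * math.pi / slowest:,.0f} positions")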

@KerfuffleV2
Collaborator

Looks like there's a rope_theta value in config.json for CodeLlama (2?) models. We probably don't have to worry about calculating the best setting ourselves; we can just include it when converting and use it, if available, when loading the GGUF model.
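
A rough sketch of that conversion-time logic (a hypothetical helper, not the actual convert.py code; the gguf-py writer method name is assumed from the llama.cpp tooling of that period):

import json
import gguf  # the gguf-py package that ships with llama.cpp

def copy_rope_freq_base(config_path: str, writer: gguf.GGUFWriter,
                        default_base: float = 10000.0) -> None:
    # Read the Hugging Face config.json and forward rope_theta, if present,
    # into the GGUF metadata so loaders don't have to guess.
    with open(config_path) as f:
        config = json.load(f)
    rope_theta = config.get("rope_theta", default_base)  # CodeLlama: 1000000
    writer.add_rope_freq_base(float(rope_theta))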

@SabinStargem
Author

SabinStargem commented Sep 9, 2023

Here is my log for booting up c34b.

Question: Is the value "1.0e-05" in my log correct? There is a llama.cpp thread where Slaren said this:

Slaren

The CodeLlama models can now be converted to gguf using convert.py, but to operate properly they require the parameter --rope-freq-base 1e6. This parameter needs to be added to the gguf model file metadata.
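
(One empirical way to check which base a converted model actually wants is to sweep it with llama.cpp's perplexity tool; a rough sketch with illustrative paths, assuming the --rope-freq-base flag available in builds of that period:)

import subprocess

MODEL = "codellama-7b.Q4_K_M.gguf"  # illustrative path
EVAL_FILE = "wiki.test.raw"         # illustrative evaluation text

for base in (10000, 100000, 1000000):
    # Expect the lowest perplexity at the base the model was trained with (1e6 for CodeLlama).
    cmd = ["./perplexity", "-m", MODEL, "-f", EVAL_FILE,
           "-c", "4096", "--rope-freq-base", str(base)]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)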

.
.
.
.

Welcome to KoboldCpp - Version 1.43
For command line arguments, please refer to --help
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
Overriding thread count, using 6 threads instead.
Namespace(bantokens=None, blasbatchsize=2048, blasthreads=6, config=None, contextsize=16384, debugmode=False, forceversion=0, gpulayers=0, highpriority=False, hordeconfig=None, host='', launch=True, lora=None, model=None, model_param='C:/KoboldCPP/Models/airoboros-c34b-2.1.Q6_K.gguf', noavx2=False, noblas=False, nommap=False, port=5001, port_param=5001, psutil_set_threads=True, ropeconfig=[0.0, 10000.0], skiplauncher=False, smartcontext=False, stream=False, tensor_split=None, threads=6, unbantokens=False, useclblast=None, usecublas=['normal', '0', 'mmq'], usemirostat=None, usemlock=True)

Loading model: C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf
[Threads: 6, BlasThreads: 6, SmartContext: False]
Identified as LLAMA model: (ver 6)
Attempting to Load...

Using automatic RoPE scaling (scale:1.000, base:26000.0)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from C:\KoboldCPP\Models\airoboros-c34b-2.1.Q6_K.gguf
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 22016
llm_load_print_meta: freq_base = 26000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 34B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model size = 33.74 B
llm_load_print_meta: general.name = jondurbin_airoboros-c34b-2.1
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.14 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 26400.83 MB (+ 3072.00 MB per state)
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/51 layers to GPU
llm_load_tensors: VRAM used: 0 MB
....................................................................................................
llama_new_context_with_model: kv self size = 3072.00 MB
llama_new_context_with_model: compute buffer total size = 8385.48 MB
llama_new_context_with_model: VRAM scratch buffer: 8384.01 MB
Load Model OK: True
Embedded Kobold Lite loaded.

Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001/

@KerfuffleV2
Collaborator

I see. There isn't any "automatic" rope scaling stuff in base llama.cpp as far as I know. However, as of #2793 it should respect those parameters if they're in config.json.

Just for example: https://huggingface.co/Phind/Phind-CodeLlama-34B-Python-v1/blob/main/config.json

Has:

  "rope_scaling": null,
  "rope_theta": 1000000,

Assuming the model was converted with a version that included the pull I mentioned, it should include the correct rope scaling in the .gguf file and use it when loading the model. Long story short, I don't think llama.cpp should be affected by this issue. (I'm not an expert on this so it's possible I'm wrong.)
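
For models that were converted before that pull and therefore lack the metadata, the base can still be overridden at load time. A hedged example using the llama-cpp-python bindings (assuming a build of that era that exposes rope_freq_base; the model path is illustrative):

from llama_cpp import Llama

# GGUF files converted without rope metadata fall back to base 10000;
# CodeLlama-derived models want 1e6, so pass it explicitly.
llm = Llama(
    model_path="airoboros-c34b-2.1.Q6_K.gguf",  # illustrative path
    n_ctx=16384,
    rope_freq_base=1_000_000.0,
    rope_freq_scale=1.0,
)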

@SabinStargem
Author

SabinStargem commented Sep 9, 2023

Checking the PyTorch config for Airoboros c34b, it looks to have the right value for theta. On that front, this looks to be a KoboldCPP issue.

However, that still leaves a point of concern for me.

Slaren said that "--rope-freq-base 1e6" is what CodeLlama uses. I am seeing "rms_norm_eps": 1e-05 in Phind's and Airoboros's PyTorch config files. Assuming that I am not misunderstanding, the llama.cpp tools might be assigning the wrong rms_norm_eps. In KoboldCPP, 1e-05 pops up for both Airo and WizardLM 34b.

TheBloke said it should be 1e6 and that it should be baked straight into the GGUF. That was about 15 days ago, while the Airo and WizardLM models I downloaded are only about 4 days old according to GitHub.

Knowing me, I am likely to be wrong. Still, I wanted to bring that up, just in case.

@KerfuffleV2
Collaborator

Slaren said that "--rope-freq-base 1e6" is what CodeLlama uses. I am seeing "rms_norm_eps": 1e-05 in Phind's and Airoboros's PyTorch config files.

I'm not sure I understand. Aside from both numbers being written in scientific notation, there's no relationship between rms_norm_eps and the rope frequency base as far as I know.

Also, the log messages you pasted from loading the model seem to have the correct EPS values:

llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05

It's the same in config.json for the model:

  "rms_norm_eps": 1e-05,

@SabinStargem
Author

In that case, I stand corrected. Thank you. :)

Contributor

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024