
Request support for LLaMA-2-7B-32K #2530

Closed
apcameron opened this issue Aug 6, 2023 · 13 comments
Comments

@apcameron
Contributor

LLaMA-2-7B-32K
Model Description

LLaMA-2-7B-32K is an open-source, long-context language model developed by Together, fine-tuned from Meta's original Llama-2 7B model. This model represents our efforts to contribute to the rapid progress of the open-source ecosystem for large language models. The model has been extended to a context length of 32K with position interpolation, allowing applications on multi-document QA, long text summarization, etc.
The model is available here.

@klosax
Contributor

klosax commented Aug 6, 2023

It should work by using the parameter --rope-freq-scale 8.0

@apcameron
Contributor Author

@klosax Have you tried it?
If so, what exactly did you do? It does not work for me.

@klosax
Contributor

klosax commented Aug 7, 2023

No, I have not tried it; I was just looking at the model config.json.
What does not work? Have you tried without quantization, using F32 or F16?

@apcameron
Contributor Author

Here is what it does:

```
./main --rope-freq-scale 8.0 -m models/ggml-model-f16.bin -p "What is a Llama?"
main: warning: scaling RoPE frequency by 8 (default 1.0)
main: build = 963 (93356bd)
main: seed  = 1691415624
llama.cpp: loading model from models/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 5504
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 8
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 12853.10 MB (+  256.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size =   71.84 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 What is a Llama?!?!
```

@Igoorx

Igoorx commented Aug 7, 2023

@apcameron Actually, it isn't --rope-freq-scale 8.0, it should be --rope-freq-scale 0.125 (i.e. 1/8)

@klosax
Contributor

klosax commented Aug 7, 2023

> Actually, it isn't --rope-freq-scale 8.0, it should be --rope-freq-scale 0.125 (i.e. 1/8)

You are right; looking at PR #2054, it sure looks like I missed something.

So extending the context length from 4k to 32k is a ctx_scale of 8.0.
According to the PR, we now have two parameters to set for that to work:

--rope-freq-scale = 1/ctx_scale = 1/8.0 = 0.125
--rope-freq-base = 10000 x ctx_scale = 80000

If this works as it should, we should consider adding a parameter for scaling directly using the fine-tuned ctx. I don't know if the rope-freq-base parameter is needed, but please report your findings.
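
For reference, a quick Python sketch of the arithmetic above (it just restates the values in this comment; as the follow-up below notes, the two flags correspond to different scaling methods and are not meant to be combined):

```python
# Arithmetic from the comment above: extending Llama-2's 4k training context to 32k.
ctx_scale = 32768 / 4096               # = 8.0
rope_freq_scale = 1.0 / ctx_scale      # = 0.125  -> --rope-freq-scale
rope_freq_base = 10000 * ctx_scale     # = 80000  -> --rope-freq-base (NTK-style alternative)
print(rope_freq_scale, rope_freq_base)
```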

@apcameron
Contributor Author

Thank you, --rope-freq-scale 0.125 works.

@Igoorx

Igoorx commented Aug 7, 2023

> If this works as it should, we should consider adding a parameter for scaling directly using the fine-tuned ctx. I don't know if the rope-freq-base parameter is needed, but please report your findings.

rope-freq-base shouldn't be used together with rope-freq-scale. rope-freq-base is used for NTK-aware scaling and rope-freq-scale is used for linear scaling, so if you use the two together you're basically applying a 64x scaling.
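
For intuition, here is a rough Python sketch of how the two knobs enter the standard RoPE angle computation (an assumption based on the usual RoPE formulation, not a copy of llama.cpp's internals): freq_scale shrinks the effective position linearly, while freq_base stretches the frequency spectrum, so setting both compounds the two effects.

```python
# Rough sketch of the standard RoPE angle for dimension pair i of a head.
# Assumption: llama.cpp's --rope-freq-scale / --rope-freq-base map onto
# freq_scale / freq_base as written here.
def rope_angle(pos, i, n_dims=128, freq_base=10000.0, freq_scale=1.0):
    return (freq_scale * pos) * freq_base ** (-2.0 * i / n_dims)

pos, i = 30000, 4
linear_only = rope_angle(pos, i, freq_scale=1 / 8)     # position interpolation
ntk_only = rope_angle(pos, i, freq_base=10000.0 * 8)   # NTK-aware base change
combined = rope_angle(pos, i, freq_base=10000.0 * 8, freq_scale=1 / 8)
print(linear_only, ntk_only, combined)                 # combined is shrunk twice over
```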

@klosax
Contributor

klosax commented Aug 7, 2023

> rope-freq-base shouldn't be used together with rope-freq-scale

Ok. Thank you.

> --rope-freq-scale 0.125 works

Great. I think we should have a parameter that is the inverse of this, since it would make more sense and be in line with the parameters in the HF config.json:

"rope_scaling": {
    "factor": 8.0,
    "type": "linear"
  }
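
As a sketch of that mapping, here is a hypothetical helper (the function name and path are made up for illustration) that derives the current --rope-freq-scale value from the rope_scaling block in a model's config.json, i.e. scale = 1 / factor:

```python
import json

def rope_freq_scale_from_hf(config_path):
    """Hypothetical helper: invert the HF linear rope_scaling factor."""
    with open(config_path) as f:
        cfg = json.load(f)
    scaling = cfg.get("rope_scaling") or {}
    if scaling.get("type") == "linear":
        return 1.0 / scaling["factor"]  # factor 8.0 -> scale 0.125
    return 1.0  # no scaling configured

print(rope_freq_scale_from_hf("models/LLaMA-2-7B-32K/config.json"))
```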

@klosax
Contributor

klosax commented Aug 7, 2023

PR added #2544

@MUZAMMILPERVAIZ

Hi,
Can anyone share sample code showing how to use these scaling parameters when loading the Llama 2 13B chat model from Hugging Face?

@klosax
Contributor

klosax commented Aug 23, 2023

@MUZAMMILPERVAIZ

MUZAMMILPERVAIZ commented Aug 23, 2023

Thanks for your response, but I want this without llama.cpp, like in this code:

```python
import torch
from transformers import BitsAndBytesConfig, GenerationConfig, LlamaForCausalLM, LlamaTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
```
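
If the goal is to apply linear RoPE scaling directly in transformers, one possible approach (a sketch, assuming a transformers version new enough to support the rope_scaling config field, roughly 4.31+, and a checkpoint actually fine-tuned for the longer context) is to set it on the config before loading:

```python
import torch
from transformers import AutoConfig, BitsAndBytesConfig, LlamaForCausalLM, LlamaTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # swap in a long-context fine-tune as appropriate

# Linear RoPE scaling: factor 8.0 stretches the 4k training context towards 32k.
config = AutoConfig.from_pretrained(MODEL_NAME)
config.rope_scaling = {"type": "linear", "factor": 8.0}

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    config=config,
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
```

Note that the scaling alone does not give a base chat model genuine 32K capability; it mainly matters for checkpoints fine-tuned with that scaling, such as the 32K model this issue is about.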
