
Request support for LLaMA-2-7B-32K #2530

Closed
apcameron opened this issue Aug 6, 2023 · 13 comments
Comments

@apcameron
Contributor

LLaMA-2-7B-32K
Model Description

LLaMA-2-7B-32K is an open-source, long-context language model developed by Together, fine-tuned from Meta's original Llama-2 7B model. This model represents our efforts to contribute to the rapid progress of the open-source ecosystem for large language models. The model has been extended to a context length of 32K with position interpolation, allowing applications on multi-document QA, long text summarization, etc.
The model is available here.

@klosax
Contributor

klosax commented Aug 6, 2023

It should work by using the parameter --rope-freq-scale 8.0

@apcameron
Contributor Author

@klosax Have you tried it?
If so, what exactly did you do? It does not work for me.

@klosax
Contributor

klosax commented Aug 7, 2023

No, I have not tried it; I was just looking at the model config.json.
What does not work? Have you tried without quantization, using F32 or F16?

@apcameron
Contributor Author

Here is what it does:

```
./main --rope-freq-scale 8.0 -m models/ggml-model-f16.bin -p "What is a Llama?"
main: warning: scaling RoPE frequency by 8 (default 1.0)
main: build = 963 (93356bd)
main: seed  = 1691415624
llama.cpp: loading model from models/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 5504
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 8
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 12853.10 MB (+  256.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size =   71.84 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 What is a Llama?!?!
```

@Igoorx

Igoorx commented Aug 7, 2023

@apcameron Actually, it isn't --rope-freq-scale 8.0, it should be --rope-freq-scale 0.125 (i.e. 1/8)

@klosax
Contributor

klosax commented Aug 7, 2023

> Actually, it isn't --rope-freq-scale 8.0, it should be --rope-freq-scale 0.125 (i.e. 1/8)

You are right; looking at PR #2054, it sure looks like I missed something.

So extending the context length from 4k to 32k is a ctx_scale of 8.0.
According to the PR, we now have two parameters to set for that to work:

--rope-freq-scale = 1/ctx_scale = 1/8.0 = 0.125
--rope-freq-base = 10000 x ctx_scale = 80000

If this works as it should, we should consider adding a parameter for scaling directly using the fine-tuned ctx. I don't know if the rope-freq-base parameter is needed, but please report your findings.
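
For reference, a quick Python sketch of the arithmetic above (it just restates the values in this comment; as the follow-up below notes, the two flags correspond to different scaling methods and are not meant to be combined):

```python
# Arithmetic from the comment above: extending Llama-2's 4k training context to 32k.
ctx_scale = 32768 / 4096               # = 8.0
rope_freq_scale = 1.0 / ctx_scale      # = 0.125  -> --rope-freq-scale
rope_freq_base = 10000 * ctx_scale     # = 80000  -> --rope-freq-base (NTK-style alternative)
print(rope_freq_scale, rope_freq_base)
```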

@apcameron
Contributor Author

Thank you, --rope-freq-scale 0.125 works.

@Igoorx

Igoorx commented Aug 7, 2023

> If this works as it should, we should consider adding a parameter for scaling directly using the fine-tuned ctx. I don't know if the rope-freq-base parameter is needed, but please report your findings.

rope-freq-base shouldn't be used together with rope-freq-scale. rope-freq-base is used for NTK-aware scaling and rope-freq-scale is used for linear scaling, so if you use the two together you're basically applying a 64x scaling.
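
For intuition, here is a rough Python sketch of how the two knobs enter the standard RoPE angle computation (an assumption based on the usual RoPE formulation, not a copy of llama.cpp's internals): freq_scale shrinks the effective position linearly, while freq_base stretches the frequency spectrum, so setting both compounds the two effects.

```python
# Rough sketch of the standard RoPE angle for dimension pair i of a head.
# Assumption: llama.cpp's --rope-freq-scale / --rope-freq-base map onto
# freq_scale / freq_base as written here.
def rope_angle(pos, i, n_dims=128, freq_base=10000.0, freq_scale=1.0):
    return (freq_scale * pos) * freq_base ** (-2.0 * i / n_dims)

pos, i = 30000, 4
linear_only = rope_angle(pos, i, freq_scale=1 / 8)     # position interpolation
ntk_only = rope_angle(pos, i, freq_base=10000.0 * 8)   # NTK-aware base change
combined = rope_angle(pos, i, freq_base=10000.0 * 8, freq_scale=1 / 8)
print(linear_only, ntk_only, combined)                 # combined is shrunk twice over
```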

@klosax
Contributor

klosax commented Aug 7, 2023

> rope-freq-base shouldn't be used together with rope-freq-scale

Ok. Thank you.

> --rope-freq-scale 0.125 works

Great. I think we should have a parameter that is the inverse of this, since it would make more sense and be in line with the parameters in the HF config.json:

"rope_scaling": {
    "factor": 8.0,
    "type": "linear"
  }
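
As a sketch of that mapping, here is a hypothetical helper (the function name and path are made up for illustration) that derives the current --rope-freq-scale value from the rope_scaling block in a model's config.json, i.e. scale = 1 / factor:

```python
import json

def rope_freq_scale_from_hf(config_path):
    """Hypothetical helper: invert the HF linear rope_scaling factor."""
    with open(config_path) as f:
        cfg = json.load(f)
    scaling = cfg.get("rope_scaling") or {}
    if scaling.get("type") == "linear":
        return 1.0 / scaling["factor"]  # factor 8.0 -> scale 0.125
    return 1.0  # no scaling configured

print(rope_freq_scale_from_hf("models/LLaMA-2-7B-32K/config.json"))
```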

@klosax
Contributor

klosax commented Aug 7, 2023

PR added #2544

@MUZAMMILPERVAIZ

Hi,
Can anyone share sample code showing how to use these scaling parameters when loading the Llama 2 13B chat model from Hugging Face?

@klosax
Contributor

klosax commented Aug 23, 2023

@MUZAMMILPERVAIZ

MUZAMMILPERVAIZ commented Aug 23, 2023

Thanks for your response, but I want this without llama.cpp, like in this code:

```python
import torch
from transformers import BitsAndBytesConfig, GenerationConfig, LlamaForCausalLM, LlamaTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
```
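
If the goal is to apply linear RoPE scaling directly in transformers, one possible approach (a sketch, assuming a transformers version new enough to support the rope_scaling config field, roughly 4.31+, and a checkpoint actually fine-tuned for the longer context) is to set it on the config before loading:

```python
import torch
from transformers import AutoConfig, BitsAndBytesConfig, LlamaForCausalLM, LlamaTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # swap in a long-context fine-tune as appropriate

# Linear RoPE scaling: factor 8.0 stretches the 4k training context towards 32k.
config = AutoConfig.from_pretrained(MODEL_NAME)
config.rope_scaling = {"type": "linear", "factor": 8.0}

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    config=config,
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
```

Note that the scaling alone does not give a base chat model genuine 32K capability; it mainly matters for checkpoints fine-tuned with that scaling, such as the 32K model this issue is about.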
