Request support for LLaMA-2-7B-32K #2530
Comments
It should work by using the `--rope-freq-scale` parameter.
@klosax Have you tried it?
No, I have not tried it; I was just looking at the model's config.json.
Here is what it does:

```
./main --rope-freq-scale 8.0 -m models/ggml-model-f16.bin -p "What is a Llama?"
main: warning: scaling RoPE frequency by 8 (default 1.0)
main: build = 963 (93356bd)
main: seed  = 1691415624
llama.cpp: loading model from models/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 5504
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 8
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required  = 12853.10 MB (+ 256.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 71.84 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

What is a Llama?!?!
```
@apcameron Actually, it isn't.
You are right; looking at PR #2054, it sure looks like I missed something. Extending the context length from 4k to 32k is a ctx_scale of 8.0, so the value to pass as --rope-freq-scale should be the inverse, 1/8 = 0.125.
If this works as it should, we should consider adding a parameter for scaling directly using the fine-tuned context length. I don't know if the
Thank you, `--rope-freq-scale 0.125` works.
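For reference, a minimal sketch of the arithmetic behind the two values discussed above, assuming the standard 4096-token base context of Llama-2:

```python
# Linear RoPE scaling arithmetic for a 4k-base model fine-tuned to 32k context.
# The flag name matches the llama.cpp log above; the 4096 base context is an
# assumption taken from Llama-2's original training setup.
trained_ctx = 4096
finetuned_ctx = 32768

ctx_scale = finetuned_ctx / trained_ctx        # 8.0   (the "inverse" parameter discussed below)
rope_freq_scale = trained_ctx / finetuned_ctx  # 0.125 (value to pass as --rope-freq-scale)

print(ctx_scale, rope_freq_scale)  # 8.0 0.125
```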
Ok. Thank you.
Great. I think we should have a parameter that is the inverse of this, since it would make more sense and be in line with the parameters in the HF config.json.
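As a rough illustration of that mapping, the rope_scaling entry in a Hugging Face config.json for a linearly-interpolated model, and its relation to the llama.cpp flag, would look something like the sketch below (field names follow transformers' LlamaConfig; the exact values for this particular model are an assumption, not copied from its model card):

```python
# Approximate shape of the rope_scaling entry in a HF config.json for a
# linearly-interpolated 4k -> 32k fine-tune (assumed values).
rope_scaling = {"type": "linear", "factor": 8.0}

# llama.cpp's --rope-freq-scale is the inverse of that factor.
rope_freq_scale = 1.0 / rope_scaling["factor"]  # 0.125
```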
PR added: #2544
Hi,
Thanks for your response, but I want to do this without llama.cpp, like in this code:

```python
import torch
from transformers import BitsAndBytesConfig, GenerationConfig, LlamaForCausalLM, LlamaTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

# NOTE: the BitsAndBytesConfig and from_pretrained arguments were cut off in the
# original comment; typical 4-bit settings are filled in here only to make the
# snippet runnable.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = LlamaForCausalLM.from_pretrained(MODEL_NAME, quantization_config=bnb_config, device_map="auto")
tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
```
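One way to get the extended context without llama.cpp is to load the 32K fine-tune itself with transformers. The following is only a sketch, under the assumptions that the Hugging Face id is togethercomputer/LLaMA-2-7B-32K and that the installed transformers version supports rope_scaling:

```python
from transformers import AutoTokenizer, LlamaForCausalLM

# Assumption: this is the Hugging Face id of the model described below.
MODEL_NAME = "togethercomputer/LLaMA-2-7B-32K"

# If the checkpoint's config.json does not already declare its scaling,
# rope_scaling can be passed explicitly; "linear" with factor 8.0 mirrors
# --rope-freq-scale 0.125 on the llama.cpp side.
model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    rope_scaling={"type": "linear", "factor": 8.0},
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
```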
LLaMA-2-7B-32K
Model Description
LLaMA-2-7B-32K is an open-source, long-context language model developed by Together, fine-tuned from Meta's original Llama-2 7B model. This model represents our efforts to contribute to the rapid progress of the open-source ecosystem for large language models. The model has been extended to a context length of 32K with position interpolation, allowing applications such as multi-document QA, long text summarization, etc.
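For readers unfamiliar with the term, position interpolation compresses the position indices fed to RoPE so that the 32K range maps back into the original 4K training range. A minimal sketch of the idea (not the model's actual implementation):

```python
import numpy as np

def rope_angles(positions, dim=128, base=10000.0, freq_scale=1.0):
    """Rotary-embedding angles; freq_scale < 1.0 implements linear position interpolation."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions * freq_scale, inv_freq)

# With freq_scale = 4096 / 32768 = 0.125, position 32767 is rotated by the same
# angles that position ~4096 would receive in the unscaled base model.
angles = rope_angles(np.arange(32768), freq_scale=0.125)
print(angles.shape)  # (32768, 64)
```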
The model is available here