
Try whether OpenLLaMa works #1291

Closed
prusnak opened this issue May 2, 2023 · 82 comments
Labels
model Model specific stale 🦙. llama

Comments

@prusnak
Collaborator

prusnak commented May 2, 2023

... or whether we need to tweak some settings

GitHub: https://github.com/openlm-research/open_llama

HuggingFace: https://huggingface.co/openlm-research/open_llama_7b_preview_300bt


edit: GGML models uploaded to HF by @vihangd => https://huggingface.co/vihangd/open_llama_7b_300bt_ggml

@Green-Sky Green-Sky added model Model specific 🦙. llama labels May 2, 2023
@Green-Sky
Collaborator

Green-Sky commented May 2, 2023

Other than the 7B model, we are also training a smaller 3B model in hope of facilitating language model usage in low resource use cases.

sounds good.

I see inference tests in the CI coming

@Green-Sky
Collaborator

Green-Sky commented May 3, 2023

./main -m models/open_llama_7b_preview_200bt/ggml-model-q5_1.bin -n 100 -s 3
main: build = 491 (7dffb0d)
main: seed  = 3
llama.cpp: loading model from models/open_llama_7b_preview_200bt/ggml-model-q5_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 6612.58 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 100, n_keep = 0


 2016, the company has been awarded a contract to build the city's fifth water plant. The project is being constructed by NEPCO Constructors, which was recently acquired by Bechtel. The company was selected in May of 2016 to build a new wastewater treatment facility for the city and is also currently working on construction of the city's first solar farm. The solar array will supply energy to water utility operations at the site. "This project exemplifies
llama_print_timings:        load time =   421.38 ms
llama_print_timings:      sample time =    69.40 ms /   100 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =   206.36 ms /     2 tokens (  103.18 ms per token)
llama_print_timings:        eval time = 13842.44 ms /    99 runs   (  139.82 ms per run)
llama_print_timings:       total time = 14355.46 ms

@Green-Sky
Collaborator

Green-Sky commented May 3, 2023

(for 200bt)

$ sha256sum *.bin
9119b65346b0b503b29e04d75ca444c626d7c3c5f886ef7a52aa67ac44102d60  ggml-model-f16.bin
ec0d36db9435481b91c6c8e351406a16b19ce328334a64fc27bd52b1c9a3f3d6  ggml-model-q5_1.bin

@ghost

ghost commented May 3, 2023

The fact that this new model works here is great, as it means we can move beyond the leaked LLaMA to a truly open model while keeping the existing GGML code and applications.

@Green-Sky were any changes to llama.cpp/convert.py required to get this model to load?

@Green-Sky
Collaborator

OK, so perplexity seems to spiral out of control. Something must be wrong.
[1]11.4237,[2]275.3643,[3]942.8642,[4]1526.5208,[5]1918.0205,[6]2343.1441

@eiery No, no changes, but I made sure to call convert.py from the model subdirectory, to make sure it picks up the correct tokenizer.

@ggerganov
Owner

@Green-Sky Should be fixed by bf4b22f

@Green-Sky
Collaborator

Green-Sky commented May 3, 2023

@ggerganov I was on bf4b22f

$ bin/perplexity -m ../models/open_llama_7b_preview_200bt/ggml-model-q5_1.bin -f ../wikitext-2-raw/wiki.test.raw
main: build = 489 (bf4b22f)
main: seed  = 1683073191
llama.cpp: loading model from ../models/open_llama_7b_preview_200bt/ggml-model-q5_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 6612.58 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 616 chunks, batch_size=512
31.14 seconds per pass - ETA 5 hours 19 minutes
[1]11.4237,[2]275.3643,[3]942.8642,[4]1526.5208,[5]1918.0205,[6]2343.1441,[7]2831.9216,[8]3417.0539,[9]3617.7688,[10]3698.6451,[11]4032.9414,

it was segfaulting before 😄

@SlyEcho
Collaborator

SlyEcho commented May 3, 2023

It seems to break down after resetting the context.

@ggerganov
Owner

ggerganov commented May 3, 2023

The OpenLLaMA generation fails when the prompt does not start with the BOS token (id 1).
For main, a workaround is to use --keep 1 or more. This guarantees that during a context swap the first token remains BOS.

For perplexity there is no workaround; the fix is to change the chunks to always start with the BOS token.
I am currently evaluating how this affects the existing perplexity results for LLaMA.
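
A minimal sketch of that perplexity-side fix, assuming `tokens` holds the tokenized evaluation text and `BOS = 1` as in LLaMA (illustrative only, not llama.cpp's actual implementation):

```python
BOS = 1          # LLaMA beginning-of-sentence token id
N_CTX = 512      # context size used by the perplexity tool

def chunks_with_bos(tokens, n_ctx=N_CTX, bos=BOS):
    """Split the evaluation tokens into chunks that always start with BOS."""
    step = n_ctx - 1                      # reserve one slot for the prepended BOS
    for start in range(0, len(tokens), step):
        yield [bos] + tokens[start:start + step]
```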

Does anyone know if OpenLLaMA's behavior is correct?
I mean, shouldn't it still work, even if the first token is not BOS?


LLaMA (vanilla)

| Model | Format | PPL (Original) | PPL (With BOS) |
| --- | --- | --- | --- |
| 7B | F16 | 5.9565 | 5.9066 |
| 7B | Q4_0 | 6.2103 | 6.1621 |
| 7B | Q5_1 | 5.9934 | 5.9481 |

@Green-Sky
Collaborator

Note that we use BOS (beginning of sentence) token (id=1) during training, so it is important to prepend this token for best performance during few-shot evaluation.

but why waste an extra slot in context?

@ggerganov
Owner

ggerganov commented May 3, 2023

but why waste an extra slot in context?

I guess it makes the generation more accurate. I think it depends on how the training was performed.
If all training sequences had a BOS at the start, then we have to satisfy this requirement during inference to get a correct result.

The good news is that the fix is trivial.
The bad news is that we need to redo all perplexity calculations.

@TemporalAgent7

TemporalAgent7 commented May 3, 2023

FWIW, this set of instructions worked for me on a Windows 11 machine; they did not work on an Intel mac:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
python3 -m pip install -r requirements.txt
cd models
git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt
cd ..
python3 convert-pth-to-ggml.py models\open_llama_7b_preview_200bt\open_llama_7b_preview_200bt_transformers_weights 1
build\bin\Release\quantize.exe models\open_llama_7b_preview_200bt\open_llama_7b_preview_200bt_transformers_weights\ggml-model-f16.bin models\open_llama_7b_preview_200bt_q5_0.ggml q5_0
build\bin\Release\main.exe -m models\open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 --mlock -p "Building a website can be done in 10 simple steps:"

And these worked for me to convert/quantize the newly released 300bt model:

git clone https://huggingface.co/openlm-research/open_llama_7b_preview_300bt
cd ..
python3 convert-pth-to-ggml.py models\open_llama_7b_preview_300bt\open_llama_7b_preview_300bt_transformers_weights 1
build\bin\Release\quantize.exe models\open_llama_7b_preview_300bt\open_llama_7b_preview_300bt_transformers_weights\ggml-model-f16.bin models\open_llama_7b_preview_300bt_q5_1.ggml q5_1

The results are slightly better but still pretty bad. We need an RLHF / Alpaca fine-tune on top of OpenLLaMA.

@SlyEcho
Collaborator

SlyEcho commented May 4, 2023

The good news is that the fix is trivial
The bad news is that we need to redo all Perplexity calculations

I wouldn't say it affects LLaMA; it is a different model that works differently. Maybe there could be a command-line option --always_bos, although, for users' convenience, it could also be specified in the model file.

EDIT: never mind, I didn't see that the perplexity is lower with this.

@limcheekin

I'd appreciate it if you could share the converted GGML model and show how it works in a Colab notebook.

Thank you.

@sciafri

sciafri commented May 5, 2023

edit: solved this. git-lfs wasn't installed; this is a new WSL distro I've set up and I forgot about that. It seems like a good idea to check the file size or contents (or something similar) and throw a more verbose warning in that case.

I'm getting the following error when trying to run the convert-pth-to-ggml.py script that @TemporalAgent7 had success running. Not sure it matters, but I'm in WSL using Ubuntu 22.04 and Python 3.11. Please let me know if I should open a separate ticket:

~/llama.cpp$ python convert-pth-to-ggml.py models/open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights 1
Loading model file models/open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/pytorch_model-00001-of-00002.bin
Traceback (most recent call last):
  File "~/llama.cpp/convert-pth-to-ggml.py", line 11, in <module>
    convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
  File "~/llama.cpp/convert.py", line 1145, in main
    model_plus = load_some_model(args.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/llama.cpp/convert.py", line 1071, in load_some_model
    models_plus.append(lazy_load_file(path))
                       ^^^^^^^^^^^^^^^^^^^^
  File "~/llama.cpp/convert.py", line 873, in lazy_load_file
    raise ValueError(f"unknown format: {path}")
ValueError: unknown format: models/open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/pytorch_model-00001-of-00002.bin
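
A minimal sketch of the check suggested in the edit above: when git-lfs is missing, the "weights" are just small pointer text files, which could be detected before convert.py fails with "unknown format" (a hypothetical helper, not part of llama.cpp):

```python
def looks_like_lfs_pointer(path: str) -> bool:
    """Detect a git-lfs pointer file left behind by a clone without git-lfs installed."""
    with open(path, "rb") as f:
        head = f.read(64)
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")

# e.g. call this before loading the weights and print a friendlier error such as
# "this looks like a git-lfs pointer; install git-lfs and re-clone the model"
```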

@vihangd

vihangd commented May 5, 2023

Uploaded the ggml weights to huggingface https://huggingface.co/vihangd/open_llama_7b_300bt_ggml

@Green-Sky
Collaborator

Uploaded the ggml weights to huggingface https://huggingface.co/vihangd/open_llama_7b_300bt_ggml

confirmed sha256 for the q5_1, so probably good

@Green-Sky
Collaborator

Green-Sky commented May 17, 2023

https://github.com/openlm-research/open_llama#update-05152023

Update 05/15/2023

After receiving feedback from the community, we discovered that the tokenizer of our previous checkpoint release was configured incorrectly so that new lines are not preserved. To fix this problem, we have retrained our tokenizer and restarted the model training. We’ve also observed lower training loss with this new tokenizer.


They have also released new previews for 3B and 7B variants.

@vihangd

vihangd commented May 17, 2023

Uploaded the quantized weights for the latest 7B 400bt variant at https://huggingface.co/vihangd/open_llama_7b_400bt_ggml

@ghost

ghost commented May 17, 2023

If we get good results here, please consider using OpenLLaMA as an official recommended model for llama.cpp that can be publicly shared. For development we don't need the latest and greatest model; we need something which is compatible with LLaMA and can be used for running regressions and the like.

I'm not sure how consistent the GitHub CI system is in terms of performance, but having new PRs actually run a real-world test on this model might be useful in the long run.

@Green-Sky
Collaborator

the 7B is still too large for CI, but the 3B ... maybe, with action cache, it looks kinda attractive...

@limcheekin

Uploaded the quantized weights for the latest 7B 400bt variant at https://huggingface.co/vihangd/open_llama_7b_400bt_ggml

Thanks. It works!

@Green-Sky
Collaborator

The 3B is a new variant of the LLaMA architecture, so you need to modify the code to make it work.
Hyperparameters:

llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 25
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 8640
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 3B

@ggerganov The n_ff calculated here

uint32_t n_ff = ((2*(4*hparams.n_embd)/3 + hparams.n_mult - 1)/hparams.n_mult)*hparams.n_mult;

is wrong for the 3B: the model file uses 8640, while that formula calculates 8704.
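
For reference, a quick Python transcription of that expression shows the mismatch (just to check the arithmetic, not part of the codebase):

```python
def n_ff(n_embd: int, n_mult: int) -> int:
    # same integer arithmetic as the C++ line quoted above
    return ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

print(n_ff(4096, 256))  # 11008 -> matches the 7B header
print(n_ff(3200, 256))  # 8704  -> but the 3B checkpoint stores n_ff = 8640
```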

Also, running perplexity on the 3B gives very bad results.

[1]467.9299,[2]881.3974,[3]938.6996,[4]968.3665,[5]880.2046,[6]915.8446,[7]947.3328,[8]948.1425,[9]987.2643,[10]996.8526,[11]1044.6316,[12]1066.0830,[13]1058.7516,[14]1079.4397,[15]1067.0952,[16]1033.7049,[17]1047.4951,[18]1016.5692,[19]1028.3408,[20]1008.3351,[21]1025.7143,[22]996.6403,[23]995.1431,[24]971.5839,[25]942.8443,[26]920.9702,[27]886.4688,[28]864.4770,[29]896.0682,[30]877.3518,[31]866.7190,[32]873.4632,[33]861.8478,[34]859.4715,[35]830.4909,[36]826.7048

so there might be more changes needed, or @young-geng and team did an oopsy again.

@SlyEcho
Collaborator

SlyEcho commented May 17, 2023

Shouldn't n_head be 32, @Green-Sky? (Looking at config.json; just a guess.)

@limcheekin

I fine-tuned an instruction-following 3B model using OpenLLaMA. It is on my HuggingFace now.

For your information, you may refer to the following similar works:

By the way, I'd appreciate it if you could share the link to the Hugging Face repo here.

Thanks.

@Sovenok-Hacker

I fine-tuned an instruction-following 3B model using OpenLLaMA. It is on my HuggingFace now.

For your information, you may refer to the following similar works:

By the way, I'd appreciate it if you could share the link to the Hugging Face repo here.

Thanks.

https://huggingface.co/Sovenok-Hacker/nanoalpaca-3b

@klosax
Collaborator

klosax commented May 30, 2023

Perplexity of OpenLLaMA vs LLaMA:

[chart: openllama_perplexity]

The chart shows the perplexity of each chunk, not the cumulative average.

@xingchensong
Contributor

Perplexity of OpenLLaMA vs LLaMA:

[chart: openllama_perplexity]

The chart shows the perplexity of each chunk, not the cumulative average.

[figure: downstream task performance vs. training tokens, from the LLaMA paper]

According to the LLaMA paper, downstream task performance keeps improving even at 1T tokens (figure above). I think the gap between openllama-7b-700bt and llama-7b-1000bt is reasonable, and the OpenLLaMA team confirmed that they are planning to train the model further. I'm happy to stay tuned for more updates from the OpenLLaMA team~

@klosax
Collaborator

klosax commented May 31, 2023

According to LLAMA paper, the downstream task performance keeps improving even at 1T tokens (figure above)

Yes indeed. Simply fitting the data from figure 2 to curves gives a clearer picture. But at some point the accuracy will stop increasing depending on the model size. Maybe OpenLLaMA will continue training the models until that point is found.

[chart: llama_hellaswag]
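
For illustration only, fitting a saturating curve of the kind described above could look like the sketch below, using scipy.optimize.curve_fit (an assumption, not something used in this thread); the data points are made up, not the figure's actual numbers:

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating(x, a, b, c):
    # 'a' is the plateau the accuracy approaches as training tokens grow
    return a - b * np.exp(-c * x)

tokens_bt = np.array([200, 300, 400, 600, 1000], dtype=float)  # hypothetical (billions of tokens)
accuracy  = np.array([0.48, 0.52, 0.55, 0.58, 0.61])           # hypothetical task accuracy

popt, _ = curve_fit(saturating, tokens_bt, accuracy, p0=(0.7, 0.3, 0.002))
print(popt)  # fitted (a, b, c); 'a' estimates where accuracy stops increasing
```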

@ssenthilanand

I am trying to run llama.cpp on a laptop with a Ryzen 4500U with Vega integrated graphics and 8 GB of RAM. It runs 7B models fine, but I wanted to test the OpenCL acceleration. Turning on OpenCL takes away a portion of the RAM and slows down generation. I wanted to test with the 3B model since it will give me more free RAM, so I downloaded the latest OpenLLaMA 3B checkpoint.

I converted the 3B model and got a nice 1.79 GB q4_0 model. Running it gave:

LLAMA_ASSERT: D:\a\llama.cpp\llama.cpp\llama.cpp:906: false

Looking at the code, it looks like the 3B is not supported. Is there any plan to add support for the 3B models?

Now that support for 3B models is available, I tested it using the latest release files and the weights from:
https://huggingface.co/Sovenok-Hacker/nanoalpaca-3b


1. Pure CPU
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "The capital city of India is:" -n 100
...
llama_print_timings: load time = 948.03 ms
llama_print_timings: sample time = 22.70 ms / 100 runs ( 0.23 ms per token)
llama_print_timings: prompt eval time = 719.71 ms / 8 tokens ( 89.96 ms per token)
llama_print_timings: eval time = 8163.96 ms / 99 runs ( 82.46 ms per token)
llama_print_timings: total time = 9150.66 ms

2. 2 layers to IGP
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "The capital city of India is:" -n 100 -ngl 2
...
ggml_opencl: offloading 2 layers to GPU
ggml_opencl: total VRAM used: 132 MB
...
llama_print_timings: load time = 2413.82 ms
llama_print_timings: sample time = 12.56 ms / 55 runs ( 0.23 ms per token)
llama_print_timings: prompt eval time = 2016.63 ms / 8 tokens ( 252.08 ms per token)
llama_print_timings: eval time = 6233.72 ms / 54 runs ( 115.44 ms per token)
llama_print_timings: total time = 8669.34 ms

3. 12 layers to IGP
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "The capital city of India is:" -n 100 -ngl 12
...
ggml_opencl: offloading 12 layers to GPU
ggml_opencl: total VRAM used: 797 MB
...
llama_print_timings: load time = 5908.14 ms
llama_print_timings: sample time = 23.18 ms / 100 runs ( 0.23 ms per token)
llama_print_timings: prompt eval time = 4221.16 ms / 8 tokens ( 527.65 ms per token)
llama_print_timings: eval time = 44238.62 ms / 99 runs ( 446.85 ms per token)
llama_print_timings: total time = 50187.72 ms

4. All layers in IGP
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "The capital city of India is:" -n 100 -ngl 26
main: build = 607 (ffb06a3)
main: seed = 1685525567
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx90c'
ggml_opencl: device FP16 support: true
...
ggml_opencl: offloading 26 layers to GPU
ggml_opencl: total VRAM used: 1728 MB
...
The capital city of India is: Delhi [end of text]

llama_print_timings: load time = 8076.56 ms
llama_print_timings: sample time = 0.44 ms / 2 runs ( 0.22 ms per token)
llama_print_timings: prompt eval time = 4088.64 ms / 8 tokens ( 511.08 ms per token)
llama_print_timings: eval time = 796.24 ms / 1 runs ( 796.24 ms per token)
llama_print_timings: total time = 8873.74 ms

5. All layers in IGP with a prompt requiring longer response.
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "A poem about the pollution in New Delhi" -n 100 -ngl 26
...
llama_print_timings: load time = 7136.19 ms
llama_print_timings: sample time = 16.11 ms / 59 runs ( 0.27 ms per token)
llama_print_timings: prompt eval time = 4295.58 ms / 9 tokens ( 477.29 ms per token)
llama_print_timings: eval time = 52979.77 ms / 58 runs ( 913.44 ms per token)
llama_print_timings: total time = 60143.18 ms


The first 3 runs had the answer, followed by three different random texts with plenty of hallucinations. In test 4 I was surprised by its on-point answer and tried a different prompt.

The prompts are probably not in the correct format, but the relative performance degradation can be seen.

Based on these rather unscientific tests, I am going to use CPU-only inference. Though the IGP can access up to 2 GB of RAM, it makes no difference; going through the IGP slows things down on my laptop.

@klosax
Collaborator

klosax commented Jun 7, 2023

New releases:

OpenLLaMA 3B 1000bt final
OpenLLaMA 7B 1000bt final
OpenLLaMA 13B 600bt

@SlyEcho
Collaborator

SlyEcho commented Jun 7, 2023

I have ggml versions:

All uploaded now.

@ghost

ghost commented Jun 7, 2023

With 3B and 7B released it would be nice for someone with a beefy machine to get perplexity results for the most popular quants.

@klosax
Collaborator

klosax commented Jun 7, 2023

With 3B and 7B released it would be nice for someone with a beefy machine to get perplexity results for the most popular quants.

Perplexity on wiki.test.raw:

openllama-3b-q5_1 : 7.84273862
openllama-7b-q5_1 : 7.03177645

Remember, perplexity is a measure of how "unsure" the model is when predicting the text in the specified file. This is OK when comparing a model with itself, e.g. across different quantization formats. A better measure if you want to compare one model with another would be the Language Model Evaluation Harness; see the Open LLM Leaderboard.
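
For context, the number reported here is just the exponential of the average negative log-likelihood over the evaluated tokens; a minimal sketch (not llama.cpp's actual code, which works on logits in C++):

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probability the model assigned to each evaluated token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```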

@ghost

ghost commented Jun 8, 2023

Remember, perplexity is a measure of how "unsure" the model is when predicting the text in the specified file. This is OK when comparing a model with itself, e.g. across different quantization formats. A better measure if you want to compare one model with another would be the Language Model Evaluation Harness; see the Open LLM Leaderboard.

That was my intention all along, as I wanted to see how well the model quantizes against the F16 baseline. Ideally the results should be similar to the original LLaMA, but you don't know until you try...

@SlyEcho
Collaborator

SlyEcho commented Jun 8, 2023

OK, I did a perplexity run of the new 3B; you can see how it compares to the last one.

| Q | chunk | 600BT | 1000BT |
| --- | --- | --- | --- |
| F16 | [616] | 8.4656 | 7.7861 |
| Q8_0 | [616] | 8.4667 | 7.7874 |
| Q5_1 | [616] | 8.5072 | 7.8424 |
| Q5_0 | [616] | 8.5156 | 7.8474 |
| Q4_1 | [616] | 8.6102 | 8.0483 |
| Q4_0 | [616] | 8.6674 | 8.0962 |

@SlyEcho
Collaborator

SlyEcho commented Jun 8, 2023

Some more comparative perplexity analysis done by @gjmulder: https://github.com/openlm-research/open_llama/discussions/41

@SlyEcho
Collaborator

SlyEcho commented Jun 8, 2023

7B run done:

| Q | score |
| --- | --- |
| Q2_K | 8.5152 |
| Q3_K_S | 7.6623 |
| Q3_K | 7.3837 |
| Q3_K_L | 7.3043 |
| Q4_0 | 7.2116 |
| Q4_1 | 7.1609 |
| Q4_K_S | 7.1516 |
| Q4_K | 7.1116 |
| Q5_0 | 7.0353 |
| Q5_K_S | 7.0325 |
| Q5_1 | 7.0318 |
| Q5_K | 7.0272 |
| Q6_K | 7.0050 |
| Q8_0 | 6.9968 |
| F16 | 6.9966 |

@raffienficiaud

Hi there, wonderful work!

Has this:

We found the problem, it was in the conversion code, there was another n_head = n_embd / 128 type assumption, which I didn't catch at first.

Running perplexity right now...

landed in master? I am still having the issue #1291 (comment)

error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected  3200 x  8704, got  3200 x  8640

when doing the conversion/quantization myself from master (ae9663f) and from HF/openlm-research 3B (q8_0), while the model you posted on HF works for me.

A missing backport?

@Green-Sky
Collaborator

Green-Sky commented Jun 9, 2023

@raffienficiaud We merged the 3B changes without the Python conversion script changes.
#1588 (comment)

@SlyEcho is there an open PR with the hacky Python changes?

@m1chae1bx

@SlyEcho I'm trying out the F16 7B model but I'm getting not-so-good output. I'm using ctransformers. May I know what values you used for the config?

@SlyEcho
Collaborator

SlyEcho commented Jun 9, 2023

when doing the conversion/quantization myself from master (ae9663f) and from HF/openlm-research 3B (q8_0), while the #1291 (comment) on HF works for me.

There is a diff file there where the hacks are used. The whole conversion workflow should be possible to do without many dependencies, just using the Makefile there.

is there an open pr with the hacky python changes?

No, I don't think so.

I see a couple of options to fix it:

  1. Read the config file and get its values.
  2. Allow some values to be overridden from CLI arguments.
  3. Compute a suitable n_mult value that calculates the correct n_ff value (see the sketch after this list).
  4. Add n_ff to the model file (maybe something to consider for the next format?)
  5. Pad the tensors to 256 (I think it should work but I haven't tested it). This may fix the K-quants as well.
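
A minimal sketch of option 3, reusing the n_ff formula quoted earlier in the thread; any n_mult it finds is only a candidate that reproduces the stored value, not necessarily what the model authors used:

```python
def n_ff(n_embd: int, n_mult: int) -> int:
    return ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

target_n_ff = 8640   # value stored in the OpenLLaMA 3B checkpoint
candidates = [m for m in range(1, 1025) if n_ff(3200, m) == target_n_ff]
print(candidates[:3])  # e.g. [108, 120, 135]; any of these reproduces 8640 for n_embd = 3200
```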

I'm trying out the F16 7B model but I'm getting not so good output. I'm using ctransformers. May I know what values did you use for the config?

I didn't create the model, so I don't really know, but it may have something to do with the tokenizer:

Please note that it is advised to avoid using the Hugging Face fast tokenizer for now, as we’ve observed that the auto-converted fast tokenizer sometimes gives incorrect tokenizations. This can be achieved by directly using the LlamaTokenizer class, or passing in the use_fast=False option for the AutoTokenizer class.
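
A minimal sketch of that advice using Hugging Face transformers; the model id below is assumed to be one of the openlm-research repos mentioned in this thread:

```python
from transformers import AutoTokenizer, LlamaTokenizer

# pass use_fast=False so the slow (SentencePiece-based) tokenizer is used
tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", use_fast=False)

# or use the LlamaTokenizer class directly
tok = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
```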

@ochafik
Collaborator

ochafik commented Jun 27, 2023

Re: weird outputs, OpenLLaMA seems to have extra dropout layers in attention and feed-forward (here's something I hacked on tinygrad to make it work).

And potentially some versions have an extra layernorm after the embedding layer (see HF's OpenLlamaModel and how it differs from their LlamaModel).

@young-geng

@ochafik Those dropouts are never used during the pre-training of the model, so I believe that they can be safely ignored. The corresponding model on transformers should be the standard LLaMA instead of Open-Llama.

@ochafik
Collaborator

ochafik commented Jun 27, 2023

@young-geng ahhh, now it makes sense, thank you!

@Green-Sky
Collaborator

Green-Sky commented Jul 7, 2023

New OpenLLaMA just dropped: https://huggingface.co/openlm-research/open_llama_7b_v2

Update 07/07/2023
We are happy to release an OpenLLaMA 7Bv2 model, which is trained on a mixture of Falcon refined-web dataset, mixed with the starcoder dataset, and the wikipedia, arxiv and books and stackexchange from RedPajama. The 3Bv2 model is coming soon.

@SlyEcho
Collaborator

SlyEcho commented Jul 10, 2023

I have ggml files for v2: https://huggingface.co/SlyEcho/open_llama_7b_v2_ggml/tree/main

@klosax
Collaborator

klosax commented Jul 17, 2023

Version 2 of the 3b open llama model: https://huggingface.co/openlm-research/open_llama_3b_v2

@SlyEcho
Collaborator

SlyEcho commented Jul 18, 2023

Uploading 3Bv2: https://huggingface.co/SlyEcho/open_llama_3b_v2_ggml

@github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions bot closed this as completed Apr 9, 2024