
Try whether OpenLLaMa works #1291

Closed
prusnak opened this issue May 2, 2023 · 82 comments
Labels
model Model specific stale 🦙. llama

Comments

@prusnak
Collaborator

prusnak commented May 2, 2023

... or whether we need to tweak some settings

GitHub: https://github.com/openlm-research/open_llama

HuggingFace: https://huggingface.co/openlm-research/open_llama_7b_preview_300bt


edit: GGML models uploaded to HF by @vihangd => https://huggingface.co/vihangd/open_llama_7b_300bt_ggml

@Green-Sky Green-Sky added model Model specific 🦙. llama labels May 2, 2023
@Green-Sky
Collaborator

Green-Sky commented May 2, 2023

Other than the 7B model, we are also training a smaller 3B model in hope of facilitating language model usage in low resource use cases.

sounds good.

I see inference tests in the CI coming

@Green-Sky
Collaborator

Green-Sky commented May 3, 2023

./main -m models/open_llama_7b_preview_200bt/ggml-model-q5_1.bin -n 100 -s 3
main: build = 491 (7dffb0d)
main: seed  = 3
llama.cpp: loading model from models/open_llama_7b_preview_200bt/ggml-model-q5_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 6612.58 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 100, n_keep = 0


 2016, the company has been awarded a contract to build the city's fifth water plant. The project is being constructed by NEPCO Constructors, which was recently acquired by Bechtel. The company was selected in May of 2016 to build a new wastewater treatment facility for the city and is also currently working on construction of the city's first solar farm. The solar array will supply energy to water utility operations at the site. "This project exemplifies
llama_print_timings:        load time =   421.38 ms
llama_print_timings:      sample time =    69.40 ms /   100 runs   (    0.69 ms per run)
llama_print_timings: prompt eval time =   206.36 ms /     2 tokens (  103.18 ms per token)
llama_print_timings:        eval time = 13842.44 ms /    99 runs   (  139.82 ms per run)
llama_print_timings:       total time = 14355.46 ms

@Green-Sky
Collaborator

Green-Sky commented May 3, 2023

(for 200bt)

$ sha256sum *.bin
9119b65346b0b503b29e04d75ca444c626d7c3c5f886ef7a52aa67ac44102d60  ggml-model-f16.bin
ec0d36db9435481b91c6c8e351406a16b19ce328334a64fc27bd52b1c9a3f3d6  ggml-model-q5_1.bin

@ghost

ghost commented May 3, 2023

The fact that this new model works here is great, as it means we can move beyond the leaked LLaMA to a truly open model while keeping the existing GGML code and applications.

@Green-Sky were any changes to llama.cpp/convert.py required to get this model to load?

@Green-Sky
Collaborator

OK, so perplexity seems to spiral out of control. Something must be wrong.
[1]11.4237,[2]275.3643,[3]942.8642,[4]1526.5208,[5]1918.0205,[6]2343.1441

@eiery No, no changes, but I made sure to call convert.py from the model subdirectory, to make sure it picks up the correct tokenizer.

@ggerganov
Owner

@Green-Sky Should be fixed by bf4b22f

@Green-Sky
Collaborator

Green-Sky commented May 3, 2023

@ggerganov I was on bf4b22f

$ bin/perplexity -m ../models/open_llama_7b_preview_200bt/ggml-model-q5_1.bin -f ../wikitext-2-raw/wiki.test.raw
main: build = 489 (bf4b22f)
main: seed  = 1683073191
llama.cpp: loading model from ../models/open_llama_7b_preview_200bt/ggml-model-q5_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 6612.58 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity : calculating perplexity over 616 chunks, batch_size=512
31.14 seconds per pass - ETA 5 hours 19 minutes
[1]11.4237,[2]275.3643,[3]942.8642,[4]1526.5208,[5]1918.0205,[6]2343.1441,[7]2831.9216,[8]3417.0539,[9]3617.7688,[10]3698.6451,[11]4032.9414,

it was segfaulting before 😄

@SlyEcho
Collaborator

SlyEcho commented May 3, 2023

It seems to break down after resetting the context.

@ggerganov
Owner

ggerganov commented May 3, 2023

The OpenLLaMA generation fails when the prompt does not start with the BOS token (id 1).
For main, a workaround is to use --keep 1 or more. This guarantees that during a context swap the first token remains BOS.

For perplexity there is no workaround; the fix is to change the chunks to always start with the BOS token.
I am currently evaluating how this affects the existing perplexity results for LLaMA.
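
A minimal sketch of that perplexity-side fix, assuming `tokens` holds the tokenized evaluation text and `BOS = 1` as in LLaMA (illustrative only, not llama.cpp's actual implementation):

```python
BOS = 1          # LLaMA beginning-of-sentence token id
N_CTX = 512      # context size used by the perplexity tool

def chunks_with_bos(tokens, n_ctx=N_CTX, bos=BOS):
    """Split the evaluation tokens into chunks that always start with BOS."""
    step = n_ctx - 1                      # reserve one slot for the prepended BOS
    for start in range(0, len(tokens), step):
        yield [bos] + tokens[start:start + step]
```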

Does anyone know if OpenLLaMA's behavior is correct?
I mean, shouldn't it still work, even if the first token is not BOS?


LLaMA (vanilla)

| Model | Format | PPL (Original) | PPL (With BOS) |
| --- | --- | --- | --- |
| 7B | F16 | 5.9565 | 5.9066 |
| 7B | Q4_0 | 6.2103 | 6.1621 |
| 7B | Q5_1 | 5.9934 | 5.9481 |

@Green-Sky
Collaborator

Note that we use BOS (beginning of sentence) token (id=1) during training, so it is important to prepend this token for best performance during few-shot evaluation.

but why waste an extra slot in context?

@ggerganov
Owner

ggerganov commented May 3, 2023

but why waste an extra slot in context?

I guess it makes the generation more accurate. I think it depends on how the training was performed.
If all training sequences had a BOS at the start, then we have to satisfy this requirement during inference to get a correct result.

The good news is that the fix is trivial.
The bad news is that we need to redo all perplexity calculations.

@TemporalAgent7

TemporalAgent7 commented May 3, 2023

FWIW, this set of instructions worked for me on a Windows 11 machine; they did not work on an Intel mac:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
python3 -m pip install -r requirements.txt
cd models
git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt
cd ..
python3 convert-pth-to-ggml.py models\open_llama_7b_preview_200bt\open_llama_7b_preview_200bt_transformers_weights 1
build\bin\Release\quantize.exe models\open_llama_7b_preview_200bt\open_llama_7b_preview_200bt_transformers_weights\ggml-model-f16.bin models\open_llama_7b_preview_200bt_q5_0.ggml q5_0
build\bin\Release\main.exe -m models\open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 --mlock -p "Building a website can be done in 10 simple steps:"

And these worked for me to convert/quantize the newly released 300bt model:

git clone https://huggingface.co/openlm-research/open_llama_7b_preview_300bt
cd ..
python3 convert-pth-to-ggml.py models\open_llama_7b_preview_300bt\open_llama_7b_preview_300bt_transformers_weights 1
build\bin\Release\quantize.exe models\open_llama_7b_preview_300bt\open_llama_7b_preview_300bt_transformers_weights\ggml-model-f16.bin models\open_llama_7b_preview_300bt_q5_1.ggml q5_1

The results are slightly better but still pretty bad. We need an RLHF / Alpaca fine-tune on top of OpenLLaMA.

@SlyEcho
Collaborator

SlyEcho commented May 4, 2023

The good news is that the fix is trivial
The bad news is that we need to redo all Perplexity calculations

I wouldn't say it affects LLaMA; it is a different model that works differently. Maybe there could be a command-line option --always_bos, although, for users' convenience, it could also be specified in the model file.

EDIT: never mind, I didn't see that the perplexity is lower with this.

@limcheekin

I'd appreciate it if you could share the converted GGML model and show how it works in a Colab notebook.

Thank you.

@sciafri

sciafri commented May 5, 2023

edit: solved this. git-lfs wasn't installed; this is a new WSL distro I've set up and I forgot about that. It seems like a good idea to check the file size or contents (or something similar) and throw a more verbose warning in that case.

I'm getting the following error when trying to run the convert-pth-to-ggml.py script that @TemporalAgent7 had success running. Not sure it matters, but I'm in WSL using Ubuntu 22.04 and Python 3.11. Please let me know if I should open a separate ticket:

~/llama.cpp$ python convert-pth-to-ggml.py models/open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights 1
Loading model file models/open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/pytorch_model-00001-of-00002.bin
Traceback (most recent call last):
  File "~/llama.cpp/convert-pth-to-ggml.py", line 11, in <module>
    convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
  File "~/llama.cpp/convert.py", line 1145, in main
    model_plus = load_some_model(args.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/llama.cpp/convert.py", line 1071, in load_some_model
    models_plus.append(lazy_load_file(path))
                       ^^^^^^^^^^^^^^^^^^^^
  File "~/llama.cpp/convert.py", line 873, in lazy_load_file
    raise ValueError(f"unknown format: {path}")
ValueError: unknown format: models/open_llama_7b_preview_300bt/open_llama_7b_preview_300bt_transformers_weights/pytorch_model-00001-of-00002.bin
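
A minimal sketch of the check suggested in the edit above: when git-lfs is missing, the "weights" are just small pointer text files, which could be detected before convert.py fails with "unknown format" (a hypothetical helper, not part of llama.cpp):

```python
def looks_like_lfs_pointer(path: str) -> bool:
    """Detect a git-lfs pointer file left behind by a clone without git-lfs installed."""
    with open(path, "rb") as f:
        head = f.read(64)
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")

# e.g. call this before loading the weights and print a friendlier error such as
# "this looks like a git-lfs pointer; install git-lfs and re-clone the model"
```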

@vihangd

vihangd commented May 5, 2023

Uploaded the ggml weights to huggingface https://huggingface.co/vihangd/open_llama_7b_300bt_ggml

@Green-Sky
Collaborator

Uploaded the ggml weights to huggingface https://huggingface.co/vihangd/open_llama_7b_300bt_ggml

confirmed sha256 for the q5_1, so probably good

@Green-Sky
Collaborator

Green-Sky commented May 17, 2023

https://github.com/openlm-research/open_llama#update-05152023

Update 05/15/2023

After receiving feedback from the community, we discovered that the tokenizer of our previous checkpoint release was configured incorrectly so that new lines are not preserved. To fix this problem, we have retrained our tokenizer and restarted the model training. We’ve also observed lower training loss with this new tokenizer.


They have also released new previews for 3B and 7B variants.

@vihangd

vihangd commented May 17, 2023

Uploaded the quantized weights for the latest 7B 400bt variant at https://huggingface.co/vihangd/open_llama_7b_400bt_ggml

@ghost

ghost commented May 17, 2023

If we get good results here, please consider using OpenLLaMA as an official recommended model for llama.cpp that can be publicly shared. For development we don't need the latest and greatest model; we need something which is compatible with LLaMA and can be used for running regressions and the like.

I'm not sure how consistent the GitHub CI system is in terms of performance, but having new PRs actually run a real-world test on this model might be useful in the long run.

@Green-Sky
Collaborator

the 7B is still too large for CI, but the 3B ... maybe, with action cache, it looks kinda attractive...

@limcheekin

Uploaded the quantized weights for the latest 7B 400bt variant at https://huggingface.co/vihangd/open_llama_7b_400bt_ggml

Thanks. It works!

@Green-Sky
Collaborator

The 3B is a new variant of the LLaMA architecture, so you need to modify the code to make it work.
Hyperparameters:

llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 25
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 8640
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 3B

@ggerganov The n_ff calculated here

uint32_t n_ff = ((2*(4*hparams.n_embd)/3 + hparams.n_mult - 1)/hparams.n_mult)*hparams.n_mult;

is wrong for the 3B: the model file uses 8640, while that formula calculates 8704.
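
For reference, a quick Python transcription of that expression shows the mismatch (just to check the arithmetic, not part of the codebase):

```python
def n_ff(n_embd: int, n_mult: int) -> int:
    # same integer arithmetic as the C++ line quoted above
    return ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

print(n_ff(4096, 256))  # 11008 -> matches the 7B header
print(n_ff(3200, 256))  # 8704  -> but the 3B checkpoint stores n_ff = 8640
```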

Also, running perplexity on the 3B gives very bad results.

[1]467.9299,[2]881.3974,[3]938.6996,[4]968.3665,[5]880.2046,[6]915.8446,[7]947.3328,[8]948.1425,[9]987.2643,[10]996.8526,[11]1044.6316,[12]1066.0830,[13]1058.7516,[14]1079.4397,[15]1067.0952,[16]1033.7049,[17]1047.4951,[18]1016.5692,[19]1028.3408,[20]1008.3351,[21]1025.7143,[22]996.6403,[23]995.1431,[24]971.5839,[25]942.8443,[26]920.9702,[27]886.4688,[28]864.4770,[29]896.0682,[30]877.3518,[31]866.7190,[32]873.4632,[33]861.8478,[34]859.4715,[35]830.4909,[36]826.7048

so there might be more changes needed, or @young-geng and team did an oopsy again.

@SlyEcho
Collaborator

SlyEcho commented May 17, 2023

Shouldn't n_head be 32, @Green-Sky? (Looking at config.json; just a guess.)

@limcheekin

I fine-tuned an instruction-following 3B model using OpenLLaMA. It is on my HuggingFace now.

For your information, you may refer to the following similar works:

By the way, I'd appreciate it if you could share the link to the Hugging Face repo here.

Thanks.

@Sovenok-Hacker

I fine-tuned an instruction-following 3B model using OpenLLaMA. It is on my HuggingFace now.

For your information, you may refer to the following similar works:

By the way, I'd appreciate it if you could share the link to the Hugging Face repo here.

Thanks.

https://huggingface.co/Sovenok-Hacker/nanoalpaca-3b

@klosax
Collaborator

klosax commented May 30, 2023

Perplexity of OpenLLaMA vs LLaMA:

[chart: openllama_perplexity]

The chart shows the perplexity of each chunk, not the cumulative average.

@xingchensong
Contributor

Perplexity of OpenLLaMA vs LLaMA:

[chart: openllama_perplexity]

The chart shows the perplexity of each chunk, not the cumulative average.

[figure: downstream task performance vs. training tokens, from the LLaMA paper]

According to the LLaMA paper, downstream task performance keeps improving even at 1T tokens (figure above). I think the gap between openllama-7b-700bt and llama-7b-1000bt is reasonable, and the OpenLLaMA team confirmed that they are planning to train the model further. I'm happy to stay tuned for more updates from the OpenLLaMA team~

@klosax
Collaborator

klosax commented May 31, 2023

According to LLAMA paper, the downstream task performance keeps improving even at 1T tokens (figure above)

Yes indeed. Simply fitting the data from figure 2 to curves gives a clearer picture. But at some point the accuracy will stop increasing depending on the model size. Maybe OpenLLaMA will continue training the models until that point is found.

[chart: llama_hellaswag]
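
For illustration only, fitting a saturating curve of the kind described above could look like the sketch below, using scipy.optimize.curve_fit (an assumption, not something used in this thread); the data points are made up, not the figure's actual numbers:

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating(x, a, b, c):
    # 'a' is the plateau the accuracy approaches as training tokens grow
    return a - b * np.exp(-c * x)

tokens_bt = np.array([200, 300, 400, 600, 1000], dtype=float)  # hypothetical (billions of tokens)
accuracy  = np.array([0.48, 0.52, 0.55, 0.58, 0.61])           # hypothetical task accuracy

popt, _ = curve_fit(saturating, tokens_bt, accuracy, p0=(0.7, 0.3, 0.002))
print(popt)  # fitted (a, b, c); 'a' estimates where accuracy stops increasing
```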

@ssenthilanand

I am trying to run llama.cpp on a laptop with a Ryzen 4500U with Vega integrated graphics and 8 GB of RAM. It runs 7B models fine, but I wanted to test the OpenCL acceleration. Turning on OpenCL takes away a portion of the RAM and slows down generation. I wanted to test with the 3B model since it will give me more free RAM, so I downloaded the latest OpenLLaMA 3B checkpoint.

I converted the 3B model and got a nice 1.79 GB q4_0 model. Running it gave:

LLAMA_ASSERT: D:\a\llama.cpp\llama.cpp\llama.cpp:906: false

Looking at the code, it looks like the 3B is not supported. Is there any plan to add support for the 3B models?

Now that support for 3B models is available, I tested it using the latest release files and the weights from:
https://huggingface.co/Sovenok-Hacker/nanoalpaca-3b


1. Pure CPU
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "The capital city of India is:" -n 100
...
llama_print_timings: load time = 948.03 ms
llama_print_timings: sample time = 22.70 ms / 100 runs ( 0.23 ms per token)
llama_print_timings: prompt eval time = 719.71 ms / 8 tokens ( 89.96 ms per token)
llama_print_timings: eval time = 8163.96 ms / 99 runs ( 82.46 ms per token)
llama_print_timings: total time = 9150.66 ms

2. 2 layers to IGP
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "The capital city of India is:" -n 100 -ngl 2
...
ggml_opencl: offloading 2 layers to GPU
ggml_opencl: total VRAM used: 132 MB
...
llama_print_timings: load time = 2413.82 ms
llama_print_timings: sample time = 12.56 ms / 55 runs ( 0.23 ms per token)
llama_print_timings: prompt eval time = 2016.63 ms / 8 tokens ( 252.08 ms per token)
llama_print_timings: eval time = 6233.72 ms / 54 runs ( 115.44 ms per token)
llama_print_timings: total time = 8669.34 ms

3. 12 layers to IGP
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "The capital city of India is:" -n 100 -ngl 12
...
ggml_opencl: offloading 12 layers to GPU
ggml_opencl: total VRAM used: 797 MB
...
llama_print_timings: load time = 5908.14 ms
llama_print_timings: sample time = 23.18 ms / 100 runs ( 0.23 ms per token)
llama_print_timings: prompt eval time = 4221.16 ms / 8 tokens ( 527.65 ms per token)
llama_print_timings: eval time = 44238.62 ms / 99 runs ( 446.85 ms per token)
llama_print_timings: total time = 50187.72 ms

4. All layers in IGP
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "The capital city of India is:" -n 100 -ngl 26
main: build = 607 (ffb06a3)
main: seed = 1685525567
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx90c'
ggml_opencl: device FP16 support: true
...
ggml_opencl: offloading 26 layers to GPU
ggml_opencl: total VRAM used: 1728 MB
...
The capital city of India is: Delhi [end of text]

llama_print_timings: load time = 8076.56 ms
llama_print_timings: sample time = 0.44 ms / 2 runs ( 0.22 ms per token)
llama_print_timings: prompt eval time = 4088.64 ms / 8 tokens ( 511.08 ms per token)
llama_print_timings: eval time = 796.24 ms / 1 runs ( 796.24 ms per token)
llama_print_timings: total time = 8873.74 ms

5. All layers in IGP with a prompt requiring longer response.
PS C:\llamacpp> .\main.exe -m .\nano-alpaca-3b-q4_0-ggml.bin -p "A poem about the pollution in New Delhi" -n 100 -ngl 26
...
llama_print_timings: load time = 7136.19 ms
llama_print_timings: sample time = 16.11 ms / 59 runs ( 0.27 ms per token)
llama_print_timings: prompt eval time = 4295.58 ms / 9 tokens ( 477.29 ms per token)
llama_print_timings: eval time = 52979.77 ms / 58 runs ( 913.44 ms per token)
llama_print_timings: total time = 60143.18 ms


The first 3 runs had the answer, followed by three different random texts with plenty of hallucinations. In test 4 I was surprised by its on-point answer and tried a different prompt.

The prompts are probably not in the correct format, but the relative performance degradation can be seen.

Based on these rather unscientific tests, I am going to use CPU-only inference. Though the IGP can access up to 2 GB of RAM, it makes no difference; going through the IGP slows things down on my laptop.

@klosax
Collaborator

klosax commented Jun 7, 2023

New releases:

OpenLLaMA 3B 1000bt final
OpenLLaMA 7B 1000bt final
OpenLLaMA 13B 600bt

@SlyEcho
Collaborator

SlyEcho commented Jun 7, 2023

I have ggml versions:

All uploaded now.

@ghost

ghost commented Jun 7, 2023

With 3B and 7B released it would be nice for someone with a beefy machine to get perplexity results for the most popular quants.

@klosax
Collaborator

klosax commented Jun 7, 2023

With 3B and 7B released it would be nice for someone with a beefy machine to get perplexity results for the most popular quants.

Perplexity on wiki.test.raw:

openllama-3b-q5_1 : 7.84273862
openllama-7b-q5_1 : 7.03177645

Remember, perplexity is a measure of how "unsure" the model is when predicting the text in the specified file. This is OK when comparing a model with itself, e.g. across different quantization formats. A better measure if you want to compare one model with another would be the Language Model Evaluation Harness; see the Open LLM Leaderboard.
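
For context, the number reported here is just the exponential of the average negative log-likelihood over the evaluated tokens; a minimal sketch (not llama.cpp's actual code, which works on logits in C++):

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probability the model assigned to each evaluated token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```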

@ghost

ghost commented Jun 8, 2023

Remember, perplexity is a measure of how "unsure" the model is when predicting the text in the specified file. This is OK when comparing a model with itself, e.g. across different quantization formats. A better measure if you want to compare one model with another would be the Language Model Evaluation Harness; see the Open LLM Leaderboard.

That was my intention all along, as I wanted to see how well the model quantizes against the F16 baseline. Ideally the results should be similar to the original LLaMA, but you don't know until you try...

@SlyEcho
Collaborator

SlyEcho commented Jun 8, 2023

OK, I did a perplexity run of the new 3B; you can see how it compares to the last one.

| Q | chunk | 600BT | 1000BT |
| --- | --- | --- | --- |
| F16 | [616] | 8.4656 | 7.7861 |
| Q8_0 | [616] | 8.4667 | 7.7874 |
| Q5_1 | [616] | 8.5072 | 7.8424 |
| Q5_0 | [616] | 8.5156 | 7.8474 |
| Q4_1 | [616] | 8.6102 | 8.0483 |
| Q4_0 | [616] | 8.6674 | 8.0962 |

@SlyEcho
Collaborator

SlyEcho commented Jun 8, 2023

Some more comparative perplexity analysis done by @gjmulder: https://github.com/openlm-research/open_llama/discussions/41

@SlyEcho
Collaborator

SlyEcho commented Jun 8, 2023

7B run done:

| Q | score |
| --- | --- |
| Q2_K | 8.5152 |
| Q3_K_S | 7.6623 |
| Q3_K | 7.3837 |
| Q3_K_L | 7.3043 |
| Q4_0 | 7.2116 |
| Q4_1 | 7.1609 |
| Q4_K_S | 7.1516 |
| Q4_K | 7.1116 |
| Q5_0 | 7.0353 |
| Q5_K_S | 7.0325 |
| Q5_1 | 7.0318 |
| Q5_K | 7.0272 |
| Q6_K | 7.0050 |
| Q8_0 | 6.9968 |
| F16 | 6.9966 |

@raffienficiaud

Hi there, wonderful work!

Has this:

We found the problem, it was in the conversion code, there was another n_head = n_embd / 128 type assumption, which I didn't catch at first.

Running perplexity right now...

landed in master? I am still having the issue #1291 (comment)

error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected  3200 x  8704, got  3200 x  8640

when doing the conversion/quantization myself from master (ae9663f) and from HF/openlm-research 3B (q8_0), while the model you posted on HF works for me.

A missing backport?

@Green-Sky
Collaborator

Green-Sky commented Jun 9, 2023

@raffienficiaud We merged the 3B changes without the Python conversion script changes.
#1588 (comment)

@SlyEcho is there an open PR with the hacky Python changes?

@m1chae1bx

@SlyEcho I'm trying out the F16 7B model but I'm getting not-so-good output. I'm using ctransformers. May I know what values you used for the config?

@SlyEcho
Collaborator

SlyEcho commented Jun 9, 2023

when doing the conversion/quantization myself from master (ae9663f) and from HF/openlm-research 3B (q8_0), while the #1291 (comment) on HF works for me.

There is a diff file there where the hacks are used. The whole conversion workflow should be possible to do without many dependencies, just using the Makefile there.

is there an open pr with the hacky python changes?

No, I don't think so.

I see a couple of options to fix it:

  1. Read the config file and get its values.
  2. Allow some values to be overridden from CLI arguments.
  3. Compute a suitable n_mult value that calculates the correct n_ff value (see the sketch after this list).
  4. Add n_ff to the model file (maybe something to consider for the next format?)
  5. Pad the tensors to 256 (I think it should work but I haven't tested it). This may fix the K-quants as well.
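
A minimal sketch of option 3, reusing the n_ff formula quoted earlier in the thread; any n_mult it finds is only a candidate that reproduces the stored value, not necessarily what the model authors used:

```python
def n_ff(n_embd: int, n_mult: int) -> int:
    return ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

target_n_ff = 8640   # value stored in the OpenLLaMA 3B checkpoint
candidates = [m for m in range(1, 1025) if n_ff(3200, m) == target_n_ff]
print(candidates[:3])  # e.g. [108, 120, 135]; any of these reproduces 8640 for n_embd = 3200
```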

I'm trying out the F16 7B model but I'm getting not so good output. I'm using ctransformers. May I know what values did you use for the config?

I didn't create the model, so I don't really know, but it may have something to do with the tokenizer:

Please note that it is advised to avoid using the Hugging Face fast tokenizer for now, as we’ve observed that the auto-converted fast tokenizer sometimes gives incorrect tokenizations. This can be achieved by directly using the LlamaTokenizer class, or passing in the use_fast=False option for the AutoTokenizer class.
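
A minimal sketch of that advice using Hugging Face transformers; the model id below is assumed to be one of the openlm-research repos mentioned in this thread:

```python
from transformers import AutoTokenizer, LlamaTokenizer

# pass use_fast=False so the slow (SentencePiece-based) tokenizer is used
tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", use_fast=False)

# or use the LlamaTokenizer class directly
tok = LlamaTokenizer.from_pretrained("openlm-research/open_llama_7b")
```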

@ochafik
Collaborator

ochafik commented Jun 27, 2023

Re: weird outputs, OpenLLaMA seems to have extra dropout layers in attention and feed-forward (here's something I hacked on tinygrad to make it work).

And potentially some versions have an extra layernorm after the embedding layer (see HF's OpenLlamaModel and how it differs from their LlamaModel).

@young-geng

@ochafik Those dropouts are never used during the pre-training of the model, so I believe that they can be safely ignored. The corresponding model on transformers should be the standard LLaMA instead of Open-Llama.

@ochafik
Collaborator

ochafik commented Jun 27, 2023

@young-geng ahhh, now it makes sense, thank you!

@Green-Sky
Collaborator

Green-Sky commented Jul 7, 2023

New OpenLLaMA just dropped: https://huggingface.co/openlm-research/open_llama_7b_v2

Update 07/07/2023
We are happy to release an OpenLLaMA 7Bv2 model, which is trained on a mixture of Falcon refined-web dataset, mixed with the starcoder dataset, and the wikipedia, arxiv and books and stackexchange from RedPajama. The 3Bv2 model is coming soon.

@SlyEcho
Collaborator

SlyEcho commented Jul 10, 2023

I have ggml files for v2: https://huggingface.co/SlyEcho/open_llama_7b_v2_ggml/tree/main

@klosax
Collaborator

klosax commented Jul 17, 2023

Version 2 of the 3b open llama model: https://huggingface.co/openlm-research/open_llama_3b_v2

@SlyEcho
Collaborator

SlyEcho commented Jul 18, 2023

Uploading 3Bv2: https://huggingface.co/SlyEcho/open_llama_3b_v2_ggml

@github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions bot closed this as completed Apr 9, 2024