
Converting alpaca-native-GPTQ models into ggml models #442

Closed
BadisG opened this issue Mar 23, 2023 · 21 comments
Labels: enhancement (New feature or request), model (Model specific)

Comments

BadisG commented Mar 23, 2023

Expected Behavior

Hello,

I wanted to convert the alpaca-native 7B GPTQ file (.pt file) into a ggml file with the convert-gptq-to-ggml.py script: https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py

Current Behavior

The problem is that I get this error:

D:\Large Language Models\CONVERTISSEURS\gptq to ggml>python convert-gptq-to-ggml.py alpaca-native-4bit.pt tokenizer.model out.bin
32000
32001
Traceback (most recent call last):
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\convert-gptq-to-ggml.py", line 35, in <module>
    assert tokenizer.vocab_size() == n_vocab
AssertionError

32000 is the tokenizer.vocab_size() (the number of tokens in tokenizer.model)
32001 is the n_vocab (the number of tokens in the model)

The model trained with Alpaca has one extra token, and it's this one:
"[PAD]": 32000

It looks like if we want to convert the alpaca-native GPTQ models, we need to create a new tokenizer.model that has this "[PAD]" token in it.

The problem is that I have no idea how to do that... I'd appreciate it if someone could help me with this!

BadisG changed the title from "[User] Insert summary of your issue or enhancement.." to "Converting alpaca-native-GPTQ models into ggml models" on Mar 23, 2023
Ronsor (Contributor) commented Mar 23, 2023

I wrote a tool to add additional tokens to tokenizer.model: https://github.com/Ronsor/llama-tools

The token list:

C [PAD]

would work with the script I wrote.
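
For reference, tokenizer.model is a SentencePiece model, so the same thing can also be done by hand. Here is a minimal sketch of the general idea (my own illustration, not the actual llama-tools script), assuming the sentencepiece and protobuf Python packages are installed:

    # Sketch only: append "[PAD]" to tokenizer.model as a control token
    # ("C" in the token list above). Not the llama-tools script itself.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    m = sp_pb2.ModelProto()
    with open("tokenizer.model", "rb") as f:
        m.ParseFromString(f.read())

    pad = sp_pb2.ModelProto.SentencePiece()
    pad.piece = "[PAD]"
    pad.score = 0.0
    pad.type = sp_pb2.ModelProto.SentencePiece.CONTROL
    m.pieces.append(pad)

    # Write to a new file so the original tokenizer.model is kept intact.
    with open("tokenizer.model.new", "wb") as f:
        f.write(m.SerializeToString())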

BadisG (Author) commented Mar 23, 2023

@Ronsor I used your script and it looks like it did add the token to tokenizer.model.

But now I have a new error... looks like the issue is more complex than I thought 😅

D:\Large Language Models\CONVERTISSEURS\gptq to ggml>python convert-gptq-to-ggml.py alpaca-native-4bit.pt tokenizer.model out.bin
32001
32001
Processing non-Q4 variable: model.embed_tokens.weight with shape:  torch.Size([32001, 4096])  and type:  torch.float32
Processing non-Q4 variable: model.norm.weight with shape:  torch.Size([4096])  and type:  torch.float32
  Converting to float32
Processing non-Q4 variable: lm_head.weight with shape:  torch.Size([32001, 4096])  and type:  torch.float32
Traceback (most recent call last):
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\convert-gptq-to-ggml.py", line 153, in <module>
    convert_q4(f"model.layers.{i}.self_attn.q_proj", f"layers.{i}.attention.wq.weight", permute=True)
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\convert-gptq-to-ggml.py", line 94, in convert_q4
    zeros = model[f"{src_name}.zeros"].numpy()
KeyError: 'model.layers.0.self_attn.q_proj.zeros'

comex (Contributor) commented Mar 24, 2023

Looks like the zeros issue corresponds to a recent commit to GPTQ-for-LLaMa (with a very non-descriptive commit message) which changed the format. The change is not actually specific to Alpaca, but the alpaca-native-GPTQ weights published online were apparently produced with a later version of GPTQ-for-LLaMa. The zeros and scales are now separate for every group of 32 weights, but the zeros are now themselves scaled and quantized… I don't really understand how that makes sense. I'll figure it out when I have a chance.

BadisG (Author) commented Mar 24, 2023

In the convert_q4(src_name, dst_name, permute=False) function I changed:

    zeros = model[f"{src_name}.zeros"].numpy()
    ...
    qweight = model[f"{src_name}.weight"].numpy().T # transpose

to

    zeros = model[f"{src_name}.qzeros"].numpy()
    ...
    qweight = model[f"{src_name}.qweight"].numpy().T # transpose

That results in these dimensions:

    print(grouped.shape) -> (4096, 128, 4)
    print(scales_rep.shape) -> (32, 524288, 1)
    print(addends_rep.shape) -> (32, 65536, 1)

Which gives an error, because we cannot concatenate those objects anymore.

Here's a comparison with the regular llama-7b-gptq model (which works well with the converter):

    print(grouped.shape) -> (4096, 128, 4)
    print(scales_rep.shape) -> (4096, 128, 1)
    print(addends_rep.shape) -> (4096, 128, 1)

At this point I'm stuck, as I'm uncertain about which elements (groupings, scales, addends) to modify in order to achieve the desired concatenation.

gjmulder added the enhancement (New feature or request) and model (Model specific) labels on Mar 24, 2023
BadisG (Author) commented Mar 25, 2023

@comex I'm not sure it was a good idea to convert your addends and scales into int32; those tensors contain really small numbers and we're losing all the information that way:

[screenshots: addends and scales tensor values]

comex (Contributor) commented Mar 25, 2023

They're not 'really' int32s. Each int32 is actually 8 4-bit weights packed together. And they're not converted directly from float to integer; they have to be interpreted together with the addends and scales.
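
As a rough illustration of that packing (my own sketch, assuming the usual GPTQ convention of eight consecutive 4-bit values per int32 with the lowest nibble first; this is not the converter's actual code):

    # Unpack eight 4-bit values from each int32, then combine them with a
    # per-group scale and addend. The packing order is an assumption here.
    import numpy as np

    packed = np.array([0x76543210, 0xFEDCBA98], dtype=np.uint32)   # two packed words
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = ((packed[:, None] >> shifts) & 0xF).reshape(-1)      # 16 values in [0, 15]
    print(nibbles)                                                 # [ 0  1  2 ... 15]

    # The raw 4-bit values only become weights once the per-group scales and
    # addends (zeros) are applied, e.g. w = scale * q + addend for each group.
    scale, addend = 0.01, -0.08    # made-up values for a single group
    weights = scale * nibbles + addend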

daboe01 (Contributor) commented Mar 25, 2023

Maybe you'll get lucky with this one?
https://huggingface.co/ozcur/alpaca-native-4bit/tree/main
Maybe it was generated just before the zeros patch was merged.

Belluxx commented Mar 25, 2023


Just tried it; it fails with KeyError: 'model.layers.0.self_attn.q_proj.zeros'

comex (Contributor) commented Mar 26, 2023

I spent some time today working on this but didn't finish.

BadisG (Author) commented Mar 26, 2023

oobabooga merged a PR that makes the alpaca-7b-4bit-GPTQ-native model work now:
oobabooga/text-generation-webui@49c10c5

Funnily enough, it works even though it uses the exact same tokenizer.model (the one with 32000 tokens) while this model has one more.

daboe01 (Contributor) commented Mar 26, 2023

Cool! Do you see any significant improvements from GPTQ?

BadisG (Author) commented Mar 26, 2023

@daboe01 I have the RTN-quantized model on llama.cpp and the GPTQ-quantized one on the webui, but it would be hard to compare the two as they work a bit differently.

The best comparison would be RTN vs GPTQ in llama.cpp with a perplexity test. I'll wait for @comex to do his magic! 👀

comex (Contributor) commented Mar 27, 2023

PR is up; please try it and let me know if there are issues.

The PR consists of a new script which is meant to replace the existing ones; run it with a command like:
python convert.py alpaca-native-4bit.pt --vocab-dir VOCAB_DIR
where VOCAB_DIR is a directory containing both tokenizer.model and added_tokens.json (the latter of which is specific to Alpaca).

BadisG (Author) commented Mar 27, 2023

I just tried it and it works like a charm!! GPTQ-quantized models will become the standard, and thanks to you, CPU users can enjoy them as well.

Thanks again for your really important work 😄 👍

Belluxx commented Mar 27, 2023

@BadisG Did you notice an increase in model size after converting to ggml? The 7B one I converted went from 3.77 GB to 5.39 GB, and inference is significantly slower, but it works.

BadisG (Author) commented Mar 27, 2023

@Belluxx Yeah, the file got bigger. Maybe it could be optimized further, I don't know; only @comex has the explanation for that 😅

Belluxx commented Mar 27, 2023

@BadisG Thanks for the info, at least now I know that it's not just me.

comex (Contributor) commented Mar 27, 2023

Hmm, it's probably because of the addends (aka zeros). The newer GPTQ-for-LLaMa format quantizes the addends, but llama.cpp doesn't support that, so the script dequantizes them. I didn't realize it would make that big of a difference in size; sounds like it would be useful to add native support for quantized addends to llama.cpp.

But I don't know what you mean by "inference is significantly slower". Compared to what? If the comparison is to a GPU implementation then yes, llama.cpp will be slower.

Belluxx commented Mar 27, 2023

@comex Thank you for the explanation. About the slower inference, I forgot to mention that it was due to swapping, since I only have 8 GB of RAM. However, it's a bit odd, since I didn't have anything else open or running in the background.

BadisG (Author) commented Mar 27, 2023

Yeah, it's a bit slower when using the GPTQ version:

Regular RTN quantization:

main: seed = 1
system_info: n_threads = 14 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.700000, top_k = 40, top_p = 0.100000, repeat_last_n = 2048, repeat_penalty = 1.250000
generate: n_ctx = 2024, n_batch = 8, n_predict = 2024, n_keep = 0


 Here's 5 reasons that proves video-games are good for your brain:
1. Video games can help improve cognitive skills such as memory, problem solving and reaction time. Studies have found that regular gamers show improved performance in these areas compared to non-players.
2. Research has also shown that playing action or adventure games increases the density of neurons in the hippocampus, which is associated with learning and emotional processing. This suggests that gaming could be beneficial for overall mental health.
3. Playing puzzle and strategy games helps sharpen abstract thinking abilities by requiring players to think ahead and plan strategies. These types of games may even increase creativity levels.
4. In addition, research shows that engaging in mentally challenging activities like gaming can reduce inflammation in the brain, protect against age-related declines in cognition, and slow down the progression of neurodegenerative diseases.
5. Finally, studies suggest that virtual reality (VR) technology offers a unique opportunity to explore how different experiences affect people’s brains. VR provides an immersive experience that allows users to interact with digital environments while being monitored through physiological measures. Through this type of experiment, scientists hope to gain insight into how our minds work and what effects certain stimuli might have on us both psychologically and physiologically. [end of text]

llama_print_timings:        load time =  6639.43 ms
llama_print_timings:      sample time =   955.15 ms /   283 runs   (    3.38 ms per run)
llama_print_timings: prompt eval time =  1715.08 ms /    19 tokens (   90.27 ms per token)
llama_print_timings:        eval time = 60649.22 ms /   282 runs   (  215.07 ms per run)
llama_print_timings:       total time = 70376.90 ms

GPTQ quantization:

main: seed = 1
system_info: n_threads = 14 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.700000, top_k = 40, top_p = 0.100000, repeat_last_n = 2048, repeat_penalty = 1.250000
generate: n_ctx = 2024, n_batch = 500, n_predict = 2024, n_keep = 0


 Here's 5 reasons that proves video-games are good for your brain:
1. Improves problem solving skills - Playing puzzle and strategy games can help improve problem solving skills by requiring players to think logically, strategize and make decisions in order to progress through the game. This type of thinking is useful when applied to real life situations where logical thought processes need to be employed.
2. Enhances spatial awareness – Many action adventure or first person shooter (FPS) games require quick reflexes as well as an understanding of how to maneuver around obstacles on a virtual map. These types of games enhance one’s spatial awareness which helps with navigation in everyday life.
3. Boosts memory retention– Memory retention refers to the ability to remember information over time. Video games have been found to increase short term recall and long term storage of information in the brain. Studies show improved cognitive function after playing certain video games.
4. Strengthens hand eye coordination – Playing fast paced action games such as FPS or fighting games requires excellent hand eye coordination. The act of quickly aiming and shooting at targets has been shown to strengthen this skill set in gamers. Increased accuracy leads to better reaction times in other areas of gaming and even sports.
5. Encourages creative thinking – Creative thinking involves using abstract thoughts to solve problems. Games like brainteasers, logic puzzles and riddles encourage out of the box solutions to complex issues. This encourages innovation and lateral thinking which can lead to new ideas and inventions. [end of text]

llama_print_timings:        load time =  2094.55 ms
llama_print_timings:      sample time =  1084.30 ms /   331 runs   (    3.28 ms per run)
llama_print_timings: prompt eval time =  2227.22 ms /    19 tokens (  117.22 ms per token)
llama_print_timings:        eval time = 87885.16 ms /   330 runs   (  266.32 ms per run)
llama_print_timings:       total time = 93656.60 ms

The eval time goes from about 215 ms to 266 ms per token, so something like 20-25% slower. That's probably expected, because the RTN version is 4.1 GB and the GPTQ version is 5.2 GB (a ~27% difference).
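
Doing the arithmetic on the figures above (just ratios of the numbers already posted in the two logs):

    # Ratios computed from the eval timings and file sizes quoted above.
    rtn_ms_per_token, gptq_ms_per_token = 215.07, 266.32
    print(f"eval slowdown: {gptq_ms_per_token / rtn_ms_per_token - 1:.0%}")   # ~24%

    rtn_gb, gptq_gb = 4.1, 5.2
    print(f"size increase: {gptq_gb / rtn_gb - 1:.0%}")                       # ~27%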
