
Converting alpaca-native-GPTQ models into ggml models #442

Closed
BadisG opened this issue Mar 23, 2023 · 21 comments
Labels: enhancement (New feature or request), model (Model specific)

Comments

BadisG commented Mar 23, 2023

Expected Behavior

Hello,

I wanted to convert the alpaca-native 7B GPTQ file (.pt file) into a ggml file with the convert-gptq-to-ggml.py script: https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py

Current Behavior

The problem is that I get this error:

D:\Large Language Models\CONVERTISSEURS\gptq to ggml>python convert-gptq-to-ggml.py alpaca-native-4bit.pt tokenizer.model out.bin
32000
32001
Traceback (most recent call last):
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\convert-gptq-to-ggml.py", line 35, in <module>
    assert tokenizer.vocab_size() == n_vocab
AssertionError

32000 is the tokenizer.vocab_size() (the number of tokens in tokenizer.model)
32001 is the n_vocab (the number of tokens in the model)

The model trained with Alpaca has one extra token, and it's this one:
"[PAD]": 32000

It looks like if we want to convert the alpaca-native GPTQ models, we need to create a new tokenizer.model that has this "[PAD]" token in it.

The problem is that I have no idea how to do that... I'd appreciate it if someone could help me with this!

BadisG changed the title from "[User] Insert summary of your issue or enhancement.." to "Converting alpaca-native-GPTQ models into ggml models" on Mar 23, 2023
Ronsor (Contributor) commented Mar 23, 2023

I wrote a tool to add additional tokens to tokenizer.model: https://github.com/Ronsor/llama-tools

The token list:

C [PAD]

would work with the script I wrote.
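
For reference, tokenizer.model is a SentencePiece model, so the same thing can also be done by hand. Here is a minimal sketch of the general idea (my own illustration, not the actual llama-tools script), assuming the sentencepiece and protobuf Python packages are installed:

    # Sketch only: append "[PAD]" to tokenizer.model as a control token
    # ("C" in the token list above). Not the llama-tools script itself.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    m = sp_pb2.ModelProto()
    with open("tokenizer.model", "rb") as f:
        m.ParseFromString(f.read())

    pad = sp_pb2.ModelProto.SentencePiece()
    pad.piece = "[PAD]"
    pad.score = 0.0
    pad.type = sp_pb2.ModelProto.SentencePiece.CONTROL
    m.pieces.append(pad)

    # Write to a new file so the original tokenizer.model is kept intact.
    with open("tokenizer.model.new", "wb") as f:
        f.write(m.SerializeToString())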

BadisG (Author) commented Mar 23, 2023

@Ronsor I used your script and it looks like it did add the token to tokenizer.model.

But now I have a new error... looks like the issue is more complex than I thought 😅

D:\Large Language Models\CONVERTISSEURS\gptq to ggml>python convert-gptq-to-ggml.py alpaca-native-4bit.pt tokenizer.model out.bin
32001
32001
Processing non-Q4 variable: model.embed_tokens.weight with shape:  torch.Size([32001, 4096])  and type:  torch.float32
Processing non-Q4 variable: model.norm.weight with shape:  torch.Size([4096])  and type:  torch.float32
  Converting to float32
Processing non-Q4 variable: lm_head.weight with shape:  torch.Size([32001, 4096])  and type:  torch.float32
Traceback (most recent call last):
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\convert-gptq-to-ggml.py", line 153, in <module>
    convert_q4(f"model.layers.{i}.self_attn.q_proj", f"layers.{i}.attention.wq.weight", permute=True)
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\convert-gptq-to-ggml.py", line 94, in convert_q4
    zeros = model[f"{src_name}.zeros"].numpy()
KeyError: 'model.layers.0.self_attn.q_proj.zeros'

comex (Contributor) commented Mar 24, 2023

Looks like the zeros issue corresponds to a recent commit to GPTQ-for-LLaMa (with a very non-descriptive commit message) which changed the format. The change is not actually specific to Alpaca, but the alpaca-native-GPTQ weights published online were apparently produced with a later version of GPTQ-for-LLaMa. The zeros and scales are now separate for every group of 32 weights, but the zeros are now themselves scaled and quantized… I don't really understand how that makes sense. I'll figure it out when I have a chance.

BadisG (Author) commented Mar 24, 2023

In the convert_q4(src_name, dst_name, permute=False) function I changed:

    zeros = model[f"{src_name}.zeros"].numpy()
    ...
    qweight = model[f"{src_name}.weight"].numpy().T # transpose

to

    zeros = model[f"{src_name}.qzeros"].numpy()
    ...
    qweight = model[f"{src_name}.qweight"].numpy().T # transpose

That results in these dimensions:

    print(grouped.shape) -> (4096, 128, 4)
    print(scales_rep.shape) -> (32, 524288, 1)
    print(addends_rep.shape) -> (32, 65536, 1)

Which gives an error, because we cannot concatenate those objects anymore.

Here's a comparison with the regular llama-7b-gptq model (which works well with the converter):

    print(grouped.shape) -> (4096, 128, 4)
    print(scales_rep.shape) -> (4096, 128, 1)
    print(addends_rep.shape) -> (4096, 128, 1)

At this point I'm stuck, as I'm uncertain about which elements (groupings, scales, addends) to modify in order to achieve the desired concatenation.

gjmulder added the enhancement (New feature or request) and model (Model specific) labels on Mar 24, 2023
BadisG (Author) commented Mar 25, 2023

@comex I'm not sure it was a good idea to convert your addends and scales into int32; those tensors contain really small numbers and we're losing all the information that way:

[screenshots: addends and scales tensor values]

comex (Contributor) commented Mar 25, 2023

They're not 'really' int32s. Each int32 is actually 8 4-bit weights packed together. And they're not converted directly from float to integer; they have to be interpreted together with the addends and scales.
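
As a rough illustration of that packing (my own sketch, assuming the usual GPTQ convention of eight consecutive 4-bit values per int32 with the lowest nibble first; this is not the converter's actual code):

    # Unpack eight 4-bit values from each int32, then combine them with a
    # per-group scale and addend. The packing order is an assumption here.
    import numpy as np

    packed = np.array([0x76543210, 0xFEDCBA98], dtype=np.uint32)   # two packed words
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = ((packed[:, None] >> shifts) & 0xF).reshape(-1)      # 16 values in [0, 15]
    print(nibbles)                                                 # [ 0  1  2 ... 15]

    # The raw 4-bit values only become weights once the per-group scales and
    # addends (zeros) are applied, e.g. w = scale * q + addend for each group.
    scale, addend = 0.01, -0.08    # made-up values for a single group
    weights = scale * nibbles + addend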

daboe01 (Contributor) commented Mar 25, 2023

Maybe you'll get lucky with this one?
https://huggingface.co/ozcur/alpaca-native-4bit/tree/main
Maybe it was generated just before the zeros patch was merged.

Belluxx commented Mar 25, 2023


Just tried it; it fails with KeyError: 'model.layers.0.self_attn.q_proj.zeros'

comex (Contributor) commented Mar 26, 2023

I spent some time today working on this but didn't finish.

BadisG (Author) commented Mar 26, 2023

oobabooga merged a PR that makes the alpaca-7b-4bit-GPTQ-native model work now:
oobabooga/text-generation-webui@49c10c5

Funnily enough, it works even though it uses the exact same tokenizer.model (the one with 32000 tokens) while this model has one more.

daboe01 (Contributor) commented Mar 26, 2023

Cool! Do you see any significant improvements from GPTQ?

BadisG (Author) commented Mar 26, 2023

@daboe01 I have the RTN-quantized model on llama.cpp and the GPTQ-quantized one on the webui, but it would be hard to compare the two as they work a bit differently.

The best comparison would be RTN vs GPTQ in llama.cpp with a perplexity test. I'll wait for @comex to do his magic! 👀

comex (Contributor) commented Mar 27, 2023

PR is up; please try it and let me know if there are issues.

The PR consists of a new script which is meant to replace the existing ones; run it with a command like:
python convert.py alpaca-native-4bit.pt --vocab-dir VOCAB_DIR
where VOCAB_DIR is a directory containing both tokenizer.model and added_tokens.json (the latter of which is specific to Alpaca).

BadisG (Author) commented Mar 27, 2023

I just tried it and it works like a charm!! GPTQ-quantized models will become the standard, and thanks to you, CPU users can enjoy them as well.

Thanks again for your really important work 😄 👍

Belluxx commented Mar 27, 2023

@BadisG Did you notice an increase in model size after converting to ggml? The 7B one I converted went from 3.77 GB to 5.39 GB, and inference is significantly slower, but it works.

BadisG (Author) commented Mar 27, 2023

@Belluxx Yeah, the file got bigger. Maybe it could be optimized further, I don't know; only @comex has the explanation for that 😅

Belluxx commented Mar 27, 2023

@BadisG Thanks for the info, at least now I know that it's not just me.

comex (Contributor) commented Mar 27, 2023

Hmm, it's probably because of the addends (aka zeros). The newer GPTQ-for-LLaMa format quantizes the addends, but llama.cpp doesn't support that, so the script dequantizes them. I didn't realize it would make that big of a difference in size; sounds like it would be useful to add native support for quantized addends to llama.cpp.

But I don't know what you mean by "inference is significantly slower". Compared to what? If the comparison is to a GPU implementation then yes, llama.cpp will be slower.

Belluxx commented Mar 27, 2023

@comex Thank you for the explanation. About the slower inference, I forgot to mention that it was due to swapping, since I only have 8 GB of RAM. However, it's a bit odd, since I didn't have anything else open or running in the background.

BadisG (Author) commented Mar 27, 2023

Yeah, it's a bit slower when using the GPTQ version:

Regular RTN quantization:

main: seed = 1
system_info: n_threads = 14 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.700000, top_k = 40, top_p = 0.100000, repeat_last_n = 2048, repeat_penalty = 1.250000
generate: n_ctx = 2024, n_batch = 8, n_predict = 2024, n_keep = 0


 Here's 5 reasons that proves video-games are good for your brain:
1. Video games can help improve cognitive skills such as memory, problem solving and reaction time. Studies have found that regular gamers show improved performance in these areas compared to non-players.
2. Research has also shown that playing action or adventure games increases the density of neurons in the hippocampus, which is associated with learning and emotional processing. This suggests that gaming could be beneficial for overall mental health.
3. Playing puzzle and strategy games helps sharpen abstract thinking abilities by requiring players to think ahead and plan strategies. These types of games may even increase creativity levels.
4. In addition, research shows that engaging in mentally challenging activities like gaming can reduce inflammation in the brain, protect against age-related declines in cognition, and slow down the progression of neurodegenerative diseases.
5. Finally, studies suggest that virtual reality (VR) technology offers a unique opportunity to explore how different experiences affect people’s brains. VR provides an immersive experience that allows users to interact with digital environments while being monitored through physiological measures. Through this type of experiment, scientists hope to gain insight into how our minds work and what effects certain stimuli might have on us both psychologically and physiologically. [end of text]

llama_print_timings:        load time =  6639.43 ms
llama_print_timings:      sample time =   955.15 ms /   283 runs   (    3.38 ms per run)
llama_print_timings: prompt eval time =  1715.08 ms /    19 tokens (   90.27 ms per token)
llama_print_timings:        eval time = 60649.22 ms /   282 runs   (  215.07 ms per run)
llama_print_timings:       total time = 70376.90 ms

GPTQ quantization:

main: seed = 1
system_info: n_threads = 14 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.700000, top_k = 40, top_p = 0.100000, repeat_last_n = 2048, repeat_penalty = 1.250000
generate: n_ctx = 2024, n_batch = 500, n_predict = 2024, n_keep = 0


 Here's 5 reasons that proves video-games are good for your brain:
1. Improves problem solving skills - Playing puzzle and strategy games can help improve problem solving skills by requiring players to think logically, strategize and make decisions in order to progress through the game. This type of thinking is useful when applied to real life situations where logical thought processes need to be employed.
2. Enhances spatial awareness – Many action adventure or first person shooter (FPS) games require quick reflexes as well as an understanding of how to maneuver around obstacles on a virtual map. These types of games enhance one’s spatial awareness which helps with navigation in everyday life.
3. Boosts memory retention– Memory retention refers to the ability to remember information over time. Video games have been found to increase short term recall and long term storage of information in the brain. Studies show improved cognitive function after playing certain video games.
4. Strengthens hand eye coordination – Playing fast paced action games such as FPS or fighting games requires excellent hand eye coordination. The act of quickly aiming and shooting at targets has been shown to strengthen this skill set in gamers. Increased accuracy leads to better reaction times in other areas of gaming and even sports.
5. Encourages creative thinking – Creative thinking involves using abstract thoughts to solve problems. Games like brainteasers, logic puzzles and riddles encourage out of the box solutions to complex issues. This encourages innovation and lateral thinking which can lead to new ideas and inventions. [end of text]

llama_print_timings:        load time =  2094.55 ms
llama_print_timings:      sample time =  1084.30 ms /   331 runs   (    3.28 ms per run)
llama_print_timings: prompt eval time =  2227.22 ms /    19 tokens (  117.22 ms per token)
llama_print_timings:        eval time = 87885.16 ms /   330 runs   (  266.32 ms per run)
llama_print_timings:       total time = 93656.60 ms

The eval time goes from about 215 ms to 266 ms per token, so something like 20-25% slower. That's probably expected, because the RTN version is 4.1 GB and the GPTQ version is 5.2 GB (a ~27% difference).
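
Doing the arithmetic on the figures above (just ratios of the numbers already posted in the two logs):

    # Ratios computed from the eval timings and file sizes quoted above.
    rtn_ms_per_token, gptq_ms_per_token = 215.07, 266.32
    print(f"eval slowdown: {gptq_ms_per_token / rtn_ms_per_token - 1:.0%}")   # ~24%

    rtn_gb, gptq_gb = 4.1, 5.2
    print(f"size increase: {gptq_gb / rtn_gb - 1:.0%}")                       # ~27%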
