Add attention and final logit soft-capping, update scaling factor to Gemma2 #8197
Conversation
This is absolutely awesome!
Will this require updating the commit of llama.cpp, or just the gguf?
Co-authored-by: slaren <slarengh@gmail.com>
Perplexity is looking much better now with the 9b base model.
9b-it also improved significantly:
The chat template keeps adding one line on each reply; there is probably something wrong there.
Looks really nice. This would require regenerating gguf files, as existing gguf files don't have the required keys. Attempting to load an old gguf file with this PR results in this error:
@slaren thank you for checking!
Do you mean like this? I do see that most of the time, but occasionally it does generate the eot without adding an extra newline at the end.
I am using the built-in chat template support, with
Sometimes there are one or two extra lines, which seems odd.
That's when you're loading an older gemma2 gguf, correct? Yes, those will need to be regenerated.
Re comment: #8198 (comment) I made a dirty patch to see if that's the root cause or not, but it seems like the model still wants to output multiple new lines:
Explanation: token 108. But for some reason, it still outputs token 111. Edit: this is the 3rd message in the conversation. I did not add
…cpp into add-gemma2-soft-capping
Regarding the new lines - maybe it's best to compare with the reference inference results and see if these are expected
BTW, now I am thinking that this is also probably needed. From the Gemma 2 technical paper:
I am not sure if this is implemented or not (I am still weak in C++, but I am trying to find it in the code). Edit: Looks like HF Transformers has it. Edit 2: We don't even have sliding window attention logic in either the llm_build_kv or llm_build_kv_store functions, nor in the attention implementation for gemma2... Am I missing something?
It seems that there's another difference as well: the attention scaling factor is a custom value of
I hope this fixes the Gemma and Phi models, which are both broken now: they get stuck in a forever-repeating loop after they exceed their context length. Changing n_predict does not work.
@qnixsynapse No, it's not implemented, but since ggerganov refactored the KQ masking a while ago, the change should be simple; the issue for SWA was also just reopened. I already tried implementing it for gemma2, see my comment here. EDIT:
Good catch. Well, that's going to be harder. llama.cpp currently shares the mask between layers. It's not impossible to just add a second one, actually it's really simple, but I can't think of a clean way to implement it. I'll hack something together and see if it works better than just always doing sliding window.
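A minimal sketch (illustrative only, not llama.cpp code) of the masking difference being discussed: with sliding window attention, a query at position i may only attend to key positions j with 0 <= i - j < window, whereas a plain causal mask only requires j <= i. The mask layout and function name below are assumptions for illustration.

```cpp
// Illustrative sliding-window causal mask, row-major [n_tokens x n_tokens]:
// entries are 0 where attention is allowed and -inf where it is masked out.
#include <limits>
#include <vector>

static std::vector<float> build_swa_mask(int n_tokens, int window) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(n_tokens * n_tokens, neg_inf);
    for (int i = 0; i < n_tokens; ++i) {   // query position
        for (int j = 0; j <= i; ++j) {     // causal constraint: j <= i
            if (i - j < window) {          // sliding-window constraint
                mask[i * n_tokens + j] = 0.0f;
            }
        }
    }
    return mask;
}
```

A per-layer variant would amount to keeping two such masks, the full causal one and the windowed one, and selecting between them per layer, which is roughly the "second mask" idea mentioned above.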
@qnixsynapse I hacked it in here: https://github.com/arlo-phoenix/llama.cpp/tree/gemma2 Doesn't need new ggufs to work. Output is better at the start, but then runs into a repeating issue, so something is probably still missing or I messed something up. I'll stop for today though.
I've updated the scaling factor approach to compute the value instead of adding yet another key to the gguf. This value is computed using
@arlo-phoenix Good work! I will take a look at it today (Sunday). Also, just to mention, I am still finding HF Transformers' Gemma 2 implementation somewhat subpar, since the model hosted on HF Chat still gives much lower quality responses than the model hosted on Google AI Studio. Not sure if the transformers library being used there is updated with the latest fixes or not.
Now that this is merged, if I make a new GGUF, am I good to go? Or are there other fixes I should wait for?
…tor to Gemma2 (ggerganov#8197)
* Add attention and final logit softcapping.
* fix
* Add custom add_ functions
* Disable flash attention for Gemma2
* Update src/llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* Add default value for attention and final logit softcap value
* Add custom kq scaling from Gemma2Attention
* Remove custom pre attention scaling and use computed value instead.
---------
Co-authored-by: slaren <slarengh@gmail.com>
@ddh0 Currently, @arlo-phoenix is using a hardcoded window size for testing. I think it is best to regenerate ggufs once this (alternate SWA) is fixed.
This is a cherry-pick of ggerganov/llama.cpp#8197
```
@@ -11106,6 +11123,12 @@ struct llm_build_context {

    // lm_head
    cur = ggml_mul_mat(ctx0, model.output, cur);

    // final logit soft-capping
    cur = ggml_scale(ctx0, cur, 1.0f / hparams.f_final_logit_softcapping);
```
Total nitpick that probably should be ignored. I came here out of curiosity, and I know this is merged by now and I have absolutely no place to comment. But isn't this similar logic to lines 7594 - 7596? While I'm a proponent of the "rule of 3", I think there's merit in extracting it into something like a separate apply_softcap method, for educational purposes at least (it gives the opportunity to add docs explaining what it does, single responsibility principle and all that; also, I know for sure that if I had to fix a bug in it, I'd fix it in one place and forget to update the other).
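For illustration, a hedged sketch of what such an extracted helper could look like, assuming the soft-capping formulation described for Gemma 2 (cap * tanh(x / cap)) and the existing ggml_scale / ggml_tanh ops. The name apply_softcap is the hypothetical one suggested above, not code from this PR:

```cpp
#include "ggml.h"

// Hypothetical helper: soft-caps the values of `cur` to the range (-cap, cap)
// by computing cap * tanh(cur / cap).
static struct ggml_tensor * apply_softcap(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        float                 cap) {
    cur = ggml_scale(ctx, cur, 1.0f / cap); // divide by the cap
    cur = ggml_tanh (ctx, cur);             // squash to (-1, 1)
    cur = ggml_scale(ctx, cur, cap);        // scale back up to (-cap, cap)
    return cur;
}
```

Both the attention-score capping and the final-logit capping could then call this with their respective cap values.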
This PR adds the missing attention and final logit soft-capping. The implementation is referenced from Hugging Face Transformers. Additionally, Gemma2 applies a pre-attention scaling of hidden_size / num_attention_heads.
NOTE: attention soft-capping is not compatible with flash attention, so flash attention is disabled when loading the model.
Once this PR is finalised / merged, the gguf will need to be regenerated to include the soft-capping scales.
@slaren let me know if the kv names / hparams should be changed, or if anything stands out to you.
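As a rough illustration of the computed scaling mentioned above (the model dimensions below are assumed example values, not taken from this PR): with a pre-attention scalar of hidden_size / num_attention_heads, the KQ scale applied to the attention scores would be its inverse square root, in contrast to the usual 1 / sqrt(head_dim).

```cpp
// Sketch: computing the KQ scale from hidden_size / num_attention_heads.
// The concrete numbers are assumed for illustration only.
#include <cmath>
#include <cstdio>

int main() {
    const float hidden_size         = 3584.0f; // assumed example value
    const float num_attention_heads = 16.0f;   // assumed example value

    const float pre_attn_scalar = hidden_size / num_attention_heads; // 224
    const float kq_scale        = 1.0f / std::sqrt(pre_attn_scalar); // ~0.0668

    std::printf("kq_scale = %f\n", kq_scale);
    return 0;
}
```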