GemmaMLP uses `tanh` approximation for GeLU activation #1004
Conversation
Thank you! You'll have to xfail the HF comparison tests for now
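(A minimal sketch of how such an xfail marker might look; the test name and reason string are hypothetical, not the actual tests in this repo.)

```python
import pytest

# Hypothetical example: mark the HF comparison test as an expected failure
# until HF Transformers also switches Gemma to the tanh-approximate GeLU.
@pytest.mark.xfail(
    reason="HF Transformers Gemma still uses exact GeLU; outputs differ slightly",
    strict=False,
)
def test_gemma_against_hf():
    ...
```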
First I need to understand why.
I looked through the code maybe a dozen times and couldn't find anything wrong.
Thanks for looking into this! Before merging, should we run a few eval harness tasks to compare the before/after performance? E.g., hellaswag and TruthfulQA should be sufficient. If you don't have access to GPUs, I am happy to do that @Andrei-Aksionov, just let me know.
I have access, but I'm busy with the other task (dropping interleaved placement in QKV, which should unlock your OLMo integration work).
No worries, I can do it.
It doesn't seem to perform differently without and with the PR:

# before
hellaswag acc norm: 0.4230233021310496
truthful qa mc1: 0.24724602203182375
# after
hellaswag acc norm: 0.4230233021310496
truthful qa mc1: 0.24724602203182375

Btw the hellaswag score is really bad, with tinyllama I get 0.60. Maybe it's a weird benchmark.
Argh ... 🤦♂️. Need to find something else ... but in any case, I think the change via the PR seems to be ok!?
See also here: https://twitter.com/danielhanchen/status/1765446273661075609
Thanks for the link, @kashif.
Hi there 👋
Fixes #999
As @carmocca found out, the original Keras implementation of Gemma used the `tanh` approximation for GeLU activation, but neither the PyTorch implementation nor the HF Transformers variant had it. Recently, the official PyTorch variant was updated to also use this approximation, and this PR reflects that change.
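For reference, a minimal sketch of the difference in plain PyTorch (illustrative only, not the actual GemmaMLP code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)

# Exact GeLU: x * Phi(x), with Phi the standard normal CDF.
exact = F.gelu(x)

# tanh approximation, as in the original Keras and updated official PyTorch Gemma:
# 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
approx = F.gelu(x, approximate="tanh")

# The two differ slightly, which is why the HF comparison tests are xfailed
# until the corresponding change lands in transformers.
print((exact - approx).abs().max())
```

In a GemmaMLP-style module this would typically amount to constructing the activation as `nn.GELU(approximate="tanh")` (or the equivalent functional call above) instead of the default exact variant.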