
GemmaMLP uses `tanh` approximation for GeLU activation #1004

Merged

Conversation

Andrei-Aksionov
Collaborator

Hi there 👋

Fixes #999

As @carmocca found out, the original Keras implementation of Gemma used the tanh approximation for the GeLU activation, but neither the PyTorch implementation nor the HF Transformers variant had it.
Recently, the official PyTorch variant was updated to use this approximation as well, and this PR reflects that change.
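
For context, here is a minimal sketch of what the change amounts to, assuming a gated GemmaMLP-style block; the class and attribute names are illustrative, not the exact litgpt code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GemmaMLPSketch(nn.Module):
    """Illustrative gated feed-forward block; the real litgpt GemmaMLP may differ."""

    def __init__(self, n_embd: int, intermediate_size: int) -> None:
        super().__init__()
        self.fc_1 = nn.Linear(n_embd, intermediate_size, bias=False)  # gate projection
        self.fc_2 = nn.Linear(n_embd, intermediate_size, bias=False)  # up projection
        self.proj = nn.Linear(intermediate_size, n_embd, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Before: F.gelu(self.fc_1(x)) — the exact, erf-based GeLU.
        # After:  approximate="tanh" — matches the original Keras implementation.
        return self.proj(F.gelu(self.fc_1(x), approximate="tanh") * self.fc_2(x))
```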

Contributor

@carmocca carmocca left a comment


Thank you! You'll have to xfail the HF comparison tests for now
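
For reference, a hedged sketch of what xfailing such a test could look like with pytest; the test name below is hypothetical, not the actual one in litgpt's test suite:

```python
import pytest

@pytest.mark.xfail(
    reason="HF Transformers Gemma does not yet use the tanh GeLU approximation",
    strict=False,  # remove the marker once the upstream implementations agree
)
def test_against_hf_gemma():  # hypothetical test name
    ...
```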

@Andrei-Aksionov Andrei-Aksionov marked this pull request as draft March 5, 2024 11:33
@Andrei-Aksionov
Collaborator Author

> You'll have to xfail the HF comparison tests for now

First I need to understand why test_model.py doesn't fail while test_convert_lit_checkpoint.py does.
Maybe there is something wrong with the conversion script (or the test), and this approximation just makes it more pronounced?

@Andrei-Aksionov Andrei-Aksionov marked this pull request as ready for review March 5, 2024 14:09
@Andrei-Aksionov Andrei-Aksionov requested a review from carmocca March 5, 2024 14:09
@Andrei-Aksionov
Collaborator Author

I looked through the code maybe a dozen times and couldn't find anything wrong.
I still don't understand why the test passes when we convert from HF to Lit, but fails when we convert from Lit to HF.
If, in a single test, I first convert HF -> Lit and then right away Lit -> HF, the test passes 🤷.
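
To make the puzzle concrete, here is a sketch of the round-trip check described above; convert_hf_to_lit and convert_lit_to_hf are placeholder names standing in for the real conversion scripts:

```python
import torch

# Placeholders: the real converters remap parameter names and layouts.
# Identity mappings are used here only to keep the sketch executable.
def convert_hf_to_lit(state_dict: dict) -> dict:
    return dict(state_dict)

def convert_lit_to_hf(state_dict: dict) -> dict:
    return dict(state_dict)

def round_trip_matches(hf_state_dict: dict) -> bool:
    # HF -> Lit -> HF within a single test reportedly passes,
    # while the standalone Lit -> HF comparison fails.
    restored = convert_lit_to_hf(convert_hf_to_lit(hf_state_dict))
    return all(torch.equal(hf_state_dict[k], restored[k]) for k in hf_state_dict)

print(round_trip_matches({"w": torch.randn(2, 2)}))  # True for this identity sketch
```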

@rasbt
Collaborator

rasbt commented Mar 5, 2024

Thanks for looking into this! Before merging, should we run a few eval harness tasks to compare the before/after performance? E.g., hellaswag and TruthfulQA should be sufficient. If you don't have access to GPUs, I am happy to do that @Andrei-Aksionov, just let me know.
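
For what it's worth, a sketch of such a comparison using EleutherAI's lm-evaluation-harness Python API (v0.4+); the model id and task names below are assumptions and may need adjusting:

```python
import lm_eval

# Assumes an HF-compatible checkpoint; a Lit checkpoint would first need to be
# converted back to the HF format for the "hf" model backend to load it.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-2b",  # assumed model id
    tasks=["hellaswag", "truthfulqa_mc1"],
)
print(results["results"])
```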

@Andrei-Aksionov
Collaborator Author

I have access, but I'm busy with another task (dropping the interleaved QKV placement, which should unblock your OLMo integration work).
So if you already have a bash script/notebook prepared for it (meaning it won't take too much of your time), please go ahead.
If not, I can do it, but a bit later.

@rasbt
Collaborator

rasbt commented Mar 5, 2024

No worries, I can do it.

tests/test_model.py (review thread resolved)
@rasbt
Collaborator

rasbt commented Mar 5, 2024

It doesn't seem to perform differently with and without the PR:

```
# before
hellaswag acc_norm: 0.4230233021310496
truthfulqa mc1: 0.24724602203182375

# after
hellaswag acc_norm: 0.4230233021310496
truthfulqa mc1: 0.24724602203182375
```

Btw, the hellaswag score is really bad; with TinyLlama I get 0.60. Maybe it's a weird benchmark.
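
As a side note, the two GeLU variants are numerically very close, so large metric shifts would not be expected either way; a quick check (not from the PR) illustrates this:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-6, 6, steps=10_000)
exact = F.gelu(x)                       # erf-based GeLU
approx = F.gelu(x, approximate="tanh")  # tanh approximation
print((exact - approx).abs().max())     # small, well below 1e-2
```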

@carmocca
Contributor

carmocca commented Mar 5, 2024

@rasbt
Collaborator

rasbt commented Mar 5, 2024

Argh ... 🤦‍♂️. Need to find something else ... but in any case, I think the change in this PR seems to be OK!?

@carmocca carmocca merged commit f241d94 into Lightning-AI:main Mar 5, 2024
8 checks passed
@Andrei-Aksionov Andrei-Aksionov deleted the gemmamlp_gelu_approximate branch March 6, 2024 10:42
@kashif
Contributor

kashif commented Mar 7, 2024

@Andrei-Aksionov
Collaborator Author

Thanks for the link, @kashif.
I intend to examine this carefully.
