GemmaMLP uses `tanh` approximation for GeLU activation #1004
Conversation
Thank you! You'll have to xfail the HF comparison tests for now
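(A minimal sketch of how such an xfail marker might look; the test name and reason string are hypothetical, not the actual tests in this repo.)

```python
import pytest

# Hypothetical example: mark the HF comparison test as an expected failure
# until HF Transformers also switches Gemma to the tanh-approximate GeLU.
@pytest.mark.xfail(
    reason="HF Transformers Gemma still uses exact GeLU; outputs differ slightly",
    strict=False,
)
def test_gemma_against_hf():
    ...
```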
First I need to understand why.
I looked through the code maybe a dozen times and couldn't find anything wrong.
Thanks for looking into this! Before merging, should we run a few eval harness tasks to compare the before/after performance? E.g., hellaswag and TruthfulQA should be sufficient. If you don't have access to GPUs, I am happy to do that @Andrei-Aksionov, just let me know.
I have access, but I'm busy with the other task (dropping interleaved placement in QKV, which should unlock your OLMo integration work).
No worries, I can do it.
It doesn't seem to perform differently without and with the PR:

# before
hellaswag acc norm: 0.4230233021310496
truthful qa mc1: 0.24724602203182375
# after
hellaswag acc norm: 0.4230233021310496
truthful qa mc1: 0.24724602203182375

Btw the hellaswag score is really bad, with tinyllama I get 0.60. Maybe it's a weird benchmark.
Argh ... 🤦♂️. Need to find something else ... but in any case, I think the change via the PR seems to be ok!?
See also here: https://twitter.com/danielhanchen/status/1765446273661075609
Thanks for the link, @kashif.
Hi there 👋
Fixes #999
As @carmocca found out, the original Keras implementation of Gemma used the `tanh` approximation for GeLU activation, but neither the PyTorch implementation nor the HF Transformers variant had it. Recently, the official PyTorch variant was updated to also use this approximation, and this PR reflects that change.
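For reference, a minimal sketch of the difference in plain PyTorch (illustrative only, not the actual GemmaMLP code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)

# Exact GeLU: x * Phi(x), with Phi the standard normal CDF.
exact = F.gelu(x)

# tanh approximation, as in the original Keras and updated official PyTorch Gemma:
# 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
approx = F.gelu(x, approximate="tanh")

# The two differ slightly, which is why the HF comparison tests are xfailed
# until the corresponding change lands in transformers.
print((exact - approx).abs().max())
```

In a GemmaMLP-style module this would typically amount to constructing the activation as `nn.GELU(approximate="tanh")` (or the equivalent functional call above) instead of the default exact variant.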