
mDeBERTa on HuggingFace hub does not seem to work #77

Closed
MoritzLaurer opened this issue Dec 4, 2021 · 18 comments · Fixed by huggingface/transformers#24116

Comments

@MoritzLaurer

MoritzLaurer commented Dec 4, 2021

I really like the DeBERTa-v3 models, and the monolingual models work very well for me. Weirdly enough, the multilingual model uploaded to the Hugging Face hub does not seem to work. I have code for training multilingual models on XNLI, and the training normally works well (e.g. no issue with microsoft/Multilingual-MiniLM-L12-H384), but when I apply the exact same code to mDeBERTa, the model does not seem to learn anything. I don't get an error message, but the training results look like this:
[Screenshot of training results, 2021-12-04: accuracy stuck at 0.3333, training loss 0 by epoch 2, NaN validation loss]

I've double-checked by running the exact same code on multilingual-minilm and the training works, which makes me think it's not an issue in the code (wrongly formatted input data or something like that) but that something went wrong when uploading mDeBERTa to the Hugging Face hub. An accuracy of exactly 0.3333 (random), a training loss of 0 at epoch 2, and a NaN validation loss maybe indicate that the data is running through the model but some parameters are not updating, or something like that?

My environment is Google Colab; transformers==4.12.5.
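
For reference, a minimal sketch of the kind of fp16 XNLI fine-tuning setup described above (the model name is the one from the hub; the dataset handling and hyperparameters are illustrative assumptions, not the exact code from this report):

# Sketch of fp16 XNLI fine-tuning with mDeBERTa; hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "microsoft/mdeberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

dataset = load_dataset("xnli", "en")

def preprocess(batch):
    # NLI inputs are premise/hypothesis pairs with 3 labels.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

encoded = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="./mdeberta-xnli",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    fp16=True,  # mixed precision: this is the setting that triggers the behaviour above
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()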

@BigBird01
Contributor

For mDeBERTa, you need to use fp32. There is a fix in our official repo and we are going to port the fix to transformers soon.
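
In Trainer terms, this workaround amounts to simply not enabling mixed precision (a sketch; the argument names assume the standard TrainingArguments API):

from transformers import TrainingArguments

# Workaround until the fix is ported: keep mDeBERTa in full fp32,
# i.e. do not enable fp16 mixed-precision training.
args = TrainingArguments(
    output_dir="./mdeberta-xnli-fp32",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    fp16=False,  # the default, shown explicitly for clarity
)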

@MoritzLaurer
Author

Cool, does this mean that after the fix I can use fp16 as well?

@MoritzLaurer
Author

Is there an update on this? I don't think an update was pushed to the Hugging Face hub: https://huggingface.co/microsoft/mdeberta-v3-base/commits/main

It would be great to be able to use it with fp16.

@jtomek

jtomek commented Mar 30, 2022

Have you figured it out, guys?

@abdullahmuaad9

ValueError: Tokenizer class DebertaV2Tokenizer does not exist or is not currently imported.
Any ideas? Please share.
Thanks in advance.

@barschiiii

@BigBird01 do you have any update on this by chance?

@jaideep11061982

@jtomek @abdullahmuaad9 @BigBird01 @MoritzLaurer @barschiiii
Any fix for this? I get the NaN with mDeBERTa.

@rfbr

rfbr commented Apr 4, 2023

Hello team!
Is there any update on this? @jtomek @abdullahmuaad9 @BigBird01 @MoritzLaurer @barschiiii @jaideep11061982
Thanks!

@abdullahmuaad9

abdullahmuaad9 commented Apr 5, 2023 via email

@rfbr

rfbr commented Apr 5, 2023

I pinged you just in case you were interested in the eventual answer from the Microsoft team on the possibility of using fp16 with mDeBERTa.

@rfbr

rfbr commented Apr 9, 2023

Hello there!
I have tracked the different modules to find where the under/overflows are happening. The DisentangledSelfAttention module is the culprit; replacing it with the implementation in this repo fixed the issue (I haven't spent the time to find the specific operation causing the NaN).
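
One way to do this kind of tracking (a sketch of the general technique, not necessarily the exact approach used here; add_nan_hooks is just an illustrative helper name) is to register forward hooks and report the first module whose output contains NaN or inf:

import torch

def add_nan_hooks(model):
    # Register a forward hook on every submodule; report any module whose
    # output contains NaN or inf so the over/underflow can be localised.
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"Non-finite values in output of {name} ({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: call add_nan_hooks(model) before running a forward pass under fp16.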

@sjrl

sjrl commented May 7, 2023

Hey @rfbr, I tried updating the DisentangledSelfAttention module in HF transformers with the one in this repo, but when fine-tuning on extractive QA (on SQuAD 2.0) with fp16 I was still getting NaN predictions. Do you have an example implementation in the transformers code I could look at?

Update: Actually, it seems I got it to work. It appears the key was calculating the scale like this (using the math library):

scale = 1/math.sqrt(query_layer.size(-1)*scale_factor)
attention_scores = torch.bmm(query_layer, key_layer.transpose(-1, -2)*scale)

instead of what's implemented in transformers:
https://github.com/huggingface/transformers/blob/ef42c2c487260c2a0111fa9d17f2507d84ddedea/src/transformers/models/deberta_v2/modeling_deberta_v2.py#L724-L725

scale = torch.sqrt(torch.tensor(query_layer.size(-1), dtype=torch.float) * scale_factor)
attention_scores = torch.bmm(query_layer, key_layer.transpose(-1, -2)) / scale.to(dtype=query_layer.dtype)

which uses only torch operations.

Could this be because we aren't calling something like detach in the transformers code? Or maybe it has to do with the order of operations (e.g. performing the division before the multiplication, as is done in this repo)?
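
If the order of operations is indeed the cause, the mechanism would be fp16 overflow: the raw attention logits can exceed the fp16 maximum (about 65504) before the division ever happens, whereas folding the scale into one operand first keeps the intermediates in range. A tiny sketch with made-up, exaggerated numbers:

import math
import torch

d, scale_factor = 64, 3
scale = math.sqrt(d * scale_factor)

# Made-up, exaggerated activations so the overflow is easy to see (fp16 max ~65504).
q = torch.full((d,), 120.0).half()
k = torch.full((d,), 90.0).half()

# transformers-style order: full dot product in fp16 first, divide afterwards.
# 64 * 120 * 90 = 691200 already overflows fp16 before the division happens.
divide_after = (q * k).sum() / scale
print(divide_after)   # inf

# this-repo-style order: fold 1/sqrt(d * scale_factor) into the key before the product,
# so the intermediate values never leave the fp16 range.
scale_first = (q * (k * (1 / scale))).sum()
print(scale_first)    # ~49880, finite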

@jplu

jplu commented May 30, 2023

Hey! Is there any update on this, @BigBird01? I'm using the latest version of transformers (4.29.2) and I'm still facing the same issue when using fp16. When will you port the fix?

Thanks.

@sjrl

sjrl commented May 30, 2023

Hey @jplu, I think I was able to port the changes into my forked branch of transformers here. If you'd just like to see the git diff so you can try the same, take a look here. I did this by comparing the implementation in this repo to the one in transformers.

Doing this I was able to get fp16 training working in transformers.

@jplu

jplu commented May 31, 2023

Hey @sjrl! Thanks a lot for sharing this. Indeed, I confirm that with your code I am able to train with fp16. Did you open a PR on the main repo? If not, it would be nice to have this fix integrated.

@sjrl

sjrl commented Jun 8, 2023

@jplu Just opened the PR! I took some time to find the minimal changes needed to get the fp16 training to work. Hopefully that will speed up the review process.

@jplu

jplu commented Jun 8, 2023

Awesome this seems perfect! Thanks a lot!

@jtomek

jtomek commented Jun 9, 2023

This is honestly perfect, @sjrl. What a clever way to solve the problem! 🤩
