fix #144 for defaulting case, e.g. SciBERT without tokenizer config on HuggingFace #156

Merged: 2 commits into master from fix-144, Jan 19, 2023

Conversation

@kermitt2 (Owner)

Infer casing from the model name when available. This prevents BERT models on Hugging Face that have no tokenizer config from defaulting to do_lower_case=True even when they are "cased" models.
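
For illustration, the inference amounts to something like the following (a minimal sketch with a hypothetical helper name, not the exact DeLFT code):

from typing import Optional

def infer_do_lower_case(model_name: str) -> Optional[bool]:
    """Infer the casing flag from a Hugging Face model name, if possible."""
    name = model_name.lower()  # tolerate names written with uppercase letters
    if "uncased" in name:
        return True
    # checked after "uncased", since "uncased" contains the substring "cased"
    if "cased" in name:
        return False
    return None  # no hint in the name; fall back to the tokenizer config

print(infer_do_lower_case("allenai/scibert_scivocab_cased"))  # False
print(infer_do_lower_case("bert-base-uncased"))               # True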

@kermitt2 self-assigned this on Jan 18, 2023
@kermitt2 (Owner, Author)

Note that when the DeLFT model is saved, the transformer tokenizer is now also saved correctly.

For example, with the problematic allenai/scibert_scivocab_cased, the saved model's transformer-tokenizer/tokenizer_config.json contains "do_lower_case": false, as expected.
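
This can be checked outside DeLFT with the standard transformers API (a sketch; the casing flag is passed explicitly here, which is what the fix effectively arranges, since the hub repository ships no tokenizer config):

import json
from transformers import AutoTokenizer

# Load the tokenizer with the casing inferred from the model name,
# since allenai/scibert_scivocab_cased has no tokenizer config on the hub.
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/scibert_scivocab_cased", do_lower_case=False
)
tokenizer.save_pretrained("transformer-tokenizer")

# The saved tokenizer_config.json records the casing correctly.
with open("transformer-tokenizer/tokenizer_config.json") as f:
    print(json.load(f).get("do_lower_case"))  # expected: False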

@lfoppiano (Collaborator) left a comment

Looks good. I just added a small change to lowercase the name, in case it is written with uppercase letters (unlikely, but worth handling).

@lfoppiano added the enhancement label on Jan 19, 2023
@kermitt2 (Owner, Author)

For reference, even with a model whose task is not very case-sensitive, the fix improves the results noticeably in 10-fold cross-validation:

python3 delft/applications/grobidTagger.py citation train_eval  --architecture BERT  --transformer allenai/scibert_scivocab_cased --input data/sequenceLabelling/grobid/citation/citation-231022.train --fold-count 10            

BEFORE CASE FIX

Average over 10 folds
                   precision  recall  f1-score
all (micro avg.)      0.9224  0.9479    0.9350

AFTER CASE FIX

Average over 10 folds
                   precision  recall  f1-score
all (micro avg.)      0.9307  0.9526    0.9415

@kermitt2 added the bug label on Jan 19, 2023
@kermitt2 merged commit 88f084d into master on Jan 19, 2023
@lfoppiano (Collaborator) commented on Jan 19, 2023

Test results, averaged over 5 training and evaluation runs on a holdout dataset:

Superconductors:

| Label | Scibert | Scibert fixed | mattpu-scibert-initcheckpoint | mattpu-scibert-initcheckpoint fixed |
|---|---:|---:|---:|---:|
| <class> | 72.74 | 71.08 | 71.37 | 71.65 |
| <material> | 78.84 | 80.04 | 78.82 | 80.17 |
| <me_method> | 66.75 | 65.88 | 65.42 | 65.73 |
| <pressure> | 42.04 | 47.15 | 43.66 | 44.62 |
| <tc> | 79.02 | 78.96 | 78.52 | 78.55 |
| <tcValue> | 78.74 | 78.10 | 78.46 | 78.63 |
| All (micro avg.) | 76.30 | 76.69 | 75.89 | 76.70 |

Quantities:

| Label | Scibert | Scibert fixed | mattpu-scibert-initcheckpoint | mattpu-scibert-initcheckpoint fixed |
|---|---:|---:|---:|---:|
| <unitLeft> | 93.09 | 91.80 | 92.14 | 92.66 |
| <unitRight> | 22.99 | 32.71 | 28.02 | 32.77 |
| <valueAtomic> | 81.89 | 86.55 | 86.13 | 87.06 |
| <valueBase> | 96.50 | 94.13 | 91.66 | 92.96 |
| <valueLeast> | 80.70 | 75.01 | 75.60 | 74.99 |
| <valueList> | 71.36 | 57.90 | 54.48 | 51.86 |
| <valueMost> | 78.61 | 72.41 | 73.59 | 72.51 |
| <valueRange> | 100.00 | 95.52 | 94.40 | 94.12 |
| All (micro avg.) | 84.82 | 85.15 | 85.09 | 85.53 |

@kermitt2 changed the title from "fix #144 for defaulting case" to "fix #144 for defaulting case, e.g. SciBERT without tokenizer config on HuggingFace" on Jan 19, 2023
@lfoppiano deleted the fix-144 branch on March 16, 2023