fix #144 for defaulting case, e.g. SciBERT without tokenizer config on HuggingFace #156

Merged: 2 commits into master from fix-144, Jan 19, 2023

Conversation

@kermitt2 (Owner)

Infer casing from the model name when available. This prevents BERT models on Hugging Face that have no tokenizer config from defaulting to do_lower_case=True even when they are "cased" models.
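
For illustration, the inference amounts to something like the following (a minimal sketch with a hypothetical helper name, not the exact DeLFT code):

from typing import Optional

def infer_do_lower_case(model_name: str) -> Optional[bool]:
    """Infer the casing flag from a Hugging Face model name, if possible."""
    name = model_name.lower()  # tolerate names written with uppercase letters
    if "uncased" in name:
        return True
    # checked after "uncased", since "uncased" contains the substring "cased"
    if "cased" in name:
        return False
    return None  # no hint in the name; fall back to the tokenizer config

print(infer_do_lower_case("allenai/scibert_scivocab_cased"))  # False
print(infer_do_lower_case("bert-base-uncased"))               # True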

@kermitt2 self-assigned this on Jan 18, 2023
@kermitt2 (Owner, Author)

Note that when the DeLFT model is saved, the transformer tokenizer is now also saved correctly.

For example, with the problematic allenai/scibert_scivocab_cased, the saved model's transformer-tokenizer/tokenizer_config.json contains "do_lower_case": false, as expected.
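
This can be checked outside DeLFT with the standard transformers API (a sketch; the casing flag is passed explicitly here, which is what the fix effectively arranges, since the hub repository ships no tokenizer config):

import json
from transformers import AutoTokenizer

# Load the tokenizer with the casing inferred from the model name,
# since allenai/scibert_scivocab_cased has no tokenizer config on the hub.
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/scibert_scivocab_cased", do_lower_case=False
)
tokenizer.save_pretrained("transformer-tokenizer")

# The saved tokenizer_config.json records the casing correctly.
with open("transformer-tokenizer/tokenizer_config.json") as f:
    print(json.load(f).get("do_lower_case"))  # expected: False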

@lfoppiano (Collaborator) left a comment

Looks good. I just added a small change to lowercase the name, in case it is written with uppercase letters (unlikely, but worth handling).

@lfoppiano added the enhancement label on Jan 19, 2023
@kermitt2 (Owner, Author)

For reference, even with a model whose task is not very case-sensitive, the fix improves the results noticeably in 10-fold cross-validation:

python3 delft/applications/grobidTagger.py citation train_eval  --architecture BERT  --transformer allenai/scibert_scivocab_cased --input data/sequenceLabelling/grobid/citation/citation-231022.train --fold-count 10            

BEFORE CASE FIX

Average over 10 folds
                   precision  recall  f1-score
all (micro avg.)      0.9224  0.9479    0.9350

AFTER CASE FIX

Average over 10 folds
                   precision  recall  f1-score
all (micro avg.)      0.9307  0.9526    0.9415

@kermitt2 added the bug label on Jan 19, 2023
@kermitt2 merged commit 88f084d into master on Jan 19, 2023
@lfoppiano (Collaborator) commented on Jan 19, 2023

Test results, averaged over 5 training and evaluation runs on a holdout dataset:

Superconductors:

| Label | Scibert | Scibert fixed | mattpu-scibert-initcheckpoint | mattpu-scibert-initcheckpoint fixed |
|---|---:|---:|---:|---:|
| <class> | 72.74 | 71.08 | 71.37 | 71.65 |
| <material> | 78.84 | 80.04 | 78.82 | 80.17 |
| <me_method> | 66.75 | 65.88 | 65.42 | 65.73 |
| <pressure> | 42.04 | 47.15 | 43.66 | 44.62 |
| <tc> | 79.02 | 78.96 | 78.52 | 78.55 |
| <tcValue> | 78.74 | 78.10 | 78.46 | 78.63 |
| All (micro avg.) | 76.30 | 76.69 | 75.89 | 76.70 |

Quantities:

| Label | Scibert | Scibert fixed | mattpu-scibert-initcheckpoint | mattpu-scibert-initcheckpoint fixed |
|---|---:|---:|---:|---:|
| <unitLeft> | 93.09 | 91.80 | 92.14 | 92.66 |
| <unitRight> | 22.99 | 32.71 | 28.02 | 32.77 |
| <valueAtomic> | 81.89 | 86.55 | 86.13 | 87.06 |
| <valueBase> | 96.50 | 94.13 | 91.66 | 92.96 |
| <valueLeast> | 80.70 | 75.01 | 75.60 | 74.99 |
| <valueList> | 71.36 | 57.90 | 54.48 | 51.86 |
| <valueMost> | 78.61 | 72.41 | 73.59 | 72.51 |
| <valueRange> | 100.00 | 95.52 | 94.40 | 94.12 |
| All (micro avg.) | 84.82 | 85.15 | 85.09 | 85.53 |

@kermitt2 changed the title from "fix #144 for defaulting case" to "fix #144 for defaulting case, e.g. SciBERT without tokenizer config on HuggingFace" on Jan 19, 2023
@lfoppiano deleted the fix-144 branch on March 16, 2023