
clean_up_tokenization_spaces=False if unset #31938

Merged: ArthurZucker merged 10 commits into main from clean_up_tokenization_spaces_false_default on Sep 26, 2024

Conversation

itazap (Collaborator) commented Jul 12, 2024

FUTURE DEPRECATION

fixes #31884

Start of deprecating clean_up_tokenization_spaces. Right now it defaults to True; this PR updates it to default to False.

  • BERT-based models need clean_up_tokenization_spaces=True, so this is set in the class init (see the sketch below)
  • some models, like wav2vec2, needed test updates since they don't really expect clean_up_tokenization_spaces=True
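
A minimal sketch (not the exact PR diff) of how a BERT-style tokenizer can keep the old behaviour by pinning clean_up_tokenization_spaces=True in its own __init__ while the base-class default flips to False; the class name here is hypothetical:

    from transformers import PreTrainedTokenizer

    class BertLikeTokenizer(PreTrainedTokenizer):  # hypothetical subclass
        def __init__(self, clean_up_tokenization_spaces=True, **kwargs):
            # Pinning True here preserves the pre-deprecation behaviour for
            # this model even after the base-class default becomes False.
            super().__init__(
                clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                **kwargs,
            )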

@ArthurZucker

itazap requested a review from ArthurZucker July 12, 2024 16:53
HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

itazap force-pushed the clean_up_tokenization_spaces_false_default branch 4 times, most recently from e689ec3 to ec6f78a, July 25, 2024 08:23
@@ -4247,52 +4247,6 @@ def test_save_slow_from_fast_and_reload_fast(self):
         # Should not raise an error
         self.rust_tokenizer_class.from_pretrained(tmp_dir_2)

-    # TODO This is ran for all models but only tests bert...
-    def test_clean_up_tokenization_spaces(self):
-        tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
itazap (Collaborator, Author) commented Jul 25, 2024:

This only ever tested Bert, so I don't think it's valuable to keep, or to update to run for each model, because the behaviour differs with special tokens and might have to be customized for some models. Since we will deprecate the flag, I don't think it's useful to start maintaining this test for all models now!
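
For reference, a minimal sketch of the behaviour that test exercised (the decode outputs are illustrative, not the deleted test body):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
    ids = tokenizer.encode("hello, world!")
    # Without cleanup, tokens are joined with spaces, including before punctuation.
    print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))
    # e.g. "[CLS] hello , world ! [SEP]"
    # With cleanup, the spaces before punctuation are removed.
    print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))
    # e.g. "[CLS] hello, world! [SEP]"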

itazap marked this pull request as ready for review July 26, 2024 10:29
itazap requested a review from ArthurZucker July 26, 2024 10:30
@@ -79,7 +79,7 @@ def get_tokenizer(self, **kwargs):
# Copied from transformers.tests.models.gpt2.test_tokenization_gpt2.GPT2TokenizationTest.get_input_output_texts
def get_input_output_texts(self, tokenizer):
         input_text = "lower newer"
-        output_text = "lower newer"
+        output_text = "lower[SPACE]newer"
itazap (Collaborator, Author) commented:

The test sets up a small vocab here, so this should be the expected behaviour. See the unmodified test ClvpTokenizationTest.test_full_tokenizer for an example where [SPACE] was expected.

-        self.assertEqual(batch_tokens, ["HELLO<unk>!?!?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
-        self.assertEqual(batch_tokens_2, ["HELO!?!?<new_tokens>", "BYE BYE<new_tokens>"])
+        self.assertEqual(batch_tokens, ["HELLO<unk>!? !?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
+        self.assertEqual(batch_tokens_2, ["HELO!? !?<new_tokens>", "BYE BYE<new_tokens>"])
itazap (Collaborator, Author) commented:

The tokenizer.word_delimiter_token is replaced with a space " ". See:

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """
        Converts connectionist-temporal-classification (CTC) output tokens into a single string.
        """
        ...
        # replace delimiter token
        string = "".join([" " if token == self.word_delimiter_token else token for token in filtered_tokens]).strip()

        if self.do_lower_case:
            string = string.lower()

        return string
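
A tiny illustration of the delimiter replacement above, with hypothetical CTC-filtered tokens and word_delimiter_token="|" (wav2vec2's default):

    filtered_tokens = ["B", "Y", "E", "|", "B", "Y", "E"]
    word_delimiter_token = "|"
    # The delimiter token becomes a space; everything else is concatenated.
    string = "".join(" " if t == word_delimiter_token else t for t in filtered_tokens).strip()
    print(string)  # -> "BYE BYE"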
        

Collaborator commented:

We should not have to do this! Maybe clean_up_tokenization_spaces should be True for wav2vec2, no?

itazap (Collaborator, Author) commented Sep 26, 2024:

I thought so too, but the sample_ids include tokenizer.word_delimiter_token_id, which maps to a space " ", so I think the space would be expected in the output? wdyt @ArthurZucker
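
For illustration, a minimal re-implementation (a sketch, not the library code) of the punctuation-space cleanup that clean_up_tokenization_spaces toggles, showing why "!? !?" collapses to "!?!?" when cleanup is on:

    def clean_up_tokenization(out_string: str) -> str:
        # Remove the spaces that token joining inserts before common punctuation.
        for before, after in [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")]:
            out_string = out_string.replace(before, after)
        return out_string

    print(clean_up_tokenization("HELO!? !?"))  # -> "HELO!?!?"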

ArthurZucker (Collaborator) left a review comment:

Looks good, let's make sure we keep the default at True for now!

src/transformers/tokenization_utils_base.py (outdated, resolved)
src/transformers/tokenization_utils_base.py (resolved)
itazap force-pushed the clean_up_tokenization_spaces_false_default branch from 45b6a0e to b7b2b09, September 6, 2024 11:29
rishi23root commented:

I believe the issue is:

if "clean_up_tokenization_spaces" not in kwargs:
            warnings.warn(
                "`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This "
                "behavior will be depracted in transformers v4.45, and will be then set to `False` by default. "
                "For more details check this issue: https://github.com/huggingface/transformers/issues/31884",
                FutureWarning,
            )

# By default, cleaning tokenization spaces for both fast and slow tokenizers
self.clean_up_tokenization_spaces = kwargs.pop("clean_up_tokenization_spaces", True)

It's being set by default, but I don't understand where to update it to remove the warning completely.
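
A sketch of one way to avoid the warning from the caller's side: pass clean_up_tokenization_spaces explicitly, so the `not in kwargs` check above never fires (the model name is just an example):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained(
        "google-bert/bert-base-uncased",
        clean_up_tokenization_spaces=True,  # explicit value, so no FutureWarning
    )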

itazap (Collaborator, Author) commented Sep 14, 2024:

@rishi23root I'll update the message!

itazap requested a review from ArthurZucker September 20, 2024 09:21
-        self.assertEqual(batch_tokens, ["HELLO<unk>!?!?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
-        self.assertEqual(batch_tokens_2, ["HELO!?!?<new_tokens>", "BYE BYE<new_tokens>"])
+        self.assertEqual(batch_tokens, ["HELLO<unk>!? !?<new_tokens>$$$", "BYE BYE<unk><new_tokens>$$$"])
+        self.assertEqual(batch_tokens_2, ["HELO!? !?<new_tokens>", "BYE BYE<new_tokens>"])
Collaborator commented:

We should not have to do this! Maybe clean_up_tokenization_spaces should be True for wav2vec2, no?

Comment on lines 1611 to 1614
    warnings.warn(
        "The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.",
        FutureWarning,
    )
Collaborator commented:

Let's not warn, we won't remove it!

Suggested change (remove these lines):
-    warnings.warn(
-        "The `clean_up_tokenization_spaces` argument will soon be deprecated. It currently defaults to False if not passed.",
-        FutureWarning,
-    )

ArthurZucker merged commit 6730485 into main Sep 26, 2024
21 of 24 checks passed
ArthurZucker deleted the clean_up_tokenization_spaces_false_default branch September 26, 2024 17:38
ArthurZucker pushed a commit that referenced this pull request Sep 26, 2024
* clean_up_tokenization_spaces=False if unset

* deprecate warning

* updating param for old models

* update models

* make fix-copies

* fix-copies and update bert models

* warning msg

* update prophet and clvp

* updating test since space before is arbitrarily removed

* remove warning for 4.45
itazap mentioned this pull request Sep 27, 2024
BenjaminBossan pushed a commit to BenjaminBossan/transformers that referenced this pull request Sep 30, 2024
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
Successfully merging this pull request may close these issues: [BUG] GPT-2 tokenizer is NOT invertible (#31884)