
Please future-proof clean_up_tokenization_spaces #2922

Open
PhorstenkampFuzzy opened this issue Sep 3, 2024 · 28 comments

@PhorstenkampFuzzy

This is the FutureWarning we are currently receiving:

transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884

@PhorstenkampFuzzy changed the title from "Please futur prove clean_up_tokenization_spaces" to "Please future prove clean_up_tokenization_spaces" on Sep 3, 2024
@pesuchin (Contributor) commented Sep 5, 2024

This warning occurs when the kwargs passed to AutoTokenizer.from_pretrained do not include a clean_up_tokenization_spaces value:
https://github.com/huggingface/transformers/blob/47b096412da9cbeb9351806e9f0eb70a693b2859/src/transformers/tokenization_utils_base.py#L1601-L1607

To prevent this warning from being issued, clean_up_tokenization_spaces needs to be added to all AutoTokenizer.from_pretrained calls used within sentence_transformers.
Currently the default value is True, so specifying clean_up_tokenization_spaces=True avoids the warning.

For example, we can confirm that the warning is being generated in examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py.

$ python examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
/Users/username/project/sentence-transformers/.venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:1600: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(

To avoid the warning in this process, clean_up_tokenization_spaces=True needs to be specified in the tokenizer_kwargs in train_stsb_tsdae.py, and the three internal DenoisingAutoEncoderLoss tokenizer calls need to be modified; a rough sketch follows.
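A rough sketch of this fix; "bert-base-uncased" is only a placeholder model ID, and it assumes a sentence-transformers version where SentenceTransformer exposes tokenizer_kwargs and forwards it to AutoTokenizer.from_pretrained:

```python
# Sketch: set the flag explicitly so transformers never hits the
# "was not set" warning path.
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

# Directly with transformers:
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    clean_up_tokenization_spaces=True,  # today's default, stated explicitly
)

# Through Sentence Transformers, which forwards tokenizer_kwargs to
# AutoTokenizer.from_pretrained when building the Transformer module:
model = SentenceTransformer(
    "bert-base-uncased",
    tokenizer_kwargs={"clean_up_tokenization_spaces": True},
)
```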

The warning message states that the default value will be changed to False in future versions, but according to this pull request, clean_up_tokenization_spaces=True seems to be necessary for BERT-based models, so it's unlikely that the default value will actually be changed to False for them:

huggingface/transformers#31938

Therefore, it seems that no action is needed for BERT-based models.
However, one concern is that with the current behavior of sentence_transformers, a warning will be issued unless the user explicitly specifies clean_up_tokenization_spaces=True.

I'm not sure about the full scope of the impact, so I can't say whether this is the right response, but I believe this approach would avoid the warning.

@ArthurZucker

cc @tomaarsen; if you need insight on that, let me know!

@tomaarsen (Collaborator) commented Sep 10, 2024

@ArthurZucker I'm considering following @pesuchin's recommendation and adding clean_up_tokenization_spaces=True to avoid the warnings, but I'm very wary that hardcoding this option would create incompatibilities if some future transformers models are trained with clean_up_tokenization_spaces=False. If such a model is then loaded into Sentence Transformers (with clean_up_tokenization_spaces=True), the model suddenly performs a lot worse. (I'm assuming here that the tokenization spaces affect the tokens.)

I think hardcoding clean_up_tokenization_spaces=True would eventually break because of this. Would love to hear what you think.
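To make that concern concrete: kwargs passed to from_pretrained override whatever the checkpoint saved in tokenizer_config.json. A minimal sketch, using "bert-base-uncased" as a stand-in for a hypothetical future checkpoint:

```python
# Illustration: a hardcoded kwarg silently overrides the value the model
# author saved with the checkpoint.
from transformers import AutoTokenizer

# Simulate a future model whose author disabled the cleanup on purpose:
tok = AutoTokenizer.from_pretrained(
    "bert-base-uncased", clean_up_tokenization_spaces=False
)
tok.save_pretrained("future-model")  # tokenizer_config.json now stores False

# A library that hardcodes True overrides that saved choice on load:
reloaded = AutoTokenizer.from_pretrained(
    "future-model", clean_up_tokenization_spaces=True
)
print(reloaded.clean_up_tokenization_spaces)  # True, not the author's False
```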

cc @itazap

  • Tom Aarsen

@itazap commented Sep 11, 2024

Hey! This is the upcoming PR that deprecates the default to False: huggingface/transformers#31938. It will keep clean_up_tokenization_spaces=True for models that require it, such as BERT-based ones and some others (see the modified files in the PR).

In terms of future models, the clean_up_tokenization_spaces function itself arbitrarily strips whitespace (post-tokenization), so I would say it is good practice for future models to have it explicitly set if that is the intention, or better yet to make it part of the tokenize logic directly. Let me know what you think 😄
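For reference, a small sketch of the effect described above ("bert-base-uncased" is just a placeholder checkpoint):

```python
# The flag controls whether decode() strips the extra spaces that
# punctuation splitting inserts during tokenization.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tok("Don't stop, it works!")["input_ids"]

print(tok.decode(ids, skip_special_tokens=True,
                 clean_up_tokenization_spaces=False))
# don ' t stop , it works !

print(tok.decode(ids, skip_special_tokens=True,
                 clean_up_tokenization_spaces=True))
# don't stop, it works!
```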

@pradip292

How do I solve this error, please?

@itazap commented Sep 16, 2024

@pradip292 what error are you experiencing? This warning is expected, in order to communicate the future deprecation.

@pradip292 commented Sep 16, 2024

> @pradip292 what error are you experiencing? This warning is expected, in order to communicate the future deprecation.

After this warning, my Streamlit app automatically stops. The warning is:

FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(

@itazap commented Sep 17, 2024

@pradip292 Can you paste the error? Perhaps your Streamlit app needs to suppress warnings to render, but the presence of the warning shouldn't result in an error.
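For anyone who only wants the message gone, a minimal sketch that suppresses just this FutureWarning (a workaround, not a fix; run it before loading the tokenizer):

```python
# Silence only this specific FutureWarning; all other warnings still show.
import warnings

warnings.filterwarnings(
    "ignore",
    message=".*clean_up_tokenization_spaces.*",
    category=FutureWarning,
)
```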

@pradip292

> @pradip292 Can you paste the error? Perhaps your Streamlit app needs to suppress warnings to render, but the presence of the warning shouldn't result in an error.

I did that, but I am still facing the same error.

@PhorstenkampFuzzy (Author)

I am getting the warning, but my application is working perfectly fine.
Are you sure your problem is related to the warning?
@pradip292

@pradip292

> I am getting the warning, but my application is working perfectly fine. Are you sure your problem is related to the warning? @pradip292

I will check and update you after some time.

@SDArtz commented Sep 18, 2024

![Screenshot 2024-09-18 at 9 59 14 AM](https://github.com/user-attachments/assets/e834cb71-9228-48a1-8c8d-805249ae31b5)

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(

@itazap commented Sep 18, 2024

@SDArtz @pradip292 thanks for providing the output. This output is an expected warning that we want to display; it is not an error.

@pradip292

> @SDArtz @pradip292 thanks for providing the output. This output is an expected warning that we want to display; it is not an error.

Now I am facing another error. I have tried many options, but it still shows this error. What should I do?

raise SSLError(e, request=request)
requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /sentence-transformers/all-mpnet-base-v2/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')))"), '(Request ID: 52c928fd-ec37-4b18-ab6a-5f11209595a2)')

@tomaarsen (Collaborator)

@pradip292 Are you able to browse to this URL: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/config.json ?

It seems that you were (temporarily or otherwise) unable to automatically download this file, which is required to initialize the model. It is unrelated to the clean_up_tokenization_spaces warning.
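A quick way to test that outside of Sentence Transformers, as a sketch (assumes the requests package is installed):

```python
# Fetch the same config.json the model loader needs; a 200 status code
# means the file is reachable from this machine.
import requests

url = ("https://huggingface.co/sentence-transformers/"
       "all-mpnet-base-v2/resolve/main/config.json")
print(requests.get(url, timeout=10).status_code)
```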

  • Tom Aarsen

@tomaarsen (Collaborator)

@itazap @ArthurZucker

> Hey! This is the upcoming PR that deprecates the default to False: huggingface/transformers#31938. It will keep clean_up_tokenization_spaces=True for models that require it, such as BERT-based ones and some others (see the modified files in the PR).
>
> In terms of future models, the clean_up_tokenization_spaces function itself arbitrarily strips whitespace (post-tokenization), so I would say it is good practice for future models to have it explicitly set if that is the intention, or better yet to make it part of the tokenize logic directly. Let me know what you think 😄

Thanks for the answer! This sounds like I should indeed defer to transformers and not hardcode anything in Sentence Transformers, on the understanding that you will keep it True for models that require it (and thus prevent any breaking changes). Please correct me if I'm wrong.
In that case, should I just wait for a new transformers version where the deprecation is merged and released?

  • Tom Aarsen

@pradip292

> It seems that you were (temporarily or otherwise) unable to automatically download this file, which is required to initialize the model. It is unrelated to the clean_up_tokenization_spaces warning.

Yes, I am facing that issue on my laptop only; it worked on my friends' laptops. I don't know how to deal with it. I have downloaded all the SSL files, but it still fails. Please help.

@itazap commented Sep 20, 2024

@tomaarsen yes, exactly: the deprecation will maintain True for models that require it! 😊

@Calabrone76

> requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /sentence-transformers/all-mpnet-base-v2/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')))"), '(Request ID: 52c928fd-ec37-4b18-ab6a-5f11209595a2)')

This is probably caused by a firewall (man in the middle) that replaces SSL certificates. We had similar errors with our company firewall. Check this out.

@keshavvarma

> Yes, I am facing that issue on my laptop only; it worked on my friends' laptops. I don't know how to deal with it. I have downloaded all the SSL files, but it still fails. Please help.

I am also facing the same issue on my laptop and don't know how to solve it.

@pradip292

Issue is solved :-)

@keshavvarma

> Issue is solved :-)

What is the solution? Can you please let me know?

@pradip292

> What is the solution? Can you please let me know?

Actually, that model is not working, so I am using another Hugging Face model; there are many different models out there. As for the SSL errors, I just needed to download some certificate files the code uses that were missing. I got help from ChatGPT and was able to solve the error. (Sorry for my English, I am a student.)

@keshavvarma

> Actually, that model is not working, so I am using another Hugging Face model; there are many different models out there. As for the SSL errors, I just needed to download some certificate files the code uses that were missing. I got help from ChatGPT and was able to solve the error.

Okay, I will try it out, thanks!!

@RisingInsight

I simply added clean_up_tokenization_spaces=False directly in the BertTokenizer call:

    self.tokenizer = BertTokenizer.from_pretrained(pretrained_bert_name, clean_up_tokenization_spaces=False)

@PhorstenkampFuzzy (Author)

Any news on a solution for the original issue?

@keshavvarma

Not yet; for the time being I changed the vector database to FAISS and the Groq model to llama3-8b-8192.
