
Please future-proof clean_up_tokenization_spaces #2922

Open
PhorstenkampFuzzy opened this issue Sep 3, 2024 · 28 comments

@PhorstenkampFuzzy

This is the FutureWarning we are currently receiving:

transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884

@PhorstenkampFuzzy changed the title from "Please futur prove clean_up_tokenization_spaces" to "Please future prove clean_up_tokenization_spaces" on Sep 3, 2024
@pesuchin (Contributor) commented Sep 5, 2024

This warning occurs when the kwargs passed to AutoTokenizer.from_pretrained do not include a clean_up_tokenization_spaces value:
https://github.com/huggingface/transformers/blob/47b096412da9cbeb9351806e9f0eb70a693b2859/src/transformers/tokenization_utils_base.py#L1601-L1607

To prevent this warning from being issued, clean_up_tokenization_spaces needs to be added to all AutoTokenizer.from_pretrained calls used within sentence_transformers.
Currently the default value is True, so specifying clean_up_tokenization_spaces=True avoids the warning.

For example, we can confirm that the warning is being generated in examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py.

$ python examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
/Users/username/project/sentence-transformers/.venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:1600: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(

To avoid the warning in this process, clean_up_tokenization_spaces=True needs to be specified in the tokenizer_kwargs in train_stsb_tsdae.py, and the three internal DenoisingAutoEncoderLoss tokenizer calls need to be modified; a rough sketch follows.
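A rough sketch of this fix; "bert-base-uncased" is only a placeholder model ID, and it assumes a sentence-transformers version where SentenceTransformer exposes tokenizer_kwargs and forwards it to AutoTokenizer.from_pretrained:

```python
# Sketch: set the flag explicitly so transformers never hits the
# "was not set" warning path.
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

# Directly with transformers:
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    clean_up_tokenization_spaces=True,  # today's default, stated explicitly
)

# Through Sentence Transformers, which forwards tokenizer_kwargs to
# AutoTokenizer.from_pretrained when building the Transformer module:
model = SentenceTransformer(
    "bert-base-uncased",
    tokenizer_kwargs={"clean_up_tokenization_spaces": True},
)
```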

The warning message states that the default value will be changed to False in future versions, but according to this pull request, clean_up_tokenization_spaces=True seems to be necessary for BERT-based models, so it's unlikely that the default value will actually be changed to False for them:

huggingface/transformers#31938

Therefore, it seems that no action is needed for BERT-based models.
However, one concern is that with the current behavior of sentence_transformers, a warning will be issued unless the user explicitly specifies clean_up_tokenization_spaces=True.

I'm not sure about the full scope of the impact, so I can't say whether this is the right response, but I believe this approach would avoid the warning.

@ArthurZucker

cc @tomaarsen; if you need insight on that, let me know!

@tomaarsen (Collaborator) commented Sep 10, 2024

@ArthurZucker I'm considering following @pesuchin's recommendation and adding clean_up_tokenization_spaces=True to avoid the warnings, but I'm very wary that hardcoding this option would create incompatibilities if some future transformers models are trained with clean_up_tokenization_spaces=False. If such a model is then loaded into Sentence Transformers (with clean_up_tokenization_spaces=True), the model suddenly performs a lot worse. (I'm assuming here that the tokenization spaces affect the tokens.)

I think hardcoding clean_up_tokenization_spaces=True would eventually break because of this. Would love to hear what you think.
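To make that concern concrete: kwargs passed to from_pretrained override whatever the checkpoint saved in tokenizer_config.json. A minimal sketch, using "bert-base-uncased" as a stand-in for a hypothetical future checkpoint:

```python
# Illustration: a hardcoded kwarg silently overrides the value the model
# author saved with the checkpoint.
from transformers import AutoTokenizer

# Simulate a future model whose author disabled the cleanup on purpose:
tok = AutoTokenizer.from_pretrained(
    "bert-base-uncased", clean_up_tokenization_spaces=False
)
tok.save_pretrained("future-model")  # tokenizer_config.json now stores False

# A library that hardcodes True overrides that saved choice on load:
reloaded = AutoTokenizer.from_pretrained(
    "future-model", clean_up_tokenization_spaces=True
)
print(reloaded.clean_up_tokenization_spaces)  # True, not the author's False
```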

cc @itazap

  • Tom Aarsen

@itazap commented Sep 11, 2024

Hey! This is the upcoming PR that deprecates the default to False: huggingface/transformers#31938. It will keep clean_up_tokenization_spaces=True for models that require it, such as BERT-based ones and some others (see the modified files in the PR).

In terms of future models, the clean_up_tokenization_spaces function itself arbitrarily strips whitespace (post-tokenization), so I would say it is good practice for future models to have it explicitly set if that is the intention, or better yet to make it part of the tokenize logic directly. Let me know what you think 😄
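For reference, a small sketch of the effect described above ("bert-base-uncased" is just a placeholder checkpoint):

```python
# The flag controls whether decode() strips the extra spaces that
# punctuation splitting inserts during tokenization.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tok("Don't stop, it works!")["input_ids"]

print(tok.decode(ids, skip_special_tokens=True,
                 clean_up_tokenization_spaces=False))
# don ' t stop , it works !

print(tok.decode(ids, skip_special_tokens=True,
                 clean_up_tokenization_spaces=True))
# don't stop, it works!
```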

@pradip292

How do I solve this error, please?

@itazap commented Sep 16, 2024

@pradip292 what error are you experiencing? This warning is expected, in order to communicate the future deprecation.

@pradip292 commented Sep 16, 2024

> @pradip292 what error are you experiencing? This warning is expected, in order to communicate the future deprecation.

After this warning, my Streamlit app automatically stops. The warning is:

FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(

@itazap commented Sep 17, 2024

@pradip292 Can you paste the error? Perhaps your Streamlit app needs to suppress warnings to render, but the presence of the warning shouldn't result in an error.
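For anyone who only wants the message gone, a minimal sketch that suppresses just this FutureWarning (a workaround, not a fix; run it before loading the tokenizer):

```python
# Silence only this specific FutureWarning; all other warnings still show.
import warnings

warnings.filterwarnings(
    "ignore",
    message=".*clean_up_tokenization_spaces.*",
    category=FutureWarning,
)
```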

@pradip292

> @pradip292 Can you paste the error? Perhaps your Streamlit app needs to suppress warnings to render, but the presence of the warning shouldn't result in an error.

I did that, but I am still facing the same error.

@PhorstenkampFuzzy (Author)

I am getting the warning, but my application is working perfectly fine.
Are you sure your problem is related to the warning?
@pradip292

@pradip292

> I am getting the warning, but my application is working perfectly fine. Are you sure your problem is related to the warning? @pradip292

I will check and update you after some time.

@SDArtz commented Sep 18, 2024

![Screenshot 2024-09-18 at 9 59 14 AM](https://github.com/user-attachments/assets/e834cb71-9228-48a1-8c8d-805249ae31b5)

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(

@itazap commented Sep 18, 2024

@SDArtz @pradip292 thanks for providing the output. This output is an expected warning that we want to display; it is not an error.

@pradip292

> @SDArtz @pradip292 thanks for providing the output. This output is an expected warning that we want to display; it is not an error.

Now I am facing another error. I have tried many options, but it still shows this error. What should I do?

raise SSLError(e, request=request)
requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /sentence-transformers/all-mpnet-base-v2/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')))"), '(Request ID: 52c928fd-ec37-4b18-ab6a-5f11209595a2)')

@tomaarsen (Collaborator)

@pradip292 Are you able to browse to this URL: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/config.json ?

It seems that you were (temporarily or otherwise) unable to automatically download this file, which is required to initialize the model. It is unrelated to the clean_up_tokenization_spaces warning.
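A quick way to test that outside of Sentence Transformers, as a sketch (assumes the requests package is installed):

```python
# Fetch the same config.json the model loader needs; a 200 status code
# means the file is reachable from this machine.
import requests

url = ("https://huggingface.co/sentence-transformers/"
       "all-mpnet-base-v2/resolve/main/config.json")
print(requests.get(url, timeout=10).status_code)
```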

  • Tom Aarsen

@tomaarsen (Collaborator)

@itazap @ArthurZucker

> Hey! This is the upcoming PR that deprecates the default to False: huggingface/transformers#31938. It will keep clean_up_tokenization_spaces=True for models that require it, such as BERT-based ones and some others (see the modified files in the PR).
>
> In terms of future models, the clean_up_tokenization_spaces function itself arbitrarily strips whitespace (post-tokenization), so I would say it is good practice for future models to have it explicitly set if that is the intention, or better yet to make it part of the tokenize logic directly. Let me know what you think 😄

Thanks for the answer! This sounds like I should indeed defer to transformers and not hardcode anything in Sentence Transformers, on the understanding that you will keep it True for models that require it (and thus prevent any breaking changes). Please correct me if I'm wrong.
In that case, should I just wait for a new transformers version where the deprecation is merged and released?

  • Tom Aarsen

@pradip292

> It seems that you were (temporarily or otherwise) unable to automatically download this file, which is required to initialize the model. It is unrelated to the clean_up_tokenization_spaces warning.

Yes, I am facing that issue on my laptop only; it worked on my friends' laptops. I don't know how to deal with it. I have downloaded all the SSL files, but it still fails. Please help.

@itazap commented Sep 20, 2024

@tomaarsen yes, exactly: the deprecation will maintain True for models that require it! 😊

@Calabrone76

> requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /sentence-transformers/all-mpnet-base-v2/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')))"), '(Request ID: 52c928fd-ec37-4b18-ab6a-5f11209595a2)')

This is probably caused by a firewall (man in the middle) that replaces SSL certificates. We had similar errors with our company firewall. Check this out.

@keshavvarma

> Yes, I am facing that issue on my laptop only; it worked on my friends' laptops. I don't know how to deal with it. I have downloaded all the SSL files, but it still fails. Please help.

I am also facing the same issue on my laptop and don't know how to solve it.

@pradip292

Issue is solved :-)

@keshavvarma

> Issue is solved :-)

What is the solution? Can you please let me know?

@pradip292

> What is the solution? Can you please let me know?

Actually, that model is not working, so I am using another Hugging Face model; there are many different models out there. As for the SSL errors, I just needed to download some certificate files the code uses that were missing. I got help from ChatGPT and was able to solve the error. (Sorry for my English, I am a student.)

@keshavvarma

> Actually, that model is not working, so I am using another Hugging Face model; there are many different models out there. As for the SSL errors, I just needed to download some certificate files the code uses that were missing. I got help from ChatGPT and was able to solve the error.

Okay, I will try it out, thanks!!

@RisingInsight

I simply added clean_up_tokenization_spaces=False directly in the BertTokenizer call:

    self.tokenizer = BertTokenizer.from_pretrained(pretrained_bert_name, clean_up_tokenization_spaces=False)

@PhorstenkampFuzzy (Author)

Any news on a solution for the original issue?

@keshavvarma

Not yet; for the time being I changed the vector database to FAISS and the Groq model to llama3-8b-8192.
