
[SigLIP] Add fast tokenizer #29969

Open · wants to merge 29 commits into main

Conversation

@NielsRogge (Contributor) commented Mar 30, 2024

What does this PR do?

Fixes #29925.

🔴 Breaking change! Updates SiglipTokenizer, specifically the strip behaviour in tokenize

To do:

  • fix remaining tests
  • add slow integration test

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NielsRogge (Contributor, Author)
@ArthurZucker I'm down to these 3 tests failing:

FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_added_tokens_do_lower_case - AssertionError: 'aaaaa bbbbbb ' == 'aaaaa bbbbbb '
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_special_tokens_initialization - AssertionError: Lists differ: [342, 322, 291, 269, 262, 266, 32100, 507, 4290, 1] != [342, 322, 291, 269, 262, 266, 32100, 12936, 1]
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_tokenization_python_rust_equals - AssertionError: Sequences differ: [291,[64 chars]62, 232, 141, 158, 232, 141, 163, 232, 142, 16[5335 chars]3, 1] != [291,[64 chars]62, 2, 16577, 266, 2, 1443, 412, 282, 1791, 13[517...

but I don't really know how to fix these. Are you able to look into these?

@ArthurZucker (Collaborator) left a comment:

test_tokenization_python_rust_equals is the only one you really need. The others are not well designed TBH

Review comment (Collaborator):

should be removed


self.vocab_file = vocab_file

@property
Review comment (Collaborator):

Lots of "Copied from" comments are missing here as well.

tests/test_tokenization_common.py (review thread resolved)
github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@yxchng commented May 29, 2024

is this getting merged anytime soon?

@ArthurZucker (Collaborator)

Comments need to be addressed. cc @itazap if you want to take this over!

github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this on Jul 8, 2024
@NielsRogge (Contributor, Author)

Hi @itazap, do you have bandwidth to work on this? There are only 2 tests remaining to be fixed (for which I'd need some guidance).

@itazap (Collaborator) commented Jul 9, 2024

@NielsRogge Thank you for your patience! I'm looking into the failing tests now

itazap reopened this on Jul 9, 2024
@itazap (Collaborator) commented Jul 9, 2024

@NielsRogge Upon further inspection of the failing tests, the rust tokenizer is not equal to the python tokenizer. There are some key issues/differences, including:

  1. The SiglipConverter.normalizer is dropping all punctuation, including ["<", ">"], which need to be kept for the <special> tokens you are adding (see the sketch after this list):
     list_normalizers.append(normalizers.Replace(Regex(r"[" + re.escape(string.punctuation) + "]"), ""))
  2. Also in the SiglipConverter, I believe a WhitespaceSplit() should be added to the pre-tokenizer, like below (see T5):
     def pre_tokenizer(self, replacement, add_prefix_space):
         prepend_scheme = _get_prepend_scheme(add_prefix_space, self.original_tokenizer)
         return pre_tokenizers.Sequence(
             [
                 pre_tokenizers.WhitespaceSplit(),
                 pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme),
             ]
         )
  3. The rust (fast) and python tokenizers tokenize differently. For example, if you call .tokenize('<special> token'), the fast output will include ['token'], while the slow output will include ['to', 'ken']. I'm not sure which is expected as I am not too familiar with this model. Do you know which is expected?
  4. [EDIT] do_lower_case should be True for the failing test_added_tokens_do_lower_case
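Regarding point 1, here is a rough sketch of one way to keep those characters, assuming the tokenizers normalizers API; this is only an illustration of the idea, not necessarily how the converter should be changed:

    import re
    import string

    from tokenizers import Regex, normalizers

    # Hypothetical tweak: drop punctuation as before, but keep "<" and ">"
    # so added <special> tokens survive the normalization step.
    kept = {"<", ">"}
    punct_to_drop = "".join(ch for ch in string.punctuation if ch not in kept)
    strip_punct = normalizers.Replace(Regex("[" + re.escape(punct_to_drop) + "]"), "")

    print(strip_punct.normalize_str("hello, <special> world!"))  # -> "hello <special> world"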

This should be a good starting point in regards to where to debug. Let me know if I can answer any questions about the above!

Also, it may be helpful to merge the latest main code into this branch!

@NielsRogge (Contributor, Author) commented Jul 9, 2024

Thanks for looking into it. SiglipTokenizer is exactly the same as T5Tokenizer, except that it includes this function before calling _tokenize (as seen here). I was mainly wondering how this functionality could be incorporated into the fast tokenizer.
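For readers without the linked source handy, the preprocessing in question is roughly the following (a simplified paraphrase, not the verbatim implementation; the helper here is only illustrative):

    import re
    import string

    def canonicalize_text(text, *, keep_punctuation_exact_string=None):
        # Roughly: drop punctuation (optionally preserving one exact string,
        # e.g. a special token), collapse repeated whitespace, strip the ends.
        def remove_punctuation(s):
            return s.translate(str.maketrans("", "", string.punctuation))

        if keep_punctuation_exact_string:
            text = keep_punctuation_exact_string.join(
                remove_punctuation(part) for part in text.split(keep_punctuation_exact_string)
            )
        else:
            text = remove_punctuation(text)
        text = re.sub(r"\s+", " ", text)
        return text.strip()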

@itazap (Collaborator) commented Jul 9, 2024

@NielsRogge In that function, the text = text.strip() is causing the discrepancy. In PreTrainedTokenizer.tokenize(), the input string gets split on special tokens. Your canonicalize_text may then be called on a substring (such as a single token) when special tokens are present, or on the whole string/sentence when there are none (example: ' token' vs 'hey token'). I'm not exactly sure what canonicalize_text is trying to achieve with the .strip(), but the link you provided mentions only punctuation removal and lowercasing. Without the .strip() it should be more consistent, but I'm not sure it will give the same results.
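To make the failure mode concrete, here is a minimal, self-contained illustration using plain strings only (no real tokenizer involved; the helper below is a simplified stand-in, not the actual implementation):

    import re

    SPECIAL = "<special>"

    def canonicalize(text, do_strip=True):
        # Simplified stand-in: collapse whitespace and optionally strip the ends.
        text = re.sub(r"\s+", " ", text)
        return text.strip() if do_strip else text

    sentence = f"{SPECIAL} token"

    # tokenize() first splits on special tokens, so the preprocessing only ever
    # sees the chunk after "<special>", i.e. " token" rather than the whole string.
    chunk = sentence.split(SPECIAL)[1]

    print(repr(canonicalize(chunk, do_strip=True)))   # 'token'  -- leading space (word boundary) lost
    print(repr(canonicalize(chunk, do_strip=False)))  # ' token' -- word boundary kept for SentencePiece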


github-actions bot commented Aug 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this on Aug 12, 2024
@NielsRogge (Contributor, Author)

Hi @itazap, would you be interested in working on this?

@itazap (Collaborator) commented Aug 19, 2024

@NielsRogge Yes I can look into this! 😄 I may follow up to ask about expected behaviour of siglip 👀

itazap reopened this on Aug 19, 2024
@itazap (Collaborator) commented Aug 22, 2024

@NielsRogge are you able to please rebase when you have a chance? I don't have permission to push a rebase on this!

@itazap (Collaborator) commented Aug 26, 2024

Summary of changes:

  • test_chat_template_return_assistant_tokens_mask is skipped, because SigLIP strips the punctuation used in chat templates and this test is too specific in matching punctuation characters such as the pipe
  • self.assertNotEqual(sp_tokens, tokens) is removed from SigLIP and T5. Each of these variables is explicitly tested to be Equal to an expected value, and the two being NotEqual is not something we actually want to enforce. We already check what each of the two is Equal to!
  • The canonicalize_text function in tokenization_siglip should not strip, based on the paper source. This may affect the current SigLIP slow/python tokenizer.

Re-requested a review from @ArthurZucker, thanks!

@ArthurZucker (Collaborator) left a comment:

Thanks, I don't think we need a class, as the key part is to properly convert to a PreTrainedTokenizerFast!

Review comment (Collaborator):

to remove

src/transformers/models/siglip/tokenization_siglip.py (review thread outdated, resolved)
Review comment (Collaborator):

Not sure we even need that class, no? We could just use a PreTrainedTokenizerFast.

Reply (Collaborator):

Oh I didn't consider we could do that! Is there an example I can reference? Not sure what to do with the functions copied over from T5 here.

Also, looking more into the functions here, would it be better to move common functions like save_vocabulary (duplicated in 15 _fast files) to PreTrainedTokenizerFast?

Reply (Contributor, Author):

Just a question: wouldn't that confuse users if there's no dedicated class with the name of the model?

Reply (Collaborator):

@NielsRogge yes, I see your point; I also had the same thought. Do we have other fast models we support without a dedicated class? @ArthurZucker

Reply (Collaborator):

We have quite a lot of other models that just use PreTrainedTokenizerFast, including Llama (Llama v3), all the Mambas, etc. Tokenizers are more prone to change than models (you could have Mamba with LlamaTokenizer), so it makes more sense to deprecate slow and model-specific ones.
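As a quick illustration of that pattern (a sketch only; the file path and repo id below are placeholders, not real checkpoints):

    from transformers import AutoTokenizer, PreTrainedTokenizerFast

    # Option 1: load straight from a serialized `tokenizers` file -- no
    # model-specific wrapper class is involved (path is illustrative).
    tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

    # Option 2: checkpoints whose tokenizer_config.json sets
    # "tokenizer_class": "PreTrainedTokenizerFast" resolve to the generic
    # class through AutoTokenizer (repo id is hypothetical).
    tok = AutoTokenizer.from_pretrained("some-org/some-llama3-style-checkpoint")
    assert isinstance(tok, PreTrainedTokenizerFast)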

Reply (Collaborator):

You don't need them; they are built on top of PreTrainedTokenizerFast, plus we can embed stuff inside the fast tokenizer itself.

Reply (Collaborator):

@ArthurZucker Sorry, I'm not really understanding: do you mean to just remove this file entirely and not worry about the functions? (Or do we need to embed them somewhere?)

Reply (Collaborator):

yeah just remove it entirely

Reply (Collaborator):

@ArthurZucker and would we have to add a tokenizer.json to the hub?


itazap force-pushed the add_siglip_fast_tokenizer_bis branch from 239665e to e73fa01 on September 20, 2024 at 15:41
@itazap (Collaborator) left a comment:

Going to request that the commits adding the tokenizer.json files be merged on the Hub!

@ArthurZucker (Collaborator)

Sounds good! ~

@HuggingFaceDocBuilderDev

Hey! 🤗 Thanks for your contribution to the transformers library!

Before merging this pull request, slow tests CI should be triggered. To enable this:

  • Add the run-slow label to the PR
  • When your PR is ready for merge and all reviewers' comments have been addressed, push an empty commit with the command [run-slow] followed by a comma separated list of all the models to be tested, i.e. [run_slow] model_to_test_1, model_to_test_2
    • If the pull request affects a lot of models, put at most 10 models in the commit message
  • A transformers maintainer will then approve the workflow to start the tests

(For maintainers) The documentation for slow tests CI on PRs is here.

src/transformers/models/siglip/__init__.py (review thread outdated, resolved)
src/transformers/models/siglip/__init__.py (review thread outdated, resolved)
tests/models/llama/test_tokenization_llama.py (review thread outdated, resolved)
itazap and others added 4 commits October 1, 2024 15:32
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@@ -201,6 +201,11 @@ def test_PreTrainedTokenizerFast_from_pretrained(self):
self.assertEqual(tokenizer.padding_side, "right")
self.assertEqual(tokenizer.truncation_side, "right")

def test_PreTrainedTokenizerFast_inferred(self):
Review comment (Collaborator):

@ArthurZucker added this test for #33751 as Siglip would be the first model that can test this (#33751 (comment))
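For context, a sketch of what such an "inferred fast tokenizer" check might look like (the class name, checkpoint id, and assertion below are assumptions for illustration, not the actual test added in this PR):

    import unittest

    from transformers import AutoTokenizer, PreTrainedTokenizerFast


    class SiglipFastTokenizerInferredTest(unittest.TestCase):
        def test_fast_tokenizer_inferred(self):
            # Assumption: the checkpoint ships a tokenizer.json but no dedicated
            # SiglipTokenizerFast class, so AutoTokenizer should fall back to the
            # generic PreTrainedTokenizerFast.
            tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
            self.assertIsInstance(tokenizer, PreTrainedTokenizerFast)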

"SiglipTokenizer" if is_sentencepiece_available() else None,
"PreTrainedTokenizerFast" if is_tokenizers_available() else None,
),
),
@ArthurZucker (Collaborator) left a comment:

Let's revert the small change to the Llama test and it's good to merge!

tests/models/llama/test_tokenization_llama.py (review thread outdated, resolved)

Successfully merging this pull request may close these issues.

SigLIP tokenizer not enforcing use_fast=True
5 participants