SPLIT PR: add user defined symbols and control symbols #31305

itazap · 2024-06-07T07:46:00Z

Fixes portion of #30824 ->

adds user_defined_symbols and control_symbols from proto so that they are not split during encoding / decoding.
tests gemma
removed same logic from GemmaConverter as it is handled in general SpmConverter class
updates common test (!!!) bc fast != slow since fast an read from SPM converter and get user_defined_symbols and control_symbols. @ArthurZucker thoughts on this?
copied test from common to camember and rembert tests
updated deberta v2 tests to not use '.' in test cases' texts because it is a user added token for fast tokenizers, so the spacing around it differs from slow.
add_special_tokens and add_tokens has same effect for now - but fix is already merged in Tokenizers by @ArthurZucker remove enforcement of non special when adding tokens tokenizers#1521

TODO in new PR:

other PR for prefix space (mentioned in SPMConverter does not always add the user defined symbol -> slow fast is thus not equivalent #30824) -> PR open: SPLIT PR: add_prefix_space fix #31315
open issue to inspect other attrs of proto to see if they can be added in conversion

HuggingFaceDocBuilderDev · 2024-06-07T08:05:13Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…en, so a space is not added

ArthurZucker

LGTM! I think we should check the type of the defined symbols see this comment offline:

            if piece.type == 4:
                tokens_to_add += [piece.piece]

regarding trainer_spec.user_defined_symbols not giving the same results

ArthurZucker · 2024-06-18T12:37:44Z

src/transformers/convert_slow_tokenizer.py

+            AddedToken(token, normalized=False, special=True) for token in self.proto.trainer_spec.control_symbols
+        ]
+
+        tokenizer.add_tokens(user_defined_symbols)


Cool! Related fix to only add once: huggingface/tokenizers@f2ec3b2

maybe using map will be faster? but anyway it's good enough fro what we want !

ArthurZucker · 2024-06-18T12:39:17Z

tests/models/deberta_v2/test_tokenization_deberta_v2.py

why did you change the encoded sequence here?

…e instead of trainer_spec for user_defined_symbols

ArthurZucker

Thanks for iterating! Last nit and good to go

ArthurZucker · 2024-06-20T16:03:41Z

src/transformers/convert_slow_tokenizer.py

+        # Add user defined symbols
+        user_defined_symbols = [
+            AddedToken(token, normalized=False, special=False)
+            for token in [p.piece for p in self.proto.pieces if p.type == 4]


let's add a comment that references the documentation of sentencepiece as to why we use 4 (arbitrary number)

ArthurZucker · 2024-06-20T16:04:08Z

src/transformers/convert_slow_tokenizer.py

+            AddedToken(token, normalized=False, special=True) for token in self.proto.trainer_spec.control_symbols
+        ]
+
+        tokenizer.add_tokens(user_defined_symbols + control_symbols)


Ok this will work with the release of tokenizers, so fine for me! Good job

ArthurZucker · 2024-06-20T16:06:33Z

src/transformers/convert_slow_tokenizer.py

we can also propagate to gemma! (only has this for now:

user_defined_symbols = [ AddedToken(token, normalized=True, special=False) for token in proto.trainer_spec.user_defined_symbols ] tokenizer.add_tokens(user_defined_symbols)

* PR SPLIT: moving origina changes for adding user defined symbols * adding gemma test and generalizing gemma converter * ruff * update common test * update serialization test * deberta v2 tests updates as rust version adds '.' as a user added token, so a space is not added * removing commented lines * applying feedback - user only added_tokens to add and check piece.type instead of trainer_spec for user_defined_symbols * add comment referencing sentencepiece

itazap requested a review from ArthurZucker June 7, 2024 10:17

itazap changed the title ~~PR SPLIT: moving origina changes for adding user defined symbols~~ PR SPLIT: moving original changes for adding user defined symbols Jun 12, 2024

ArthurZucker removed their request for review June 12, 2024 15:13

itazap changed the title ~~PR SPLIT: moving original changes for adding user defined symbols~~ PR SPLIT: add user defined symbols and control symbols Jun 13, 2024

itazap requested a review from ArthurZucker June 13, 2024 08:39

itazap marked this pull request as ready for review June 13, 2024 13:28

itazap force-pushed the user_defined_symbols_add_30824 branch 6 times, most recently from 97d0d35 to befadd1 Compare June 17, 2024 08:57

itazap added 7 commits June 17, 2024 11:00

PR SPLIT: moving origina changes for adding user defined symbols

1519792

adding gemma test and generalizing gemma converter

00c2d51

ruff

b3b6a2e

update common test

e0e8df0

update serialization test

d0e04b8

deberta v2 tests updates as rust version adds '.' as a user added tok…

a122c54

…en, so a space is not added

removing commented lines

904121e

itazap force-pushed the user_defined_symbols_add_30824 branch from befadd1 to 904121e Compare June 17, 2024 09:00

itazap changed the title ~~PR SPLIT: add user defined symbols and control symbols~~ SPLIT PR: add user defined symbols and control symbols Jun 18, 2024

ArthurZucker reviewed Jun 18, 2024

View reviewed changes

applying feedback - user only added_tokens to add and check piece.typ…

1e538ba

…e instead of trainer_spec for user_defined_symbols

itazap requested a review from ArthurZucker June 20, 2024 15:05

ArthurZucker approved these changes Jun 20, 2024

View reviewed changes

add comment referencing sentencepiece

7ef1011

itazap merged commit 1e79ead into main Jun 21, 2024
24 checks passed

itazap deleted the user_defined_symbols_add_30824 branch June 21, 2024 08:48

mo-arvan mentioned this pull request Jun 22, 2024

GGUFTokenizerSkeleton AttributeError during conversion #31553

Closed

4 tasks

SunMarc mentioned this pull request Jun 24, 2024

Fix llama gguf converter #31575

Merged

This was referenced Jul 10, 2024

[Severe Bug] Performance Degradation Starting from v4.42.* #31890

Closed

adding user defined tokens #30824 #30929

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SPLIT PR: add user defined symbols and control symbols #31305

SPLIT PR: add user defined symbols and control symbols #31305

itazap commented Jun 7, 2024 •

edited

Loading

HuggingFaceDocBuilderDev commented Jun 7, 2024

ArthurZucker left a comment

ArthurZucker Jun 18, 2024

ArthurZucker Jun 18, 2024

ArthurZucker Jun 18, 2024

ArthurZucker left a comment

ArthurZucker Jun 20, 2024

ArthurZucker Jun 20, 2024

ArthurZucker Jun 20, 2024

SPLIT PR: add user defined symbols and control symbols #31305

SPLIT PR: add user defined symbols and control symbols #31305

Conversation

itazap commented Jun 7, 2024 • edited Loading

HuggingFaceDocBuilderDev commented Jun 7, 2024

ArthurZucker left a comment

Choose a reason for hiding this comment

ArthurZucker Jun 18, 2024

Choose a reason for hiding this comment

ArthurZucker Jun 18, 2024

Choose a reason for hiding this comment

ArthurZucker Jun 18, 2024

Choose a reason for hiding this comment

ArthurZucker left a comment

Choose a reason for hiding this comment

ArthurZucker Jun 20, 2024

Choose a reason for hiding this comment

ArthurZucker Jun 20, 2024

Choose a reason for hiding this comment

ArthurZucker Jun 20, 2024

Choose a reason for hiding this comment

itazap commented Jun 7, 2024 •

edited

Loading