Add M2M100TokenizerFast (+ convert_slow_tokenizer implementation) #25478
Conversation
* Add `M2M100Tokenizer`
* Allow `added_tokens` list to be empty
* Apply hot-fix for issue in HF's `M2M100Tokenizer`
* Skip M2M100 tokenizer tests for now. TODO: Remove when huggingface/transformers#25478 is merged
* Fix `_build_translation_inputs` for `M2M100Tokenizer`
* Add example code in JSDoc for `TranslationPipeline`
* Update supported_models.py
Feel free to ping me when you need a review!
@ArthurZucker Hey! 👋 I think this is ready for your review. The tokenizer is quite similar to NLLB, which was added here and then updated by you here. I haven't added a tokenizer before, and one thing I am missing is a unit test for hf-internal-testing/tiny-random-m2m_100, due to its missing tokenizer.json file. Here is a tokenizer file which is generated from the code above, but then updated by hand to work correctly. So that functionality is still missing from the conversion script (i.e., …)
Hey! This looks good already; could you give me more details on how you converted the model? The tests should also be updated, I think (apply the integration tests to both tokenizers!).
```python
with open(self.original_vocab_file) as fp:
    vocab = json.load(fp)

vocab = list(sorted(vocab.items(), key=lambda x: x[1]))
```
If `<s>`, `<unk>`, etc. are not part of the `original_vocab_file`, then you need to add them here at the beginning.
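For illustration, a minimal sketch of that suggestion (hypothetical code, not part of this PR; it assumes `vocab` is the id-sorted list of `(token, id)` pairs from the excerpt above):

```python
# Hypothetical sketch: prepend any special tokens missing from vocab.json so
# they receive the lowest ids, then re-number the remaining tokens.
special_tokens = ["<s>", "<pad>", "</s>", "<unk>"]

tokens = [tok for tok, _ in vocab]  # vocab: list of (token, id) pairs sorted by id
missing = [tok for tok in special_tokens if tok not in set(tokens)]
vocab = [(tok, idx) for idx, tok in enumerate(missing + tokens)]
```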
They are in https://huggingface.co/facebook/m2m100_418M/raw/main/vocab.json 👀, so I think they should already be included here, no?
```python
    [t for t in additional_special_tokens if t not in _additional_special_tokens]
)

self.add_special_tokens({"additional_special_tokens": _additional_special_tokens})
```
This should be adding the tokens to the list of special tokens; the problem is that in `fast`, if the tokens are already part of the vocabulary they will not be added again. For Llama we used:

```python
# AddedToken comes from the `tokenizers` library
tokenizer.add_special_tokens(
    [
        AddedToken("<unk>"),
        AddedToken("<s>"),
        AddedToken("</s>"),
    ]
)
```

in the `convert_slow_tokenizer.py` script.
Did you convert the slow tokenizer using `from_slow=True`? (I am not really surprised by this bug.)
> Did you convert the slow tokenizer using `from_slow=True`?

I didn't add any parameters other than the ones in the original comment 😅 ... Is this needed, considering there is no fast tokenizer to convert from?
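For context, a rough sketch of what `from_slow=True` does once a fast tokenizer class exists for the model (checkpoint name is illustrative):

```python
from transformers import AutoTokenizer

# `from_slow=True` forces the fast tokenizer to be rebuilt from the slow
# (sentencepiece-based) tokenizer via convert_slow_tokenizer, instead of
# loading a pre-existing tokenizer.json from the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("facebook/m2m100_418M", use_fast=True, from_slow=True)
```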
```python
        self._src_lang = new_src_lang
        self.set_src_lang_special_tokens(self._src_lang)

    def build_inputs_with_special_tokens(
```
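For context, in `NllbTokenizerFast` (which this code mirrors) the method looks roughly like this sketch; exact details may differ in the PR:

```python
# Rough sketch of the NLLB-style implementation: the language code and EOS
# special tokens are kept in self.prefix_tokens / self.suffix_tokens and
# wrapped around the token ids.
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        return self.prefix_tokens + token_ids_0 + self.suffix_tokens
    # For sequence pairs, both sequences go between the same special tokens.
    return self.prefix_tokens + token_ids_0 + token_ids_1 + self.suffix_tokens
```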
Missing `# Copied from` statements here and wherever you copied from!
Oh yes, my bad! Will add that :) It was from `NllbTokenizerFast`:

```python
class NllbTokenizerFast(PreTrainedTokenizerFast):
```
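For reference, the convention being asked for is the `# Copied from` comment placed above each copied method; for example (target path illustrative):

```python
# Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast.build_inputs_with_special_tokens
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    ...
```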
I just used Optimum to convert the model to ONNX, but this shouldn't matter for this PR (which only concerns the tokenizer).
Sorry, by model I meant the tokenizer model (the backend sentencepiece model, if you will). I am trying to understand why you had to manually add the added tokens, and the …
Oh yes, of course 😆 (silly me). I converted it with this code:

```python
from transformers import convert_slow_tokenizer, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Xenova/m2m100_418M', use_fast=False)
fast_tokenizer = convert_slow_tokenizer.convert_slow_tokenizer(tokenizer)
fast_tokenizer.save('tokenizer.json')
```

I'm not 100% sure; I just ran some unit tests for transformers.js and found that they failed (and after inspecting, I found that it was just missing added tokens). I'll play around with a few more things today!
@ArthurZucker I added the "copied from" annotations, as well as the special tokens (like it's done for Llama). On the latter point, the tokenizer includes … Here's the output from running the above code (zipped because GH doesn't like JSON 🤷).
Regarding the madeup words, it depends; I would just say let's follow what's done for the slow tokenizer!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Should this be reopened, @xenova? :)
I marked it as a draft since I have been quite busy with some more important things, with the idea that I would return to it eventually once I had more time 😅. It's basically just missing some unit tests and example tokenizer files.
What does this PR do?
Adds fast tokenizer support for `M2M100Tokenizer`. I also added a `convert_slow_tokenizer` config to support generating a tokenizer.json file. For example, the generated tokenizer files for facebook/m2m100_418M can be found here.

The fast tokenizer format is needed for transformers.js (see issue and PR). This may or may not be super relevant (considering the models are quite old), but there was a feature request in transformers.js, and I needed to write this code to get it working, so I thought I might as well share it here.
I have run the tokenizer on a variety of test cases, and it passes each one. Additionally, it fixes a bug/inconsistency with the slow tokenizer (actually found in all sentencepiece tokenizers), where whitespace after special tokens is removed. To make the test cases pass, I basically hardcoded a fix for it (for now at least).
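A rough illustration of the inconsistency being described (illustrative only; exact outputs depend on the tokenizer version):

```python
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("facebook/m2m100_418M", use_fast=False)

# The slow sentencepiece tokenizer strips the whitespace that follows a
# special token, so these two inputs can tokenize to the same ids; the fast
# tokenizer in this PR hardcodes a fix so the space is preserved.
print(slow.tokenize("</s> Hello world"))
print(slow.tokenize("</s>Hello world"))
```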
Example conversion code
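A minimal sketch of the conversion, based on the snippet shared earlier in the thread (checkpoint name and output path are illustrative):

```python
from transformers import AutoTokenizer, convert_slow_tokenizer

# Load the slow (sentencepiece) tokenizer, convert it to a `tokenizers`
# Tokenizer object, then serialize it to tokenizer.json.
slow = AutoTokenizer.from_pretrained("facebook/m2m100_418M", use_fast=False)
fast = convert_slow_tokenizer.convert_slow_tokenizer(slow)
fast.save("tokenizer.json")
```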
Example usage code
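A hypothetical usage sketch, assuming the new `M2M100TokenizerFast` mirrors the slow `M2M100Tokenizer` API (`src_lang` attribute and `get_lang_id` helper):

```python
from transformers import M2M100ForConditionalGeneration, M2M100TokenizerFast

tokenizer = M2M100TokenizerFast.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

# Translate English -> French: set the source language, then force the target
# language id as the first generated token.
tokenizer.src_lang = "en"
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```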
NOTE: To get facebook/m2m100_418M working, we need to remove the `"tokenizer_file": null,` line from tokenizer_config.json.

Fixes # (issue)
TODO

* `NllbTokenizerFast`
* (`</s>` and similar tokens)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@ArthurZucker