NLLB tokenizer #18126
Conversation
Co-authored-by: Stefan Schweter <stefan@schweter.it>
All models are now public, feel free to try them out @stefan-it. The generation seems good; I have not tried fine-tuning yet.
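For anyone who wants to try the released checkpoints, a minimal generation sketch could look like the following (the checkpoint name, language codes, and max_length are illustrative assumptions, not values from this PR):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative checkpoint and language codes; any of the released NLLB
# dense checkpoints should work the same way.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("The generation seems good.", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force the first generated token to be the target language code.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=50,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```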
The documentation is not available anymore as the PR was closed or merged.
I don't know of a better place to post (an issue?), so I'll do it here :) @LysandreJik Thank you so much for adding support for the NLLB dense models! I pulled this branch, tried all of them, and they work great! There is the following place in the readme, so it would be really great if you could also add the MoE models! I tried to figure out the original repo, but it turned out to be unexpectedly difficult and I couldn't get MoE to run. So if you add MoE models, I'm sure it will make a lot of people happier, at least me :)
@LysandreJik Thanks a lot for your prompt work! I tried using the NLLB model from Hugging Face and noticed one problem: max_length is not set in config.json for any of the NLLB models, so the default value of max_length (20) is used.
As a result, your example code cannot generate more than 20 tokens. It is possible to set max_length higher when calling the translation method, but it would be great to have a meaningful default as well. For comparison, both the M2M and MBart50 models have max_length set to 200 in their config.json files.
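Until a default lands in the configs, max_length can be passed explicitly at call time. A small sketch of the workaround (the checkpoint and language codes are illustrative):

```python
from transformers import pipeline

# Illustrative checkpoint and language codes.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="fra_Latn",
)

# Without max_length, generation stops after the default 20 tokens;
# passing it explicitly at call time avoids the truncation described above.
print(translator("This sentence would otherwise be cut off after twenty tokens.", max_length=200))
```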
How is the default max_length determined per model? Is it documented in their white papers? With this PR, I have started evaluating the extremely large model (facebook/nllb-200-3.3B) against GCP translation, and so far it is doing really well despite the length of text I give it. Still, I want to give it the best chance to perform, so knowing the ideal max_length would help.
I think the usual default for max_length is to be equal to the maximum input length. The translation pipeline in transformers checks whether the input is longer than 90% of max_length and warns if so. transformers/src/transformers/pipelines/text2text_generation.py Lines 272 to 278 in 33028f4
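For reference, the check pointed at above looks roughly like the sketch below (a paraphrase of the pipeline's length check, not an exact copy of the lines in commit 33028f4):

```python
import logging

logger = logging.getLogger(__name__)

# Rough paraphrase of TranslationPipeline.check_inputs: warn when the input
# is already longer than 90% of the requested max_length, since the
# translation would likely be truncated.
def check_inputs(input_length: int, min_length: int, max_length: int) -> bool:
    if input_length > 0.9 * max_length:
        logger.warning(
            f"Your input_length: {input_length} is bigger than 0.9 * max_length: {max_length}. "
            "You might consider increasing your max_length manually, e.g. translator('...', max_length=400)"
        )
    return True
```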
Thanks for adding this model! Would it make sense to add a default model for NLLB in the auto mappings?
Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* NLLB tokenizer
* Apply suggestions from code review - Thanks Stefan!
Co-authored-by: Stefan Schweter <stefan@schweter.it>
* Final touches
* Style :)
* Update docs/source/en/model_doc/nllb.mdx
Co-authored-by: Stefan Schweter <stefan@schweter.it>
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* PR reviews
* Auto models

Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
as mentioned here: #19943
I've just seen this example from the original code base, where the lang-token is prepended 🤔
Right. Also, I am wondering why they use "</s>", which is the EOS token, as the start token of the source sequence (in fact, the same for the target sequence). I would have expected BOS there. It seems they use EOS instead of BOS, and that they put an EOS at the start of the source sequence.
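One quick way to see exactly which special tokens end up where, without reading the tokenizer source, is to round-trip a sentence through the tokenizer and print the tokens. The checkpoint and source language below are illustrative:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint and source language.
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)

ids = tokenizer("Hello world")["input_ids"]
# Printing the tokens makes it obvious whether the language code and the
# "</s>" (EOS) token are placed at the start or the end of the sequence.
print(tokenizer.convert_ids_to_tokens(ids))
```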
Adds the NLLB tokenizer. In order to run:
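The run snippet from the PR description is not reproduced here; a minimal sketch of trying the tokenizer from this branch might look like the following (the checkpoint name and language codes are illustrative, not taken from the PR itself):

```python
# Install from the PR branch first, e.g.:
#   pip install git+https://github.com/huggingface/transformers.git@<branch>
from transformers import NllbTokenizer

# Illustrative checkpoint and language codes.
tokenizer = NllbTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn"
)
inputs = tokenizer("UN Chief says there is no military solution in Syria", return_tensors="pt")
print(inputs["input_ids"])
```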
Closes #18043