Fast tokenizers Bartpho Phobert Bertweet #23

datquocnguyen · 2022-08-13T04:50:44Z

What does this PR do?

Fixes # (issue)

Before submitting

* This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
* Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

* NLLB tokenizer
* Apply suggestions from code review - Thanks Stefan!
  Co-authored-by: Stefan Schweter <stefan@schweter.it>
* Final touches
* Style :)
* Update docs/source/en/model_doc/nllb.mdx
  Co-authored-by: Stefan Schweter <stefan@schweter.it>
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* PR reviews
* Auto models

Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [HPO] update to sigopt new experiment api
* follow https://docs.sigopt.com/experiments
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* [HPO] use new API if sigopt version >= 8.0.0
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
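The version gate described in the last bullet can be sketched as below. This is only an illustration under stated assumptions, not the code from the commit: `parse_version` and `use_new_sigopt_api` are hypothetical helper names, and the parser assumes dotted numeric version strings. Only the 8.0.0 threshold comes from the commit message.

```python
import re

# Threshold taken from the commit message ("sigopt version >= 8.0.0").
NEW_API_MIN_VERSION = (8, 0, 0)


def parse_version(version):
    """Return the first three numeric components of a version string, zero-padded.

    Hypothetical helper: assumes versions look like "8.0.0" or "7.5".
    """
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def use_new_sigopt_api(installed_version):
    """Decide which code path to take for a given installed version string."""
    return parse_version(installed_version) >= NEW_API_MIN_VERSION
```

In practice the installed version string could be read with `importlib.metadata.version("sigopt")` before calling the helper.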

* minor fixes
  - add correct revision
  - corrected dosctring for test
  - removed a test
* contrib credits

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>

…7662)

* added training.mdx
* updated training.mdx
* updated training.mdx
* updated training.mdx
* updated _toctree.yml
* fixed typos after review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

…ggingface#18184)

* add first generation tutorial
* [from_pretrained] Allow loading models from subfolders
* remove gen file
* add doc strings
* allow download from subfolder
* add tests
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* apply comments
* correct doc string

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Initial work
* More work
* Add tests for custom pipelines on the Hub
* Protect import
* Make the test work for TF as well
* Last PyTorch specific bit
* Add documentation
* Style
* Title in toc
* Bad names!
* Update docs/source/en/add_new_pipeline.mdx
  Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
* Auto stash before merge of "custom_pipeline" and "origin/custom_pipeline"
* Address review comments
* Address more review comments
* Update src/transformers/pipelines/__init__.py
  Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

@michaelbenayoun

…upport Opacus training (huggingface#18486)

* changing BartLearnedPositionalEmbedding forward signature and references to it
* removing debugging dead code (thanks style checker)
* blackened modeling_bart file
* removing copy inconsistencies via make fix-copies
* changing references to copied signatures in Bart variants
* make fix-copies once more
* using expand over repeat (thanks @michaelbenayoun)
* expand instead of repeat for all model copies

Co-authored-by: Daniel Jones <jonesdaniel@microsoft.com>

* Create _config.py
* Create _toctree.yml
* Create index.mdx
  not sure about "du / ihr" or "sie"
* Create quicktour.mdx
* Update _toctree.yml
* Update build_documentation.yml
* Update build_pr_documentation.yml
* fix build
* Update index.mdx
* Update quicktour.mdx
* Create installation.mdx
* Update _toctree.yml

…face#18272)

* Fix critical trace warnings to allow ONNX export
* Force input to `sqrt` to be float type
* Cleanup code
* Remove unused import statement
* Update model sew
* Small refactor
  Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
* Use broadcasting instead of repeat
* Implement suggestion
  Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
* Match deberta v2 changes in sew_d
* Improve code quality
* Update code quality
* Consistency of small refactor
* Match changes in sew_d

Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>

…ert (huggingface#18565)

Bumps [nbconvert](https://github.com/jupyter/nbconvert) from 6.0.1 to 6.3.0.
- [Release notes](https://github.com/jupyter/nbconvert/releases)
- [Commits](jupyter/nbconvert@6.0.1...6.3.0)

---
updated-dependencies:
- dependency-name: nbconvert
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…e#18566)

Bumps [nbconvert](https://github.com/jupyter/nbconvert) from 6.0.1 to 6.3.0.
- [Release notes](https://github.com/jupyter/nbconvert/releases)
- [Commits](jupyter/nbconvert@6.0.1...6.3.0)

---
updated-dependencies:
- dependency-name: nbconvert
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* initial commit
* add small test
* add cross pt tf flag to test
* fix quality
* style
* update test with new repo
* fix failing test
* update
* fix wrong param ordering
* style
* update based on review
* update related to recent new caching mechanism
* quality
* Update based on review
  Co-authored-by: sgugger <sylvain.gugger@gmail.com>
* quality and style
* Update src/transformers/modeling_flax_utils.py
  Co-authored-by: sgugger <sylvain.gugger@gmail.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

…ngface#18576)

* update doc for perf_train_cpu_many, add mpi introduction
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Update docs/source/en/perf_train_cpu_many.mdx
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update docs/source/en/perf_train_cpu_many.mdx
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

…#18579)

* Supporting seq2seq models for `bitsandbytes` integration
  - `bitsandbytes` integration supports now seq2seq models
  - check if a model has tied weights as an additional check
* small modification
  - tie the weights before looking at tied weights!

* First draft
* Improve script
* Update script
* Make conversion work
* Add final_layer_norm attribute to Swin's config
* Add DonutProcessor
* Convert more models
* Improve feature extractor and convert base models
* Fix bug
* Improve integration tests
* Improve integration tests and add model to README
* Add doc test
* Add feature extractor to docs
* Fix integration tests
* Remove register_buffer
* Fix toctree and add missing attribute
* Add DonutSwin
* Make conversion script work
* Improve conversion script
* Address comment
* Fix bug
* Fix another bug
* Remove deprecated method from docs
* Make Swin and Swinv2 untouched
* Fix code examples
* Fix processor
* Update model_type to donut-swin
* Add feature extractor tests, add token2json method, improve feature extractor
* Fix failing tests, remove integration test
* Add do_thumbnail for consistency
* Improve code examples
* Add code example for document parsing
* Add DonutSwin to MODEL_NAMES_MAPPING
* Add model to appropriate place in toctree
* Update namespace to appropriate organization

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>

* [fsmt] deal with -100 indices in decoder ids

  Fixes: huggingface#17945

  decoder ids get the default index -100, which breaks the model - like t5 and many other models add a fix to replace -100 with the correct pad index.

  For some reason this use case hasn't been used with this model until recently - so this issue was there since the beginning it seems.

  Any suggestions to how to add a simple test here? or perhaps we have something similar already? user's script is quite massive.

* style
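The fix described in this commit message, replacing -100 with the pad index when decoder ids are derived from labels, can be sketched in plain Python as below. This is a minimal illustration using nested lists rather than tensors, not the actual FSMT patch. The function name mirrors a helper commonly used for this purpose, but the signature and the token ids in the usage example are assumptions.

```python
IGNORE_INDEX = -100  # label value that the loss conventionally ignores


def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder input ids from labels.

    Each row is shifted one position to the right, the start token is placed
    first, and any remaining -100 entries are replaced with the pad id so that
    an embedding lookup never receives a negative index.
    """
    shifted = []
    for row in labels:
        new_row = [decoder_start_token_id] + list(row[:-1])
        shifted.append(
            [pad_token_id if tok == IGNORE_INDEX else tok for tok in new_row]
        )
    return shifted


# Hypothetical ids, for illustration only: pad=1, start=2.
labels = [[5, 6, 2, -100, -100]]
decoder_input_ids = shift_tokens_right(labels, pad_token_id=1, decoder_start_token_id=2)
```

A test for this behavior could simply assert that no -100 survives in the returned ids for a batch of padded labels.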

HuggingFaceDocBuilderDev · 2022-08-13T05:03:46Z

The documentation is not available anymore as the PR was closed or merged.

JohnGiorgi and others added 30 commits July 18, 2022 09:50

Fix check for falsey inputs in run_summarization (huggingface#18155)

c46d39f

Fix incorrect type hint for lang (huggingface#18161)

a4f97e6

add ONNX support for LeVit (huggingface#18154)

8c14b34

Co-authored-by: Guilhem Chéron <guilhemc@authentifier.com>

Fix expected loss values in some (m)T5 tests (huggingface#18177)

cb19c2a

* fix expected loss values Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Update TF(Vision)EncoderDecoderModel PT/TF equivalence tests (hugging…

6561fbc

…face#18073) Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

FIX: Typo (huggingface#18156)

4525581

Fix template for new models in README (huggingface#18182)

aeeab1f

Better default for offload_state_dict in from_pretrained (huggingface…

edadfc5

…#18183)

fix typo inside bloom documentation (huggingface#18187)

ced1f1f

Added preprocessing.mdx italian translation (huggingface#17600)

0a5b61d

* updated _toctree.yml * added preprocessing * updated preprocessing.mdx * updated preprocessing.mdx updated after review

Translation italian: multilingual.mdx (huggingface#17768)

c4cc894

* added multilingual.mdx * updated multilingual.mdx * italian translation multilingual.mdx * updated _toctree.yml * fixed typos _toctree.yml * fixed typos after review * fixed error after review

FSDP integration enhancements and fixes (huggingface#18134)

bc8e30b

* FSDP integration enhancements and fixes * resolving comments * fsdp fp16 mixed precision requires `ShardedGradScaler`

Use smaller variant of BLOOM for doc to fix tests

29fd471

Merge pull request #13 from huggingface/main

9630bce

Update

Remove use_auth_token from the from_config method (huggingface#18192)

4bea658

* remove use_auth_token from from_config * restore use_auth_token from_pretrained run_t5_mlm_flax

Add vision example to README (huggingface#18194)

e630dad

bugfix: div-->dim (huggingface#18135)

7983844

Update docs README with instructions on locally previewing docs (hugg…

ce01528

…ingface#18196) * Update docs README with instructions on locally previewing docs * Add instructions to install `watchdog` before previewing the docs

Typo in readme (huggingface#18195)

9f12ec7

Use next-gen CircleCI convenience images (huggingface#18197)

05ed569

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

TF: Add missing cast to GPT-J (huggingface#18201)

ec6cd76

* Fix TF GPT-J tests * add try/finally block

Reduce console spam when using the KerasMetricCallback (huggingface#1…

8a61fe0

…8202) * Reduce console spam when using the KerasMetricCallback * Switch to predict_on_batch to improve performance

update cache to v0.5 (huggingface#18203)

4b1ed79

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Fix LayoutXLM docstrings (huggingface#17038)

0ed4d0d

* Fix docstrings * Fix legacy issue * up * apply suggestions * up * quality

joihn and others added 28 commits August 11, 2022 10:59

Segformer TF: fix output size in documentation (huggingface#18572)

76568d2

* Segformer TF: fix output size in doc * Segformer pytorch: fix output size in doc Co-authored-by: Maxime Gardoni <maxime.gardoni@ecorobotix.com>

Fix resizing bug in OWL-ViT (huggingface#18573)

f762f37

* Fixes resizing bug in OWL-ViT * Defaults to square resize if size is set to an int * Sets do_center_crop default value to False
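The second bullet, defaulting to a square resize when `size` is an int, amounts to a small normalization step. The sketch below only illustrates that idea: `normalize_size` is a hypothetical helper, not the OWL-ViT feature extractor's code.

```python
def normalize_size(size):
    """Return a (height, width) tuple.

    An int is interpreted as a square target size; a two-element sequence is
    passed through unchanged. Hypothetical helper for illustration.
    """
    if isinstance(size, int):
        return (size, size)
    height, width = size
    return (height, width)
```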

Fix LayoutLMv3 documentation (huggingface#17932)

4c8ec66

* fix typos * fix sequence_length docs of LayoutLMv3Model * delete trailing white spaces * fix layoutlmv3 docs more * apply make fixup & quality * change to two versions of input docstring * apply make fixup & quality

Skip broken tests

3f0707b

[FX] _generate_dummy_input supports audio-classification models for l…

42b8940

…abels (huggingface#18580) * Support audio classification architectures for labels generation, as well as provides a flag to print warnings or not * Use ENV_VARS_TRUE_VALUES

Fix docstrings with last version of hf-doc-builder styler (huggingfac…

c23cbdf

…e#18581) * Fix docstrings with last version of hf-doc-builder styler * Remove empty Parameter block

fix owlvit tests, update docstring examples (huggingface#18586)

f28f240

Return the permuted hidden states if return_dict=True (huggingface#18578

c8b6ae8

)

Add type hints for ViLT models (huggingface#18577)

46d0941

* Add type hints for Vilt models * Add missing return type for TokenClassification class

typos (huggingface#18594)

d344534

FSDP bug fix for load_state_dict (huggingface#18596)

4eed2be

Add TFAutoModelForSemanticSegmentation to the main __init__.py (h…

2156619

…uggingface#18600) Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Generate: validate model_kwargs (and catch typos in generate argume…

ed1924e

…nts) (huggingface#18261) * validate generate model_kwargs * generate tests -- not all models have an attn mask
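One way to catch misspelled keyword arguments, as this commit aims to do for `generate`, is to compare the supplied names against the target function's signature. The sketch below shows that general technique with the standard `inspect` module. It is not the implementation from the commit, and `find_unexpected_kwargs`, `validate_kwargs`, and `toy_forward` are hypothetical names.

```python
import inspect


def find_unexpected_kwargs(fn, kwargs):
    """Return the keyword names in `kwargs` that `fn` does not accept."""
    params = inspect.signature(fn).parameters
    # A function that takes **kwargs accepts any name, so nothing can be flagged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in kwargs if name not in params]


def validate_kwargs(fn, kwargs):
    """Raise a ValueError that lists unexpected keyword arguments, if any."""
    unexpected = find_unexpected_kwargs(fn, kwargs)
    if unexpected:
        raise ValueError(f"Unexpected keyword arguments: {unexpected}")


def toy_forward(input_ids, attention_mask=None):
    """Stand-in for a model's forward method (illustration only)."""
    return input_ids
```

With this check, a call that passes `atention_mask` instead of `attention_mask` fails with an explicit error rather than silently ignoring the argument.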

Fix URLs (huggingface#18604)

153d136

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>

Update BLOOM parameter counts (huggingface#18531)

56ef0ba

* Update BLOOM parameter counts * Update BLOOM parameter counts

[doc] fix anchors (huggingface#18591)

37c5991

the manual anchors end up being duplicated with automatically added anchors and no longer work.

small change (huggingface#18584)

1ccd251

Merge pull request #22 from huggingface/main

0db5b71

Update with the latest commits

datquocnguyen merged commit c21aadb into main Aug 13, 2022
