Add Information Gain Filtration algorithm #16953

mraunak · 2022-04-27T00:50:31Z

What does this PR do?

Adding a new feature for fine-tuning transformer models called Information Gain Filtration 'IGF'

Motivation

The quality of a fine-tuned model depends on the quality of the data samples used for the first few batches. As the process is stochastic, a random seed would influence the quality of the final fine-tuned model. We are proposing a novel and robust fine-tuning method “Information Gain Filtration” (IGF), which filters informative training samples before a fine-tuning (training) and improves the overall training efficiency and final performance of the language model fine-tuning step

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. This can be of interest to @sgugger,

Models:

gpt2

Examples:

research_projects/information-gain-filtration: @TuKo,@mraunak

HuggingFaceDocBuilderDev · 2022-04-27T01:07:13Z

The documentation is not available anymore as the PR was closed or merged.

LysandreJik

LGTM, thank you @mraunak!

LysandreJik · 2022-04-27T16:23:12Z

Could you just run the code quality tool to ensure that the code quality passes? You can install them with the following, from the root of your clone:

pip install -e ".[quality]"

And then run them with:

make fixup

mraunak · 2022-05-02T16:23:03Z

Running the command: make fixup, gives an error that does not include terms from my PR,

The output with error is shown below. Please guide me on it. Thanks

(igfprnew) mraunak@bcl-main1:~/transformers$ make fixup
No library .py files were modified
python utils/custom_init_isort.py
python utils/style_doc.py src/transformers docs/source --max_len 119
running deps_table_update
updating src/transformers/dependency_versions_table.py
python utils/check_copies.py
python utils/check_table.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
python utils/check_dummies.py
python utils/check_repo.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Checking all models are included.
Checking all models are public.
Checking all models are properly tested.
Checking all objects are properly documented.
Checking all models are in at least one auto class.
utils/check_repo.py:456: UserWarning: Full quality checks require all backends to be installed (with pip install -e .[dev] in the Transformers repo, the following are missing: PyTorch, TensorFlow, Flax. While it's probably fine as long as you didn't make any change in one of those backends modeling files, you should probably execute the command above to be on the safe side.
warnings.warn(
python utils/check_inits.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Traceback (most recent call last):
File "utils/check_inits.py", line 265, in
check_submodules()
File "utils/check_inits.py", line 256, in check_submodules
raise ValueError(
ValueError: The following submodules are not properly registed in the main init of Transformers:

sagemaker
activations
activations_tf
convert_slow_tokenizer
deepspeed
generation_beam_constraints
generation_beam_search
generation_flax_logits_process
generation_flax_utils
generation_logits_process
generation_stopping_criteria
generation_tf_logits_process
generation_tf_utils
generation_utils
image_utils
keras_callbacks
modeling_flax_outputs
modeling_flax_utils
modeling_outputs
modeling_tf_outputs
modeling_tf_utils
modeling_utils
optimization
optimization_tf
pytorch_utils
tf_utils
trainer
trainer_pt_utils
trainer_seq2seq
trainer_tf
data.datasets
Make sure they appear somewhere in the keys of _import_structure with an empty list as value.
make: *** [repo-consistency] Error 1

TuKo · 2022-05-10T20:56:21Z

@LysandreJik thanks for the suggestion! We were able to correct the quality check issues.
Let us know if you need us to run/test anything else. Thank you!

LysandreJik · 2022-05-18T14:39:08Z

Thank you for your contributions!

TuKo · 2022-05-18T14:46:11Z

Thank you for accepting our work!

commit 5419205 Author: Patrick von Platen <patrick.v.platen@gmail.com> Date: Thu May 19 23:46:26 2022 +0200 [Test OPT] Add batch generation test opt (huggingface#17359) * up * up commit 48c2269 Author: ddobokki <44228269+ddobokki@users.noreply.github.com> Date: Fri May 20 05:42:44 2022 +0900 Fix bug in Wav2Vec2 pretrain example (huggingface#17326) commit 5d6feec Author: Nathan Dahlberg <58701810+nadahlberg@users.noreply.github.com> Date: Thu May 19 16:21:19 2022 -0400 fix for 17292 (huggingface#17293) commit 518bd02 Author: Patrick von Platen <patrick.v.platen@gmail.com> Date: Thu May 19 22:17:02 2022 +0200 [Generation] Fix Transition probs (huggingface#17311) * [Draft] fix transition probs * up * up * up * make it work * fix * finish * update commit e8714c0 Author: Patrick von Platen <patrick.v.platen@gmail.com> Date: Thu May 19 22:15:36 2022 +0200 [OPT] Run test in lower precision on GPU (huggingface#17353) * [OPT] Run test only in half precision * up * up * up * up * finish * fix on GPU * Update tests/models/opt/test_modeling_opt.py commit 2b28229 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Thu May 19 20:28:12 2022 +0200 Adding `batch_size` test to QA pipeline. (huggingface#17330) commit a4386d7 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Thu May 19 10:29:16 2022 +0200 [BC] Fixing usage of text pairs (huggingface#17324) * [BC] Fixing usage of text pairs The BC is actually preventing users from misusing the pipeline since users could have been willing to send text pairs and the pipeline would instead understand the thing as a batch returning bogus results. The correct usage of text pairs is preserved in this PR even when that makes the code clunky. Adds support for {"text":..,, "text_pair": ...} inputs for both dataset iteration and more explicit usage to pairs. * Updating the doc. * Update src/transformers/pipelines/text_classification.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/pipelines/text_classification.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update tests/pipelines/test_pipelines_text_classification.py Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * quality. Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Lysandre Debut <lysandre@huggingface.co> commit 3601aa8 Author: Stas Bekman <stas00@users.noreply.github.com> Date: Wed May 18 16:00:47 2022 -0700 [tests] fix copy-n-paste error (huggingface#17312) * [tests] fix copy-n-paste error * fix commit 1b20c97 Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Wed May 18 21:49:08 2022 +0200 Fix ci_url might be None (huggingface#17332) * fix * Update utils/notification_service.py Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> commit 6aad387 Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Wed May 18 21:26:44 2022 +0200 fix (huggingface#17337) Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> commit 1762ded Author: Zachary Mueller <muellerzr@gmail.com> Date: Wed May 18 14:17:40 2022 -0400 Fix metric calculation in examples and setup tests to run on multi-gpu for no_trainer scripts (huggingface#17331) * Fix length in no_trainer examples * Add setup and teardown * Use new accelerator config generator to automatically make tests able to run based on environment commit 6e195eb Author: Jader Martins <jadermcs@users.noreply.github.com> Date: Wed May 18 14:18:43 2022 -0300 docs for typical decoding (huggingface#17186) Co-authored-by: Jader Martins <jadermcs94@gmail.com> commit 060fe61 Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Wed May 18 19:07:48 2022 +0200 Not send successful report (huggingface#17329) * send report only if there is any failure Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> commit b3b9f99 Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Wed May 18 17:57:23 2022 +0200 Fix test_t5_decoder_model_past_large_inputs (huggingface#17320) Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> commit 6da76b9 Author: Jingya HUANG <44135271+JingyaHuang@users.noreply.github.com> Date: Wed May 18 17:52:13 2022 +0200 Add onnx export cuda support (huggingface#17183) Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> commit adc0ff2 Author: NielsRogge <48327001+NielsRogge@users.noreply.github.com> Date: Wed May 18 17:47:18 2022 +0200 Add CvT (huggingface#17299) * Adding cvt files * Adding cvt files * changes in init file * Adding cvt files * changes in init file * Style fixes * Address comments from code review * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Format lists in docstring * Fix copies * Apply suggestion from code review Co-authored-by: AnugunjNaman <anugunjjha@gmail.com> Co-authored-by: Ayushman Singh <singhayushman13@protonmail.com> Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> commit 4710702 Author: Sylvain Gugger <Sylvain.gugger@gmail.com> Date: Wed May 18 10:46:40 2022 -0400 Fix style commit 5fdb54e Author: mraunak <83710963+mraunak@users.noreply.github.com> Date: Wed May 18 10:39:02 2022 -0400 Add Information Gain Filtration algorithm (huggingface#16953) * Add information gain filtration algorithm * Complying with black requirements * Added author * Fixed import order * flake8 corrections Co-authored-by: Javier Turek <javier.turek@intel.com> commit 91ede48 Author: Kamal Raj <kamalraj97@gmail.com> Date: Wed May 18 19:59:53 2022 +0530 Fix typo (huggingface#17328) commit fe28eb9 Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Wed May 18 16:06:41 2022 +0200 remove (huggingface#17325) Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> commit 2cb2ea3 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Wed May 18 16:06:24 2022 +0200 Accepting real pytorch device as arguments. (huggingface#17318) * Accepting real pytorch device as arguments. * is_torch_available. commit 1c9d1f4 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Wed May 18 15:46:12 2022 +0200 Updating the docs for `max_seq_len` in QA pipeline (huggingface#17316) commit 60ad734 Author: Patrick von Platen <patrick.v.platen@gmail.com> Date: Wed May 18 15:08:56 2022 +0200 [T5] Fix init in TF and Flax for pretraining (huggingface#17294) * fix init * Apply suggestions from code review * fix * finish * Update src/transformers/modeling_tf_utils.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> commit 7ba1d4e Author: Joaq <55513213+jQuinRivero@users.noreply.github.com> Date: Wed May 18 09:23:47 2022 -0300 Add type hints for ProphetNet (Pytorch) (huggingface#17223) * added type hints to prophetnet * reformatted with black * fix bc black misformatted some parts * fix imports * fix imports * Update src/transformers/models/prophetnet/configuration_prophetnet.py Co-authored-by: Matt <Rocketknight1@users.noreply.github.com> * update OPTIONAL type hint and docstring Co-authored-by: Matt <Rocketknight1@users.noreply.github.com> commit d6b8e9c Author: Carl <carl.cochet@gmail.com> Date: Wed May 18 01:07:43 2022 +0200 Add trajectory transformer (huggingface#17141) * Add trajectory transformer Fix model init Fix end of lines for .mdx files Add trajectory transformer model to toctree Add forward input docs Fix docs, remove prints, simplify prediction test Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Update docs, more descriptive comments Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Update readme Small comment update and add conversion script Rebase and reformat Fix copies Fix rebase, remove duplicates Fix rebase, remove duplicates * Remove tapex * Remove tapex * Remove tapex commit c352640 Author: Patrick von Platen <patrick.v.platen@gmail.com> Date: Wed May 18 00:34:31 2022 +0200 fix (huggingface#17310) commit d9050dc Author: Cesare Campagnano <cesare.campagnano@gmail.com> Date: Tue May 17 23:44:37 2022 +0200 [LED] fix global_attention_mask not being passed for generation and docs clarification about grad checkpointing (huggingface#17112) * [LED] fixed global_attention_mask not passed for generation + docs clarification for gradient checkpointing * LED docs clarification Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * [LED] gradient_checkpointing=True should be passed to TrainingArguments Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * [LED] docs: remove wrong word Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * [LED] docs fix typo Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> commit bad3583 Author: Jean Vancoppenolle <jean.vcop@gmail.com> Date: Tue May 17 23:42:14 2022 +0200 Add support for pretraining recurring span selection to Splinter (huggingface#17247) * Add SplinterForSpanSelection for pre-training recurring span selection. * Formatting. * Rename SplinterForSpanSelection to SplinterForPreTraining. * Ensure repo consistency * Fixup changes * Address SplinterForPreTraining PR comments * Incorporate feedback and derive multiple question tokens per example. * Update src/transformers/models/splinter/modeling_splinter.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/models/splinter/modeling_splinter.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Jean Vancoppenole <jean.vancoppenolle@retresco.de> Co-authored-by: Tobias Günther <tobias.guenther@retresco.de> Co-authored-by: Tobias Günther <github@tobigue.de> Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> commit 0511305 Author: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Date: Tue May 17 18:56:58 2022 +0200 Add PR author in CI report + merged by info (huggingface#17298) * Add author info to CI report * Add merged by info * update Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> commit 032d63b Author: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Date: Tue May 17 12:56:24 2022 -0400 Fix dummy creation script (huggingface#17304) commit 986dd5c Author: Sylvain Gugger <Sylvain.gugger@gmail.com> Date: Tue May 17 12:50:14 2022 -0400 Fix style commit 38ddab1 Author: Karim Foda <35491698+KMFODA@users.noreply.github.com> Date: Tue May 17 09:32:12 2022 -0700 Doctest longformer (huggingface#16441) * Add initial doctring changes * make fixup * Add TF doc changes * fix seq classifier output * fix quality errors * t * swithc head to random init * Fix expected outputs * Update src/transformers/models/longformer/modeling_longformer.py Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> commit 10704e1 Author: Patrick von Platen <patrick.v.platen@gmail.com> Date: Tue May 17 18:20:36 2022 +0200 [Test] Fix W2V-Conformer integration test (huggingface#17303) * [Test] Fix W2V-Conformer integration test * correct w2v2 * up commit 28a0811 Author: regisss <15324346+regisss@users.noreply.github.com> Date: Tue May 17 17:58:14 2022 +0200 Improve mismatched sizes management when loading a pretrained model (huggingface#17257) - Add --ignore_mismatched_sizes argument to classification examples - Expand the error message when loading a model whose head dimensions are different from expected dimensions commit 1f13ba8 Author: Patrick von Platen <patrick.v.platen@gmail.com> Date: Tue May 17 15:48:23 2022 +0200 correct opt (huggingface#17301) commit 349f1c8 Author: Matt <Rocketknight1@users.noreply.github.com> Date: Tue May 17 14:36:23 2022 +0100 Rewrite TensorFlow train_step and test_step (huggingface#17057) * Initial commit * Better label renaming * Remove breakpoint before pushing (this is your job) * Test a lot more in the Keras fit() test * make fixup * Clarify the case where we flatten y dicts into tensors * Clarify the case where we flatten y dicts into tensors * Extract label name remapping to a method commit 651e48e Author: Matt <Rocketknight1@users.noreply.github.com> Date: Tue May 17 14:14:17 2022 +0100 Fix tests of mixed precision now that experimental is deprecated (huggingface#17300) * Fix tests of mixed precision now that experimental is deprecated * Fix mixed precision in training_args_tf.py too commit 6d21142 Author: SaulLu <55560583+SaulLu@users.noreply.github.com> Date: Tue May 17 14:33:13 2022 +0200 fix retribert's `test_torch_encode_plus_sent_to_model` (huggingface#17231)

* Add information gain filtration algorithm * Complying with black requirements * Added author * Fixed import order * flake8 corrections Co-authored-by: Javier Turek <javier.turek@intel.com>

Add information gain filtration algorithm

240a2f2

LysandreJik approved these changes Apr 27, 2022

View reviewed changes

TuKo added 5 commits May 10, 2022 11:12

Complying with black requirements

31b3b8e

Added author

4ff8edb

Fixed import order

5d9c6e1

flake8 corrections

1424245

Merge branch 'master' into igf-pr

46a8753

LysandreJik merged commit 5fdb54e into huggingface:main May 18, 2022

TuKo deleted the igf-pr branch May 18, 2022 14:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Information Gain Filtration algorithm #16953

Add Information Gain Filtration algorithm #16953

mraunak commented Apr 27, 2022 •

edited

Loading

HuggingFaceDocBuilderDev commented Apr 27, 2022 •

edited

Loading

LysandreJik left a comment

LysandreJik commented Apr 27, 2022

mraunak commented May 2, 2022

TuKo commented May 10, 2022

LysandreJik commented May 18, 2022

TuKo commented May 18, 2022

Add Information Gain Filtration algorithm #16953

Add Information Gain Filtration algorithm #16953

Conversation

mraunak commented Apr 27, 2022 • edited Loading

What does this PR do?

Motivation

Before submitting

Who can review?

HuggingFaceDocBuilderDev commented Apr 27, 2022 • edited Loading

LysandreJik left a comment

Choose a reason for hiding this comment

LysandreJik commented Apr 27, 2022

mraunak commented May 2, 2022

TuKo commented May 10, 2022

LysandreJik commented May 18, 2022

TuKo commented May 18, 2022

mraunak commented Apr 27, 2022 •

edited

Loading

HuggingFaceDocBuilderDev commented Apr 27, 2022 •

edited

Loading