main to ssl synthesis (#11)
* bug fix - sample rate was being ignored in vocoder dataset when not loading mel

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>

* handled n segments for a different sampling rate than original sampling rate

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>

* Added case for n_segments 0, warning for n_segments greater than file length

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>

* Fix metric setup for finetuning without a test set (NVIDIA#4585)

* Fix metric setup for finetuning without a test set

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix log key

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove pdb

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Minor

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix skip train ds building while finetuning

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>

* r1.10.0 MegaMolBART Compatibility (NVIDIA#4603)

* 1. Added vocab_size property to RegExTokenizer.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed passing hiddens directly.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added support in encoder outputs.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Added comments.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Added automatic mapping of kwargs to args in forward.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Added encode function.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. PP and TP works (but not together)

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Separated get_forward_output_only_func_encode and get_forward_output_only_func_decode.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* Set headscale false (NVIDIA#4364)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Add wandb as dependency (NVIDIA#4365)

Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Raise trainer error (NVIDIA#4356)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>

* Set headscale false (NVIDIA#4364) (NVIDIA#4366)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Finetuning changes for BART (NVIDIA#4003)

* Temp

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Checkpoint converter to nemo for bart

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>

* Make position embedding expansion specific to a batch to avoid checkpoint size mismatches (NVIDIA#4357)

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix logging warning

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>

* 1. Added return logits to validation.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed unknown token during sampling.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed RegExTokenizer loading.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed ckpt file with samples int(0).

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed regex tokenizer.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed allowing enc_tokens to be None.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added ability to ignore tokens by id during decode.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed regex tokenizer .nemo loading issue.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed RegEx test.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* r1.10.0 untie embeddings weights (NVIDIA#4519)

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added independent decoder embeddings, and independent decoder token_head.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added support in yaml config.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed initialization.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added tests for untied embeddings and decoder token head.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Updated share_word_embeddings to share_token_embeddings.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed error in __del__ when TextMemMapDataset fails to build.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed comments.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Made method private.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed config names.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed alerts and style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed PP, TP, PP+TP still fails.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

Co-authored-by: Micha Livne <mlivne@nvidia.com>
Co-authored-by: ericharper <complex451@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>

* Update megatron t5 interface to dialogue (NVIDIA#4626)

* G2P Aligner (NVIDIA#4604)

* Aligner inference notebook in progress. Preprocessing, forward, attn viz

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Hard attn, duration extraction, distance matrix

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Started: phoneme disambiguation using Aligner distance matrix

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Decouple encode_from_g2p() from phoneme tokenizer encode() for disambiguation inference

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Aligner G2P disambiguation using mean L2 embedding distance

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Rename aligner inference notebook

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Header text for Aligner notebook, formatting

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Aligner notebook formatting, header, license updates

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Aligner G2P disambiguation script draft

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Aligner G2P disambiguation script finished

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Remove normalization step to fix words with apostrophes (G2P)

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Fix normalization args for G2P disambiguation

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Allow str to be passed in for supp data, add 'text_normalized' as manifest option

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Aligner G2P script fixes: normalization, tokenization, add brackets around tokens, etc.

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Only disambiguate words in the given heteronyms list

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Filtering option for disambiguation script

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Add confidence thresholding, add PASTY to cmudict entries

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* TTS Aligner tutorial updates to generic path text

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Add confidence to aligner_g2p.py run example

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Move avg word distance function to Aligner encoder, add docstring, fix license

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Aligner Inference notebook updates (link to sample, resources added)

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Fix HF check for model card info (NVIDIA#4628)

Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Tiny VAD refactoring for postprocessing (NVIDIA#4625)

* binarization start index

Signed-off-by: fayejf <fayejf07@gmail.com>

* fix frame len

Signed-off-by: fayejf <fayejf07@gmail.com>

* style fix

Signed-off-by: fayejf <fayejf07@gmail.com>

* rename UNIT_FRAME_LEN

Signed-off-by: fayejf <fayejf07@gmail.com>

* update overlap script and fix lgtm

Signed-off-by: fayejf <fayejf07@gmail.com>

* style fix

Signed-off-by: fayejf <fayejf07@gmail.com>

* Fix ITN pt (NVIDIA#4623)

Signed-off-by: Guilherme Steinmann <guist@linse.ufsc.br>

* [TN] bug fix "hundred" in Audio-based, added method to split text into sentences (NVIDIA#4610)

* fix duplex inference with grammars

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix hundred TN audio bug, add split text

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix header year

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* style fix

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* exclude I from roman-ordinal form

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix graph_with_and

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix tests

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix split regex

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix warning

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* [Text Processing] G2P for OOV and heteronyms (NVIDIA#4624)

* add models

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix header and t5 inference

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix lgtm

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* review fixes

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix if/else and removed unused imports

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* replace ModelPT with G2PModel

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* black

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* add missing headers

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix TRANSFORMERS_OFFLINE flag

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* jenkins

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* Update README.rst

* Fp16 support for Conformer (NVIDIA#4571)

* adding auto-select best precision for mhsa

* cleanup

* moving mhsa32 check into mhsa

* switching to torch.cuda.is_bf16_supported()

* now using torch.is_autocast_enabled()

* added to non rel mhsa

* only forcing 32bit subsampling if using bf16

* removing unused imports

* moving contexts to utils

Signed-off-by: Dima Rekesh <drekesh@nvidia.com>

* formatting

Signed-off-by: Dima Rekesh <drekesh@nvidia.com>

* naming

Co-authored-by: Dima Rekesh <drekesh@nvidia.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>

* Maximum sample-based training for Megatron NMT and Text Memmap based Seq2seq Pre-training (NVIDIA#4396)

* Update blendable dataset, and refactor seq2seq data

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Blendable dataset with binarized mmap working

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Pass seed from cfg to dataset

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix multilingual setup

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Add on epoch start reconfiguration

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Update tokenizer creation for multilingual

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Tmp

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Update NMT script

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove unused import

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Update training script

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Log consumed samples

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Logging on val epoch end

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove redundant print

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Ckpt averaging for non model parallel megatron models

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Empty

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Update error message

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove check

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Restore fixes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove ipdb

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fixes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Testing a simple solution

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed. Seems to work. Need to validate.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added support for CSV and text memmap to Megatron encoder-decoder

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added support in CSV.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.
2. Fixed bugs.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed bugs.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Updated yaml.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed warnings.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed a bug.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added a test for text_memmap

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* Fix retro

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* add docstrings

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Minor

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Uncomment CI tests and fix existing gpt ci tests

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Tmp

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove max step hacking and move on_train_batch_end to base model

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Empty

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>
Co-authored-by: Micha Livne <mlivne@cs.toronto.edu>
Co-authored-by: Eric Harper <complex451@gmail.com>

* NeMo Megatron Doc updates1 (NVIDIA#4633)

* Work on NeMo Megatron OSS documentation

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>

* NeMo Megatron doc updates

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>

Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>
Co-authored-by: Micha Livne <mlivne@nvidia.com>
Co-authored-by: ericharper <complex451@gmail.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Zhilin Wang <wangzhilin12061996@hotmail.com>
Co-authored-by: Jocelyn <jocelynh@nvidia.com>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Co-authored-by: Guilherme Steinmann <guist@linse.ufsc.br>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Dima Rekesh <bmwshop@gmail.com>
Co-authored-by: Dima Rekesh <drekesh@nvidia.com>
Co-authored-by: Micha Livne <mlivne@cs.toronto.edu>
16 people authored Jul 31, 2022
1 parent 30e0e5f commit 3104325
Showing 101 changed files with 6,178 additions and 591 deletions.
249 changes: 170 additions & 79 deletions Jenkinsfile

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion README.rst
@@ -202,7 +202,7 @@ Megatron GPT training requires NVIDIA Apex to be installed.
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 5d8c8a8eedaf567d56f0762a45431baf9c0e800e
git checkout 3c19f1061879394f28272a99a7ea26d58f72dace
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./
Docker containers:
7 changes: 4 additions & 3 deletions docs/source/index.rst
@@ -40,10 +40,11 @@ NVIDIA NeMo User Guide
:caption: Natural Language Processing
:name: Natural Language Processing

nlp/models
nlp/megatron
nlp/api
nlp/nemo_megatron/intro
nlp/machine_translation/machine_translation
nlp/text_normalization/intro
nlp/api


.. toctree::
:maxdepth: 2
File renamed without changes.
21 changes: 21 additions & 0 deletions docs/source/nlp/nemo_megatron/batching.rst
@@ -0,0 +1,21 @@
.. _batching:

Batching
--------

Batch size is one of the first parameters you should tune. For both efficiency and convergence, we recommend first maximizing your batch size per GPU so that GPU memory is fully utilized.

NeMo Megatron uses the following concepts.

*Micro batch size* is the number of examples processed in a single forward/backward pass on each data-parallel rank. It is controlled by the ``model.micro_batch_size`` parameter.

*Global batch size* = micro_batch_size * data_parallel_size * gradient_accumulation_steps. For details on ``data_parallel_size`` see the :ref:`parallelisms` section; typically it equals the number of GPUs being used.
The global batch size is controlled by the ``model.global_batch_size`` parameter.
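
For example, in a hypothetical run on 8 GPUs with no tensor or pipeline parallelism (so ``data_parallel_size=8``), setting ``model.micro_batch_size=4`` and ``model.global_batch_size=64`` implies 64 / (4 * 8) = 2 gradient accumulation steps per optimizer update.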


*Gradient Accumulation*

* Idea: train with large batch sizes at a fixed memory footprint, at the cost of additional compute.
* Do k forward and backward passes through the network with different batches; do not perform parameter updates until after the k passes.
* Update parameters once after the k passes.

232 changes: 232 additions & 0 deletions docs/source/nlp/nemo_megatron/gpt/gpt_training.rst
@@ -0,0 +1,232 @@
GPT model training
------------------

GPT is a decoder-only Transformer model.


Quick start
^^^^^^^^^^^
The steps below demonstrate training of a GPT-style model with NeMo.

Data download & pre-processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::
    Data download, pre-processing and tokenizer training in the example below will take ~3 hours.

**Step 1: Download data**

The step below will download Wikipedia data (around 20GB) and can take several hours.

.. code-block:: bash

    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

**Step 2: Extract raw data**

.. code-block:: bash

    pip install wikiextractor
    python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
    find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl

Now, ``train_data.jsonl`` will contain our training data in the JSON Lines format. We are interested in the data under the "text" field.
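
Each line is one JSON object holding an article, with the article body under the "text" key. A quick way to inspect a record is shown below; the exact field names depend on the wikiextractor version, so treat the sample output as a sketch:

.. code-block:: bash

    head -n 1 train_data.jsonl
    # e.g. {"id": "12", "url": "https://en.wikipedia.org/wiki?curid=12", "title": "Anarchism", "text": "Anarchism is a political philosophy ..."}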


**Step 3: Train tokenizer**

Below we consider two options for the training data tokenizer: using the pre-built HuggingFace GPT2 BPE tokenizer files, or training your own Google SentencePiece tokenizer.
Note that only the second option allows you to experiment with the vocabulary size.

*Option 1:* Using HuggingFace GPT2 tokenizer files.

With this option we simply download the pre-built vocabulary and merges files for the BPE tokenizer.

.. code-block:: bash

    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

*Option 2:* Using the `Google SentencePiece <https://github.com/google/sentencepiece>`_ tokenizer library.

It comes as a dependency of NeMo, so if you have installed NeMo it should already be available.
Note that training the tokenizer model will also take some time.

.. code-block:: bash

    sudo apt install jq
    jq .text train_data.jsonl >> text_for_tokenizer.txt
    spm_train --input=text_for_tokenizer.txt \
        --model_prefix=spm_32k_wiki \
        --vocab_size=32768 \
        --character_coverage=0.9999 \
        --model_type=bpe \
        --byte_fallback=true \
        --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 \
        --split_digits true

After this is done (it will take a while), you'll have two files, ``spm_32k_wiki.model`` and ``spm_32k_wiki.vocab``, which correspond to the model and the vocabulary.

**Step 4: Convert training data into memory map format**

This format makes training more efficient, especially with many nodes and GPUs. This step will also tokenize the data using the tokenizer model from Step 3.

*Option 1:* Using HuggingFace GPT2 tokenizer files.

.. code-block:: bash

    python <NeMo_ROOT_FOLDER>/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
        --input=train_data.jsonl \
        --json-keys=text \
        --tokenizer-library=megatron \
        --vocab gpt2-vocab.json \
        --dataset-impl mmap \
        --tokenizer-type GPT2BPETokenizer \
        --merge-file gpt2-merges.txt \
        --output-prefix=hfbpe_gpt_training_data \
        --append-eod \
        --workers=32

*Option 2:* Using the `Google SentencePiece <https://github.com/google/sentencepiece>`_ tokenizer library.

.. code-block:: bash

    python <NeMo_ROOT_FOLDER>/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
        --input=train_data.jsonl \
        --json-keys=text \
        --tokenizer-library=sentencepiece \
        --tokenizer-model=spm_32k_wiki.model \
        --output-prefix=gpt_training_data \
        --append-eod \
        --workers=32

Train GPT-style Model
~~~~~~~~~~~~~~~~~~~~~

Once you have prepared the training data and tokenizer, you are ready to train the model.
The configuration presented below has about 124M parameters and should fit on a single 16GB GPU when using float16.
Let's go!!!
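
As a rough sanity check on that number (assuming a GPT-2-style vocabulary of roughly 50k tokens): each Transformer layer holds about 12 * h^2 weights, where h is the hidden size (4h^2 for attention plus 8h^2 for the feed-forward block), so 12 layers at ``hidden_size=768`` give ~85M parameters, and the token and position embeddings add roughly 51k * 768 ≈ 39M more, for ≈124M in total.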

*Option 1:* Using HuggingFace GPT2 tokenizer files.

.. code-block:: bash

    python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
        --config-path=<NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/conf \
        --config-name=megatron_gpt_config \
        trainer.devices=1 \
        trainer.num_nodes=1 \
        trainer.max_epochs=null \
        trainer.max_steps=300000 \
        trainer.val_check_interval=300 \
        trainer.log_every_n_steps=50 \
        trainer.limit_val_batches=50 \
        trainer.limit_test_batches=50 \
        trainer.accumulate_grad_batches=1 \
        trainer.precision=16 \
        model.micro_batch_size=6 \
        model.global_batch_size=192 \
        model.tensor_model_parallel_size=1 \
        model.pipeline_model_parallel_size=1 \
        model.max_position_embeddings=1024 \
        model.encoder_seq_length=1024 \
        model.hidden_size=768 \
        model.ffn_hidden_size=3072 \
        model.num_layers=12 \
        model.num_attention_heads=12 \
        model.init_method_std=0.021 \
        model.hidden_dropout=0.1 \
        model.layernorm_epsilon=1e-5 \
        model.tokenizer.vocab_file=gpt2-vocab.json \
        model.tokenizer.merge_file=gpt2-merges.txt \
        model.data.data_prefix=[1.0,hfbpe_gpt_training_data_text_document] \
        model.data.num_workers=2 \
        model.data.seq_length=1024 \
        model.data.splits_string=\'980,10,10\' \
        model.optim.name=fused_adam \
        model.optim.lr=6e-4 \
        model.optim.betas=[0.9,0.95] \
        model.optim.weight_decay=0.1 \
        model.optim.sched.name=CosineAnnealing \
        model.optim.sched.warmup_steps=750 \
        model.optim.sched.constant_steps=80000 \
        model.optim.sched.min_lr=6e-5 \
        exp_manager.resume_if_exists=True \
        exp_manager.resume_ignore_no_checkpoint=True \
        exp_manager.create_checkpoint_callback=True \
        exp_manager.checkpoint_callback_params.monitor=val_loss \
        exp_manager.checkpoint_callback_params.save_top_k=3 \
        exp_manager.checkpoint_callback_params.mode=min \
        exp_manager.checkpoint_callback_params.always_save_nemo=False

*Option 2:* Using the `Google SentencePiece <https://github.com/google/sentencepiece>`_ tokenizer library.

.. code-block:: bash

    python <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
        --config-path=<NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/conf \
        --config-name=megatron_gpt_config \
        trainer.devices=1 \
        trainer.num_nodes=1 \
        trainer.max_epochs=null \
        trainer.max_steps=300000 \
        trainer.val_check_interval=300 \
        trainer.log_every_n_steps=50 \
        trainer.limit_val_batches=50 \
        trainer.limit_test_batches=50 \
        trainer.accumulate_grad_batches=1 \
        trainer.precision=16 \
        model.micro_batch_size=6 \
        model.global_batch_size=192 \
        model.tensor_model_parallel_size=1 \
        model.pipeline_model_parallel_size=1 \
        model.max_position_embeddings=1024 \
        model.encoder_seq_length=1024 \
        model.hidden_size=768 \
        model.ffn_hidden_size=3072 \
        model.num_layers=12 \
        model.num_attention_heads=12 \
        model.init_method_std=0.021 \
        model.hidden_dropout=0.1 \
        model.layernorm_epsilon=1e-5 \
        model.tokenizer.library=sentencepiece \
        model.tokenizer.model=spm_32k_wiki.model \
        model.data.data_prefix=[1.0,gpt_training_data_text_document] \
        model.data.num_workers=2 \
        model.data.seq_length=1024 \
        model.data.splits_string=\'980,10,10\' \
        model.optim.name=fused_adam \
        model.optim.lr=6e-4 \
        model.optim.betas=[0.9,0.95] \
        model.optim.weight_decay=0.1 \
        model.optim.sched.name=CosineAnnealing \
        model.optim.sched.warmup_steps=750 \
        model.optim.sched.constant_steps=80000 \
        model.optim.sched.min_lr=6e-5 \
        exp_manager.resume_if_exists=True \
        exp_manager.resume_ignore_no_checkpoint=True \
        exp_manager.create_checkpoint_callback=True \
        exp_manager.checkpoint_callback_params.monitor=val_loss \
        exp_manager.checkpoint_callback_params.save_top_k=3 \
        exp_manager.checkpoint_callback_params.mode=min \
        exp_manager.checkpoint_callback_params.always_save_nemo=False

Next, simply launch TensorBoard to monitor training, like so:

.. code-block:: bash

    tensorboard --logdir nemo_experiments --bind_all

Next steps
~~~~~~~~~~

Please refer to:

* :ref:`batching` section for batch size adjustments
* :ref:`parallelisms` section for understanding various types of parallelisms
* :ref:`promptlearning` section for details on prompt-tuning and p-tuning

Binary file added docs/source/nlp/nemo_megatron/images/ddp.gif
Binary file added docs/source/nlp/nemo_megatron/images/pnom.gif
Binary file added docs/source/nlp/nemo_megatron/images/pp.gif
Binary file added docs/source/nlp/nemo_megatron/images/sp.gif
Binary file added docs/source/nlp/nemo_megatron/images/tp.gif
27 changes: 27 additions & 0 deletions docs/source/nlp/nemo_megatron/intro.rst
@@ -0,0 +1,27 @@
NeMo Megatron
=============

Megatron :cite:`nlp-megatron-shoeybi2019megatron` is a large, powerful transformer developed by the Applied Deep Learning Research
team at NVIDIA. NeMo Megatron supports several types of models:

* GPT-style models (decoder only)
* T5/BART/UL2-style models (encoder-decoder)
* BERT-style models (encoder only)



.. note::
    NeMo Megatron has an Enterprise edition which contains tools for data preprocessing, hyperparameter tuning, containers, scripts for various clouds, and more. With the Enterprise edition you also get deployment tools. Apply for `early access here <https://developer.nvidia.com/nemo-megatron-early-access>`_.


.. toctree::
:maxdepth: 1

mlm_migration
gpt/gpt_training
t5/t5_training
batching
parallelisms
prompt_learning


24 changes: 24 additions & 0 deletions docs/source/nlp/nemo_megatron/mlm_migration.rst
@@ -0,0 +1,24 @@
Migrating from Megatron-LM
--------------------------

NeMo Megatron and Megatron-LM share much of the same underlying technology. You should be able to convert your GPT model checkpoints trained with Megatron-LM into NeMo Megatron.
Example conversion script:

.. code-block:: bash

    <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
        --checkpoint_folder <path_to_PTL_checkpoints_folder> \
        --checkpoint_name megatron_gpt--val_loss=99.99-step={steps}-consumed_samples={consumed}.0 \
        --nemo_file_path <path_to_output_nemo_file> \
        --model_type <megatron model type> \
        --tensor_model_parallel_size <tensor_model_parallel_size> \
        --pipeline_model_parallel_size <pipeline_model_parallel_size> \
        --gpus_per_node <gpus per node>

To resume training from a converted Megatron-LM checkpoint, make sure to set
``trainer.max_steps=round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters)``,
where ``lr-warmup-fraction`` and ``lr-decay-iters`` are the arguments used for the Megatron-LM training run,
so that the learning rate scheduler will follow the same curve.
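
For example, with hypothetical Megatron-LM settings of ``lr-warmup-fraction=0.01`` and ``lr-decay-iters=300000``, you would set ``trainer.max_steps=303000``, i.e. round(0.01 * 300000 + 300000).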

44 changes: 44 additions & 0 deletions docs/source/nlp/nemo_megatron/parallelisms.rst
@@ -0,0 +1,44 @@
.. _parallelisms:

Parallelisms
------------

NeMo Megatron supports four types of parallelism, which can be mixed together arbitrarily:
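
The degree of each parallelism type is set in the model configuration. As a minimal sketch, a hypothetical 2-node, 16-GPU run might use the following overrides (the same parameters that appear in the GPT training example):

.. code-block:: bash

    # hypothetical 2-node x 8-GPU run; appended to the megatron_gpt_pretraining.py
    # command shown in the GPT training section
    trainer.num_nodes=2 \
    trainer.devices=8 \
    model.tensor_model_parallel_size=4 \
    model.pipeline_model_parallel_size=2
    # data parallelism covers the remaining factor: 16 / (4 * 2) = 2-way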

Distributed Data Parallelism
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: images/ddp.gif
:align: center
:alt: Distributed Data Parallel


Tensor Parallelism
^^^^^^^^^^^^^^^^^^

.. image:: images/tp.gif
:align: center
:alt: Tensor Parallel

Pipeline Parallelism
^^^^^^^^^^^^^^^^^^^^

.. image:: images/pp.gif
:align: center
:alt: Pipeline Parallel

Sequence Parallelism
^^^^^^^^^^^^^^^^^^^^

.. image:: images/sp.gif
:align: center
:alt: Sequence Parallel

Parallelism nomenclature
^^^^^^^^^^^^^^^^^^^^^^^^

When reading and modifying NeMo Megatron code you will encounter the following terms.

.. image:: images/pnom.gif
:align: center
:alt: Parallelism nomenclature
@@ -1,5 +1,7 @@
.. _promptlearning:

Prompt Learning
-------------
---------------

Within NeMo we refer to the **p-tuning** and **prompt tuning** methods collectively as prompt learning. Both methods are parameter-efficient alternatives to fine-tuning pretrained language models. Our NeMo implementation makes it possible to use one pretrained GPT model on many downstream tasks without needing to tune the model's full set of parameters. It also allows for adding new tasks to your model without overwriting or disrupting previous tasks for which the model has already been p-tuned/prompt-tuned. Because the original model parameters are frozen and never altered by either method, p-tuning/prompt-tuning also avoids the catastrophic forgetting issues often encountered when fine-tuning models.

4 changes: 2 additions & 2 deletions docs/source/starthere/intro.rst
@@ -150,8 +150,8 @@ If you chose to work with the ``main`` branch, we recommend using `NVIDIA's PyTo
stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:21.05-py3
FAQ
---
`FAQ <https://github.com/NVIDIA/NeMo/discussions>`_
---------------------------------------------------
Have a look at our `discussions board <https://github.com/NVIDIA/NeMo/discussions>`_ and feel free to post a question or start a discussion.

