Merge r1.7.1 to main (#3824)
* Tn bug 1.7.0 (#3730)

* fix es and fr bug

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* add file

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* [TTS] Fix bugs in E2E TTS, Mixer-TTS and FastPitch (#3740)

* fix bugs

Signed-off-by: Oktai Tatanov <oktai.tatanov@gmail.com>

* fix bug in e2e tts and mixer tts

Signed-off-by: Oktai Tatanov <oktai.tatanov@gmail.com>

* Mirror AN4 data while servers are down (#3743)

Signed-off-by: smajumdar <titu1994@gmail.com>

* Bugfix for GPT eval  (#3744)

* use tokens_cut not tokens

Signed-off-by: ericharper <complex451@gmail.com>

* remove precision conversion and comment jit for bias gelu

Signed-off-by: ericharper <complex451@gmail.com>

* revert comment update mbs in config

Signed-off-by: ericharper <complex451@gmail.com>

* calculate micro_batch_size during complete and compute_logprobs

Signed-off-by: ericharper <complex451@gmail.com>

* ASR SSL update (#3746)

* ssl update

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* tutorial update

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* Fix SSL configs for 1.7 (#3748)

* ssl update

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* tutorial update

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* revert configs

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* revert configs

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* punct process bug fix (#3747)

Signed-off-by: ekmb <ebakhturina@nvidia.com>

Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>

* updated conformer models. (#3741)

Signed-off-by: Vahid <vnoroozi@nvidia.com>

Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>

* Yuya/megatron t5 glue eval (#3751)

* Add megatron t5 glue eval-only script

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Update megatron t5 glue eval default configs

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Update megatron t5 glue eval configs

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Update config comments

Signed-off-by: Yu Yao <yuya@nvidia.com>

Co-authored-by: Yu Yao <yuya@nvidia.com>

* Specify gpus in SSL notebook (#3753)

* ssl update

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* tutorial update

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* revert configs

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* revert configs

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* specify gpus

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* Duplex model inference fix, money encoder fix (#3754)

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* Update docs for RNNT and overriding fused batch size (#3755)

Signed-off-by: smajumdar <titu1994@gmail.com>

* fix consumed samples calculation + PTune Model bugs (#3738)

* fix the way computing consumed samples

Signed-off-by: Yi Dong <yidong@nvidia.com>

* fixed ptune model

Signed-off-by: Yi Dong <yidong@nvidia.com>

* make sure notebook is working

Signed-off-by: Yi Dong <yidong@nvidia.com>

* added try-catch

Signed-off-by: Yi Dong <yidong@nvidia.com>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* fix directories in ssl notebook (#3758)

* ssl update

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* tutorial update

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* revert configs

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* revert configs

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* specify gpus

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* update dirs

Signed-off-by: sam1373 <samuelkriman@gmail.com>

* TN docs update (#3735)

* TN docs update: audio based docs added, quick start, ref fixed, etc

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* add deployment script dir and Sp TN

Signed-off-by: ekmb <ebakhturina@nvidia.com>

Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>

* Update Tacotron2_Training.ipynb (#3769)

Signed-off-by: Jason <jasoli@nvidia.com>

* fix dockerfile (#3778)

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* Prompt-Tuning-Documentation (#3777)

* Update megatron.rst

* Updated example prompt tuning script's doc string

* Update megatron.rst

* Update megatron.rst

Co-authored-by: Eric Harper <complex451@gmail.com>

* Prompt tuning bug fix (#3780)

* Making updated code backwards compatible with previous prompt tuned models

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Fixed backward compatibility bug

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Removed random import

Signed-off-by: Virginia Adams <vadams@nvidia.com>

Co-authored-by: Eric Harper <complex451@gmail.com>

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* revert changes (#3785)

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* Fixed soft prompt eval loading bug (#3805)

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* mT5 whole word masking and T5 finetuning config fixes (#3776)

* O2 and whole word masking changes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Update yaml

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Tok and O2 fix

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix arg passing

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix checkpoint path

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style fixes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Raise error if FP16 training is tried with O2 recipe. (#3806)

* raise error

Signed-off-by: ericharper <complex451@gmail.com>

* update assert

Signed-off-by: ericharper <complex451@gmail.com>

* update error message

Signed-off-by: ericharper <complex451@gmail.com>

* update error message

Signed-off-by: ericharper <complex451@gmail.com>

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* remove test

Signed-off-by: ericharper <complex451@gmail.com>

* revert bad merges

Signed-off-by: ericharper <complex451@gmail.com>

* revert change partitions

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: Oktai Tatanov <oktai.tatanov@gmail.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Samuel Kriman <samuelkriman@gmail.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Vahid Noroozi <VahidooX@users.noreply.github.com>
Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Yu Yao <yuya@nvidia.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
14 people authored and fayejf committed Mar 22, 2022
1 parent 50952f5 commit 4211f17
Showing 12 changed files with 88 additions and 73 deletions.
55 changes: 28 additions & 27 deletions Jenkinsfile
@@ -2181,33 +2181,34 @@ pipeline {
}


stage('L2: Megatron GPT Convert from Megatron-LM checkpoing and Eval') {
when {
anyOf {
branch 'main'
changeRequest target: 'main'
}
}
failFast true
steps {
sh "python -m torch.distributed.launch --nproc_per_node=2 \
examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
--checkpoint_folder=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700 \
--checkpoint_name=model_optim_rng.pt \
--hparams_file=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700/hparams.yaml \
--nemo_file_path=examples/nlp/language_modeling/small_gpt.nemo \
--model_type=gpt \
--pipeline_model_parallel_size=1 \
--gpus_per_node=2 \
--tensor_model_parallel_size=2"
sh "python examples/nlp/language_modeling/megatron_gpt_eval.py \
--model_file=examples/nlp/language_modeling/small_gpt.nemo \
--tokens_to_generate=32 \
--tensor_model_parallel_size=2 \
--prompt='This is a test.'"
sh "rm examples/nlp/language_modeling/small_gpt.nemo"
}
}
// TODO: Add this test back. Test was failing on CI machines due to HW error
// stage('L2: Megatron GPT Convert from Megatron-LM checkpoing and Eval') {
// when {
// anyOf {
// branch 'main'
// changeRequest target: 'main'
// }
// }
// failFast true
// steps {
// sh "python -m torch.distributed.launch --nproc_per_node=2 \
// examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
// --checkpoint_folder=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700 \
// --checkpoint_name=model_optim_rng.pt \
// --hparams_file=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700/hparams.yaml \
// --nemo_file_path=examples/nlp/language_modeling/small_gpt.nemo \
// --model_type=gpt \
// --pipeline_model_parallel_size=1 \
// --gpus_per_node=2 \
// --tensor_model_parallel_size=2"
// sh "python examples/nlp/language_modeling/megatron_gpt_eval.py \
// --model_file=examples/nlp/language_modeling/small_gpt.nemo \
// --tokens_to_generate=32 \
// --tensor_model_parallel_size=2 \
// --prompt='This is a test.'"
// sh "rm examples/nlp/language_modeling/small_gpt.nemo"
// }
// }
stage('L2: Megatron Change Partitions') {
when {
anyOf {
16 changes: 11 additions & 5 deletions docs/source/nlp/megatron.rst
@@ -173,18 +173,20 @@ Prompt tuning is a continuous or soft prompt approach to finding the optimal pro
Implementation Overview
^^^^^^^^^^

Our current prompt tuning implementation adapts Lester et al.’s EMNLP 2021 "`The Power of Scale for Parameter-Efficient Prompt Tuning <https://arxiv.org/abs/2104.08691>`_" to prompt tuning for GPT-style models. In this implementation, a number of soft tokens specified by the user are prepended to the beginning of the discrete token input embeddings during the forward pass. During training, all model parameters are frozen except for those corresponding to the soft tokens. Only the soft prompt parameters are updated via gradient descent in the backward pass. Each soft token has the same dimensionality as a regular token embedding from the model’s vocabulary, corresponding to the ``hidden_size`` hyperparameter. Soft token embeddings can be initialized randomly or with selected existing embeddings from the pretrained model.

As of NeMo 1.7, prompt tuning works with tensor model parallel size > 1.

Data Formatting
^^^^^^^^^^

The dataset should be a .json file where each json object has 2 fields: ``prompt_tag`` and ``text``.
The dataset should be a .jsonl file where each json object has 3 fields: ``prompt_tag``, ``text``, and ``answer``.

.. code::
{"prompt_tag": [tag1], "text": [text1]}
{"prompt_tag": [tag1], "text": [text2]}
{"prompt_tag": [tag1], "text": [text3]}
{"prompt_tag": [tag1], "text": [text1], "answer": [answer1]}
{"prompt_tag": [tag1], "text": [text2], "answer": [answer2]}
{"prompt_tag": [tag1], "text": [text3], "answer": [answer3]}
.. _data-example-label:
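
As a rough illustration of the ``prompt_tag``/``text``/``answer`` layout documented above, the short Python sketch below writes such a .jsonl file; the file name, tag, and records are hypothetical examples, not part of this commit.

# Minimal sketch of producing a prompt-tuning dataset in the .jsonl layout
# described above. File name, tag, and records are hypothetical examples.
import json

records = [
    {"prompt_tag": "winogrande", "text": "The trophy does not fit in the suitcase because it is too", "answer": "large"},
    {"prompt_tag": "winogrande", "text": "The key does not fit in the lock because it is too", "answer": "small"},
]

with open("winogrande_prompt_tuning_train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line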

Expand Down Expand Up @@ -218,6 +220,9 @@ Prompt Tuning Specific Config Values
* - **model.new_prompt_init_text**
- list of strings
- The text you want to use for soft prompt initialization if ``model.new_prompt_init_methods`` is set to ['text']. The text is tokenized and clipped or tiled to match ``model.num_prompt_tokens``. The vocab embeddings associated with each token are copied and used to initialize the soft prompts.
* - **model.calc_loss_on_answer_only**
- bool
- Whether to calculate cross entropy loss on the full text input or only the answer portion of the input during prompt tuning.
* - **model.data.train_ds**
- string
- path to training dataset .json or .jsonl file. See `Data Formatting`_ for an example
@@ -228,6 +233,7 @@ Prompt Tuning Specific Config Values

Example Prompt Tuning Command for the First Task
^^^^^^^^^^

.. code::
EXPR_NAME='winogrande_prompt_tuning'
Next changed file:
@@ -10,9 +10,9 @@ trainer:
enable_checkpointing: False
replace_sampler_ddp: False
max_epochs: null
max_steps: 1000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
max_steps: 3000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
log_every_n_steps: 10
val_check_interval: 50
val_check_interval: 250
limit_val_batches: 50
limit_test_batches: 500
accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models
@@ -43,7 +43,7 @@ model:
# specify micro_batch_size, global_batch_size, and model parallelism
# gradient accumulation will be done automatically based on data_parallel_size
micro_batch_size: 4 # limited by GPU memory
global_batch_size: 16 # will use more micro batches to reach global batch size
global_batch_size: 8 # will use more micro batches to reach global batch size
tensor_model_parallel_size: 1 # intra-layer model parallelism
pipeline_model_parallel_size: 1 # inter-layer model parallelism
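
The comment on ``max_steps`` above defines how consumed samples are tracked. As a quick sanity check of that formula, here is a minimal sketch; the ``data_parallel_size`` value is an assumed example, not something this config sets.

# consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches,
# per the config comment above. data_parallel_size below is an assumed example value.

def consumed_samples(global_step, micro_batch_size, data_parallel_size, accumulate_grad_batches):
    return global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches

# With micro_batch_size=4 and accumulate_grad_batches=1 from this config and a
# hypothetical data_parallel_size of 2, reaching max_steps=3000 consumes:
print(consumed_samples(3000, 4, 2, 1))  # 24000 samples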

@@ -117,7 +117,7 @@ model:

optim:
name: fused_adam
lr: 2e-4
lr: 1e-5
weight_decay: 0.01
betas:
- 0.9
@@ -126,4 +126,4 @@ model:
name: CosineAnnealing
warmup_steps: 50
constant_steps: 10
min_lr: 2e-5
min_lr: 1e-6
Next changed file:
@@ -27,11 +27,11 @@ exp_manager:
resume_ignore_no_checkpoint: True
create_checkpoint_callback: True
checkpoint_callback_params:
monitor: val_acc
monitor: validation_acc
save_top_k: 10
mode: max
always_save_nemo: False # TODO: add support
filename: 'megatron_t5--{val_acc:.3f}-{step}'
filename: 'megatron_t5--{validation_acc:.3f}-{step}'
model_parallel_size: ${model.tensor_model_parallel_size}
save_best_model: True

Next changed file:
@@ -27,11 +27,11 @@ exp_manager:
resume_ignore_no_checkpoint: True
create_checkpoint_callback: True
checkpoint_callback_params:
monitor: val_acc
monitor: validation_acc
save_top_k: 10
mode: max
always_save_nemo: False # TODO: add support
filename: 'megatron_t5--{val_acc:.3f}-{step}'
filename: 'megatron_t5--{validation_acc:.3f}-{step}'
model_parallel_size: ${model.tensor_model_parallel_size}
save_best_model: True
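
Both GLUE eval configs above switch the monitored metric from ``val_acc`` to ``validation_acc``. As a rough sketch (not NeMo's exact internals), these ``checkpoint_callback_params`` correspond to a PyTorch Lightning ``ModelCheckpoint`` along these lines:

from pytorch_lightning.callbacks import ModelCheckpoint

# Illustrative stand-alone equivalent of the checkpoint settings above;
# NeMo's exp_manager builds a comparable callback internally.
checkpoint_callback = ModelCheckpoint(
    monitor="validation_acc",   # must match a metric name the model actually logs
    save_top_k=10,
    mode="max",                 # higher accuracy is better
    filename="megatron_t5--{validation_acc:.3f}-{step}",
)
# If the callback monitors a metric the model never logs (e.g. the old val_acc name),
# top-k selection cannot work, which is presumably what the rename addresses.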

Next changed file:
@@ -162,18 +162,13 @@ def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id):
MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"])


def is_start_piece(piece, tokenizer_type='wordpiece'):
def is_start_piece(piece):
"""Check if the current word piece is the starting piece. (BERT)"""
# When a word has been split into
# WordPieces, the first token does not have any marker and any subsequence
# tokens are prefixed with ##. So whenever we see the ## token, we
# append it to the previous set of word indexes.
if tokenizer_type == 'wordpiece':
return not piece.startswith("##")
elif tokenizer_type == 'sentencepiece':
return piece.startswith('▁')
else:
raise ValueError(f"Tokenizer type {tokenizer_type} is not supported.")
return not piece.startswith("##")


def create_masked_lm_predictions(
@@ -217,15 +212,11 @@ def create_masked_lm_predictions(
# Note that Whole Word Masking does *not* change the training code
# at all -- we still predict each WordPiece independently, softmaxed
# over the entire vocabulary.
if (
whole_word_masking
and len(cand_indexes) >= 1
and not is_start_piece(vocab_id_to_token_dict[token], tokenizer_type=tokenizer_type)
):
if whole_word_masking and len(cand_indexes) >= 1 and not is_start_piece(vocab_id_to_token_dict[token]):
cand_indexes[-1].append(i)
else:
cand_indexes.append([i])
if is_start_piece(vocab_id_to_token_dict[token], tokenizer_type=tokenizer_type):
if is_start_piece(vocab_id_to_token_dict[token]):
token_boundary[i] = 1

output_tokens = list(tokens)
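
The hunks above drop the ``tokenizer_type`` argument so that ``is_start_piece`` only handles WordPiece markers. A small self-contained sketch (not the NeMo code itself) of how the ``##`` convention drives whole word masking:

# Standalone sketch of WordPiece whole-word grouping: a token starting with "##"
# continues the previous word, so it joins the previous candidate group.

def is_start_piece(piece):
    return not piece.startswith("##")

def group_whole_words(tokens):
    cand_indexes = []
    for i, token in enumerate(tokens):
        if cand_indexes and not is_start_piece(token):
            cand_indexes[-1].append(i)   # continuation piece extends the current word
        else:
            cand_indexes.append([i])     # start of a new word
    return cand_indexes

print(group_whole_words(["un", "##afford", "##able", "prices"]))
# [[0, 1, 2], [3]] -> masking a group masks the whole word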
Next changed file:
@@ -91,6 +91,10 @@ def __init__(
if not self.tokenizer.legacy:
raise ValueError("Sentencepiece Tokenizer must have legacy = False to add special tokens.")
self.tokenizer_type = 'sentencepiece'
if whole_word_masking:
raise ValueError(
"Whole word masking is not supported with sentencepiece tokenizers and only with wordpiece tokenizers. Please set it to False."
)

self.cls_id = tokenizer.cls_id
self.sep_id = tokenizer.sep_id
Next changed file:
@@ -139,10 +139,6 @@ def __init__(self, cfg: DictConfig, trainer: Trainer):
tensor_model_parallel_size=cfg.get('tensor_model_parallel_size', 1),
)

# TODO: Not sure how to use lists of modules with PTL.
# This means we can only use pipeline parallelism without the interleaved schedule.
self.model = build_model(model_provider_func=self.model_provider_func, wrap_with_ddp=False)[0]

# Prompt tuning initialization
self.use_soft_prompts = self.cfg.get('use_soft_prompts', False)

@@ -156,12 +152,27 @@ def __init__(self, cfg: DictConfig, trainer: Trainer):
self.num_prompt_tokens = cfg.get('num_prompt_tokens', 100)

if self.cfg.get('existing_prompt_tags', None):
# Assign prompt tag ids if none were present in the config
if type(self.cfg.existing_prompt_tags[0]) == str:
existing_prompt_tags = self.cfg.existing_prompt_tags
num_prompt_tags = len(existing_prompt_tags)
existing_prompt_tags = [
(existing_prompt_tags[tag_id], tag_id + 1) for tag_id in range(num_prompt_tags)
]

with open_dict(self.cfg):
self.cfg.existing_prompt_tags = existing_prompt_tags

# Fill table with prev tuned prompt tags and their ids
self.prompt_table = set(self.cfg.existing_prompt_tags)

# Get max prompt id from table for starting point of new prompt ids
self.next_prompt_id = max(self.prompt_table, key=lambda x: x[1])[1]

# TODO: Not sure how to use lists of modules with PTL.
# This means we can only use pipeline parallelism without the interleaved schedule.
self.model = build_model(model_provider_func=self.model_provider_func, wrap_with_ddp=False)[0]

self.setup_optimizer_param_groups()

self.megatron_amp_o2 = cfg.get('megatron_amp_O2', False)
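
A condensed sketch of the backward-compatibility conversion above: legacy configs stored ``existing_prompt_tags`` as plain strings, which are promoted to (tag, id) pairs so the prompt table and the next prompt id can be derived. The tag names below are hypothetical.

# Hypothetical legacy config value: plain strings instead of (tag, id) pairs.
existing_prompt_tags = ["task-a", "task-b"]

if isinstance(existing_prompt_tags[0], str):
    existing_prompt_tags = [(tag, tag_id + 1) for tag_id, tag in enumerate(existing_prompt_tags)]

prompt_table = set(existing_prompt_tags)                   # {("task-a", 1), ("task-b", 2)}
next_prompt_id = max(prompt_table, key=lambda x: x[1])[1]  # 2 -- new prompt ids continue from here
print(prompt_table, next_prompt_id)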
@@ -662,13 +673,13 @@ def setup(self, stage=None):
init_consumed_samples = 0
self.init_consumed_samples = init_consumed_samples

# Initalize soft prompts before loading datasets and training
if self.use_soft_prompts:
self.init_new_prompts()

if stage == 'predict':
return
else:
# Initalize soft prompts before loading datasets and training
if self.use_soft_prompts:
self.init_new_prompts()

# TODO: consider adding a ModelPT guard to check if model is being restored.
# allowing restored models to optionally setup datasets
self.build_train_valid_test_datasets()
@@ -737,6 +748,9 @@ def configure_optimizers(self):
fp32_grad_accum = False
# TODO: contiguous grad bucket for fp16 is also planned to be supported
contiguous_grad_bucket = False
raise ValueError(
"fp16 training is not yet supported with O2. Please set megatron_amp_O2 to False in the model config."
)

# TODO: this should be true when not using pipeline parallelism
# we will support that for bf16 when we have async handler from apex
Next changed file:
@@ -325,7 +325,7 @@ def build_pretraining_data_loader(self, dataset, consumed_samples):
)

def setup(self, stage=None):
resume_checkpoint_path = self.trainer.checkpoint_connector.resume_checkpoint_path
resume_checkpoint_path = self.trainer.checkpoint_connector.resume_from_checkpoint_fit_path
if resume_checkpoint_path:
try:
init_consumed_samples = int(
Next changed file:
@@ -125,12 +125,12 @@ def build_train_valid_test_datasets(self):
seed=self._cfg.seed,
skip_warmup=self._cfg.data.skip_warmup,
dataset_type=self._cfg.data.get('dataset_type', 't5'),
max_ngram_size=self._cfg.get('max_ngram_size', 10),
mean_ngram_size=self._cfg.get('mean_ngram_size', None),
geometric_dist=self._cfg.get('geometric_dist', True),
permutation=self._cfg.get('permutation', False),
whole_word_masking=self._cfg.get('whole_word_masking', True),
favor_long_ngrams=self._cfg.get('favor_long_ngrams', False),
max_ngram_size=self._cfg.data.get('max_ngram_size', 10),
mean_ngram_size=self._cfg.data.get('mean_ngram_size', None),
geometric_dist=self._cfg.data.get('geometric_dist', True),
permutation=self._cfg.data.get('permutation', False),
whole_word_masking=self._cfg.data.get('whole_word_masking', True),
favor_long_ngrams=self._cfg.data.get('favor_long_ngrams', False),
)
logging.info(f'Length of train dataset: {len(self._train_ds)}')
logging.info(f'Length of val dataset: {len(self._validation_ds)}')
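
The hunk above is the config-lookup fix: the masking options live under ``model.data``, so reading them via ``self._cfg.get`` silently returned the defaults instead of the user's values. A minimal OmegaConf sketch of the difference (config contents here are illustrative):

from omegaconf import OmegaConf

cfg = OmegaConf.create({"data": {"whole_word_masking": False, "max_ngram_size": 3}})

print(cfg.get("whole_word_masking", True))       # True  -- wrong level, falls back to the default
print(cfg.data.get("whole_word_masking", True))  # False -- reads the value actually set under data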
2 changes: 1 addition & 1 deletion nemo/core/optim/optimizer_with_main_params.py
@@ -126,7 +126,7 @@ def __init__(
'which is supposed to be accumulated after grad op.'
)
assert contiguous_grad_bucket, (
'currently async_grad_allreduce is supported only ' 'with async_grad_allreduce.'
'currently async_grad_allreduce is supported only ' 'with contiguous_grad_bucket.'
)

self._fp32_grad_accum = fp32_grad_accum
5 changes: 2 additions & 3 deletions tools/text_processing_deployment/Dockerfile
@@ -31,9 +31,8 @@ RUN apt-get install build-essential -y && apt-get install wget -y
RUN wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
RUN tar xzvf protobuf-2.5.0.tar.gz
RUN cd protobuf-2.5.0 && ./configure && make && make install && ldconfig
COPY ../../nemo_text_processing/ /tmp/nemo/nemo_text_processing/
RUN bash /tmp/nemo/nemo_text_processing/setup.sh
RUN conda install -c conda-forge thrax=1.3.4 -y
RUN git clone https://github.com/yzhang123/sparrowhawk.git
RUN cd sparrowhawk && git checkout test && apt-get install -y autoconf && bash autoreconf && ./configure && make && make install && ldconfig
RUN git clone https://github.com/kward/shunit2.git
RUN echo "DONE"
RUN echo "DONE"
