r1.10.0 MegaMolBART Compatibility (#4603)
* 1. Added vocab_size property to RegExTokenizer.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed passing hiddens directly.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added support for encoder outputs.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Added comments.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Added automatic mapping of kwargs to args in forward.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Added encode function.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. PP and TP work (but not together).

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Separated get_forward_output_only_func_encode and get_forward_output_only_func_decode.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* Set headscale false (#4364)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Add wandb as dependency (#4365)

Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Raise trainer error (#4356)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>

* Set headscale false (#4364) (#4366)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Finetuning changes for BART (#4003)

* Temp

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Checkpoint converter to nemo for bart

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>

* Make position embedding expansion specific to a batch to avoid checkpoint size mismatches (#4357)

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix logging warning

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>

* 1. Added return logits to validation.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed unknown token during sampling.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed RegExTokenizer loading.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed ckpt file with samples int(0).

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed regex tokenizer.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed allowing enc_tokens to be None.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added ability to ignore tokens by id during decode.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed regex tokenizer .nemo loading issue.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed RegEx test.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* r1.10.0 untie embeddings weights (#4519)

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added independent decoder embeddings, and independent decoder token_head.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added support in yaml config.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed initialization.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added tests for untied embeddings and decoder token head.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Updated share_word_embeddings to share_token_embeddings.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed error in __del__ when TextMemMapDataset fails to build.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed comments.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Made method private.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed config names.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed alerts and style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed PP and TP; PP+TP together still fails.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

Co-authored-by: Micha Livne <mlivne@nvidia.com>
Co-authored-by: ericharper <complex451@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
5 people authored Jul 29, 2022
1 parent 72d78d8 commit 59d635c
Showing 18 changed files with 479 additions and 118 deletions.
8 changes: 6 additions & 2 deletions Jenkinsfile
@@ -3066,8 +3066,10 @@ pipeline {
model.transformer_block_type='pre_ln' \
model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document] \
model.position_embedding_type=relative \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \
model.data.respect_document_boundaries=False \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings"
model.share_token_embeddings=False \
model.share_decoder_tokens_head_embeddings=False"
sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \
trainer.devices=2 \
trainer.accelerator=gpu \
@@ -3092,8 +3094,10 @@ pipeline {
model.transformer_block_type='pre_ln' \
model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document] \
model.position_embedding_type=relative \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \
model.data.respect_document_boundaries=False \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings"
model.share_token_embeddings=False \
model.share_decoder_tokens_head_embeddings=False"
sh "rm -rf examples/nlp/language_modeling/t5_pretrain_results"
sh "rm -rf examples/nlp/language_modeling/t5_index_mappings"
}
2 changes: 2 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_bart_config.yaml
@@ -83,6 +83,8 @@ model:
transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

tokenizer:
library: 'megatron'
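The same pair of flags is added to the T5, UL2, and NMT configs below, defaulting to True so the existing tied-embedding behaviour is unchanged. As a rough illustration (not part of this commit), the flags can be inspected and overridden when loading a config; the OmegaConf usage below is an assumption, only the two option names come from this diff.

from omegaconf import OmegaConf

# Path taken from this PR's config files; adjust for your checkout.
cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_bart_config.yaml")

# Defaults added by this commit keep the previous tied-embedding behaviour.
print(cfg.model.share_token_embeddings)                # True
print(cfg.model.share_decoder_tokens_head_embeddings)  # True

# Untie encoder/decoder token embeddings and give the decoder its own output head.
cfg.model.share_token_embeddings = False
cfg.model.share_decoder_tokens_head_embeddings = False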
2 changes: 2 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_t5_config.yaml
@@ -86,6 +86,8 @@ model:
transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

tokenizer:
library: 'megatron'
2 changes: 2 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_ul2_config.yaml
@@ -82,6 +82,8 @@ model:
transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

tokenizer:
library: 'megatron'
2 changes: 2 additions & 0 deletions examples/nlp/machine_translation/conf/aayn_base_megatron.yaml
@@ -93,6 +93,8 @@ model:
transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
112 changes: 74 additions & 38 deletions nemo/collections/common/tokenizers/regex_tokenizer.py
@@ -68,8 +68,9 @@ def __init__(
self.sep_token = sep_token
self.unk_token = unk_token

# holds base name of .model/.vocab files
self.base_fname = None
# holds names of .model/.vocab files
self.regex_file = None
self.vocab_file = None

# initialize with default vocab
self.vocab = {
@@ -96,12 +97,12 @@ def _compile_regex(self):
regex_string += r".)"
self._compiled_regex = re.compile(regex_string)

@property
def vocab_size(self):
return len(self.vocab)

def text_to_tokens(self, text):
# Begin token
tokens = [self.bos_token]
tokens.extend(self._compiled_regex.findall(text))
# End token
tokens.append(self.eos_token)
tokens = self._compiled_regex.findall(text)

return tokens

@@ -137,18 +138,28 @@ def tokens_to_ids(self, token_data):
ids_list.append(ids)
return ids_list

def ids_to_tokens(self, ids):
def ids_to_tokens(self, ids_list):
if len(ids_list) and not isinstance(ids_list[0], list):
ids_list = [ids_list]
added_list = True
else:
added_list = False

tokens_list = []
for ids in ids:
for ids in ids_list:
tokens = []
for token_id in ids:
token = self._decode_vocab.get(token_id)
if token is None:
raise ValueError(f"Token id {token_id} is not recognised")
tokens.append(token)

tokens = [self._decode_vocab.get(token_id) for token_id in ids]
tokens_list.append(tokens)

return tokens_list
if added_list:
return tokens_list[0]
else:
return tokens_list

def text_to_ids(self, text):
tokens = self.text_to_tokens(text)
@@ -159,51 +170,73 @@ def ids_to_text(self, ids):
tokens = self.ids_to_tokens(ids)
return self.tokens_to_text(tokens)

def save_tokenizer(self, base_fname=None):
@property
def pad_id(self):
return 0

@property
def unk_id(self):
return 1

@property
def bos_id(self):
return 2

@property
def eos_id(self):
return 3

@property
def mask_id(self):
return 4

@property
def sep_id(self):
return 5

def _get_regex_vocab_files(self, regex_file=None, vocab_file=None):
"""
Saves tokenizer's regex (base_fname.model) and vocab (base_fname.vocab) files
Infers file paths, or updates them if given.
"""
if base_fname.endswith(".model"):
base_fname = os.path.splitext(base_fname)[0]
regex_file = regex_file or self.regex_file
if not regex_file:
raise ValueError(f"regex_file must be specified")

if base_fname:
self.base_fname = base_fname
vocab_file = vocab_file or self.vocab_file
# try to infer vocab_file from regex_file
if not vocab_file:
vocab_file = os.path.splitext(regex_file)[0] + '.vocab'

if not self.base_fname:
raise ValueError(f"base_fname must be specified")
self.regex_file = regex_file
self.vocab_file = vocab_file

vocab_file = self.base_fname + '.vocab'
regex_file = self.base_fname + '.model'
return regex_file, vocab_file

logging.debug(f"Saving vocabulary to file = {vocab_file}")
def save_tokenizer(self, regex_file=None, vocab_file=None):
"""
Saves tokenizer's regex and vocab files
"""
regex_file, vocab_file = self._get_regex_vocab_files(regex_file=regex_file, vocab_file=vocab_file)

logging.info(f"Saving vocabulary to file = {vocab_file}")
with open(vocab_file, 'w') as fp:
for token in self.vocab:
fp.write(f"{token[0]}\n")

logging.debug(f"Saving regex to file = {regex_file}")
logging.info(f"Saving regex to file = {regex_file}")
open(regex_file, 'w').write(self.regex)

def load_tokenizer(self, base_fname):
def load_tokenizer(self, regex_file=None, vocab_file=None):
"""
Loads tokenizer's regex (base_fname.model) and vocab (base_fname.vocab) files
Loads tokenizer's regex and vocab files
"""
if base_fname.endswith(".model"):
base_fname = os.path.splitext(base_fname)[0]

if base_fname:
self.base_fname = base_fname

if not self.base_fname:
raise ValueError(f"base_fname must be specified")

vocab_file = self.base_fname + '.vocab'
regex_file = self.base_fname + '.model'
regex_file, vocab_file = self._get_regex_vocab_files(regex_file=regex_file, vocab_file=vocab_file)

# load vocab file
# vocab_file: path to file with vocabulary which consists
# of characters separated by \n (None/"" for empty vocab)

logging.debug(f"Loading vocabulary from file = {vocab_file}")
logging.info(f"Loading vocabulary from file = {vocab_file}")
if os.path.exists(vocab_file):
vocab = {}
with open(vocab_file, "r") as f:
Expand All @@ -217,11 +250,14 @@ def load_tokenizer(self, base_fname):

# load regex from a file
if os.path.exists(regex_file):
logging.debug(f"Loading regex from file = {regex_file}")
logging.info(f"Loading regex from file = {regex_file}")
self.regex = open(regex_file, encoding="utf-8").read().strip()
else:
raise RuntimeError(f"Missing regex_file = {regex_file}")

self._update_cache()
self._compile_regex()

return self

def build_vocab_from_csv(self, data_csv_file, col="smiles"):
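Taken together, the tokenizer changes replace the single base_fname argument with explicit regex_file/vocab_file paths, add a vocab_size property and fixed special-token id properties, stop text_to_tokens from inserting BOS/EOS, and let ids_to_tokens accept either a flat id list or a list of id lists. A minimal usage sketch under those assumptions (file path and SMILES string are placeholders; the constructor defaults are not shown in this diff):

from nemo.collections.common.tokenizers.regex_tokenizer import RegExTokenizer

tokenizer = RegExTokenizer()  # assumes a default-constructible tokenizer
# Explicit file arguments; the vocab file is inferred from the regex file name when omitted.
tokenizer.load_tokenizer(regex_file="/path/to/molecule_tokenizer.model")

print(tokenizer.vocab_size)              # new property: size of the loaded vocab
print(tokenizer.pad_id, tokenizer.unk_id, tokenizer.bos_id, tokenizer.eos_id)

ids = tokenizer.text_to_ids("CC(=O)O")   # text_to_tokens no longer adds BOS/EOS
print(tokenizer.ids_to_tokens(ids))      # flat list in -> flat list out
print(tokenizer.ids_to_tokens([ids, ids]))  # a list of lists is also accepted
print(tokenizer.ids_to_text(ids))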
@@ -74,9 +74,9 @@ def __init__(
"""
# Sanity checks.
if total_samples <= 0:
raise RuntimeError("no sample to consume: {}".format(self.total_samples))
raise RuntimeError("no sample to consume: {}".format(total_samples))
if consumed_samples >= total_samples:
raise RuntimeError("no samples left to consume: {}, {}".format(self.consumed_samples, self.total_samples))
raise RuntimeError("no samples left to consume: {}, {}".format(consumed_samples, total_samples))
if micro_batch_size <= 0:
raise RuntimeError(f"micro_batch_size size must be greater than 0, but {micro_batch_size}")
if data_parallel_size <= 0:
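For context, the two RuntimeError fixes above swap self.total_samples/self.consumed_samples for the constructor arguments: these sanity checks run before the corresponding attributes are assigned in __init__, so formatting the old messages would itself fail with an AttributeError that masked the intended error.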
@@ -40,6 +40,7 @@ def __init__(
self, dataset_paths, newline_int=10, header_lines=0, workers=None, tokenizer=None, sort_dataset_paths=True,
):
super().__init__()
self.mdata_midx_list = []

if len(dataset_paths) < 1:
raise ValueError("files_list must contain at leat one file name")
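The TextMemMapDataset change initialises mdata_midx_list before any validation can raise, so __del__ no longer fails on a partially built dataset. A generic sketch of the pattern (class and attribute names are illustrative, not the NeMo implementation):

# Initialise attributes referenced by __del__ before any check that can raise,
# so cleanup is safe even when __init__ exits part-way through.
class MemMapHolder:
    def __init__(self, paths):
        self._handles = []  # exists even if the next check fails
        if not paths:
            raise ValueError("paths must contain at least one file name")
        self._handles = [open(p, "rb") for p in paths]

    def __del__(self):
        for h in self._handles:  # never an AttributeError on failed init
            h.close()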
@@ -352,7 +352,7 @@ def allreduce_first_last_embeddings(self):
if parallel_state.get_pipeline_model_parallel_world_size() > 1 and (
parallel_state.is_pipeline_first_stage() or parallel_state.is_pipeline_last_stage()
):
if self.model.share_word_embeddings:
if self.model.share_token_embeddings:
word_embeddings_weight = self.model.word_embeddings_weight()
if self.megatron_amp_o2:
# O2 recipe stores a "main" copy of weights and grads
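The last hunk tracks the rename from share_word_embeddings to share_token_embeddings in the pipeline-parallel embedding-gradient all-reduce. A hedged sketch of that pattern, gated on the renamed flag (the standalone function signature, gradient attribute, and process-group argument below are assumptions, not this file's actual code):

import torch
from apex.transformer import parallel_state  # assumed, consistent with the snippet above

def allreduce_first_last_embeddings(model, embedding_group, megatron_amp_o2=False):
    # Only the first and last pipeline stages hold (tied) token embeddings.
    if parallel_state.get_pipeline_model_parallel_world_size() > 1 and (
        parallel_state.is_pipeline_first_stage() or parallel_state.is_pipeline_last_stage()
    ):
        if model.share_token_embeddings:
            weight = model.word_embeddings_weight()
            # The O2 recipe stores a "main" copy of weights and grads; otherwise use .grad.
            grad = weight.main_grad if megatron_amp_o2 else weight.grad
            torch.distributed.all_reduce(grad, group=embedding_group)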