r1.10.0 MegaMolBART Compatibility (#4603)
* 1. Added vocab_size property to RegExTokenizer.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed passing hiddens directly.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added support for encoder outputs.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Added comments.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Added automatic mapping of kwargs to args in forward.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Added encode function.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@nvidia.com>

* 1. PP and TP work (but not together).

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Separated get_forward_output_only_func_encode and get_forward_output_only_func_decode.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* Set headscale false (#4364)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Add wandb as dependency (#4365)

Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Raise trainer error (#4356)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>

* Set headscale false (#4364) (#4366)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Finetuning changes for BART (#4003)

* Temp

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Checkpoint converter to nemo for bart

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>

* Make position embedding expansion specific to a batch to avoid checkpoint size mismatches (#4357)

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix logging warning

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>

* 1. Added return logits to validation.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed unknown token during sampling.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed RegExTokenizer loading.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed ckpt file with samples int(0).

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed regex tokenizer.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed allowing enc_tokens to be None.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added ability to ignore tokens by id during decode.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed regex tokenizer .nemo loading issue.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed RegEx test.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* r1.10.0 untie embeddings weights (#4519)

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added independent decoder embeddings, and independent decoder token_head.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added support in yaml config.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed initialization.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Added tests for untied embeddings and decoder token head.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Updated share_word_embeddings to share_token_embeddings.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed error in __del__ when TextMemMapDataset fails to build.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed comments.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Made method private.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed config names.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed alerts and style.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Fixed PP and TP; PP+TP together still fails.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

* 1. Debugging.

Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>

Co-authored-by: Micha Livne <mlivne@nvidia.com>
Co-authored-by: ericharper <complex451@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
5 people authored Jul 29, 2022
1 parent 72d78d8 commit 59d635c
Showing 18 changed files with 479 additions and 118 deletions.
8 changes: 6 additions & 2 deletions Jenkinsfile
@@ -3066,8 +3066,10 @@ pipeline {
model.transformer_block_type='pre_ln' \
model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document] \
model.position_embedding_type=relative \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \
model.data.respect_document_boundaries=False \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings"
model.share_token_embeddings=False \
model.share_decoder_tokens_head_embeddings=False"
sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \
trainer.devices=2 \
trainer.accelerator=gpu \
@@ -3092,8 +3094,10 @@ pipeline {
model.transformer_block_type='pre_ln' \
model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document] \
model.position_embedding_type=relative \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \
model.data.respect_document_boundaries=False \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings"
model.share_token_embeddings=False \
model.share_decoder_tokens_head_embeddings=False"
sh "rm -rf examples/nlp/language_modeling/t5_pretrain_results"
sh "rm -rf examples/nlp/language_modeling/t5_index_mappings"
}
2 changes: 2 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_bart_config.yaml
@@ -83,6 +83,8 @@ model:
transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

tokenizer:
library: 'megatron'
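The same pair of flags is added to the T5, UL2, and NMT configs below, defaulting to True so the existing tied-embedding behaviour is unchanged. As a rough illustration (not part of this commit), the flags can be inspected and overridden when loading a config; the OmegaConf usage below is an assumption, only the two option names come from this diff.

from omegaconf import OmegaConf

# Path taken from this PR's config files; adjust for your checkout.
cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_bart_config.yaml")

# Defaults added by this commit keep the previous tied-embedding behaviour.
print(cfg.model.share_token_embeddings)                # True
print(cfg.model.share_decoder_tokens_head_embeddings)  # True

# Untie encoder/decoder token embeddings and give the decoder its own output head.
cfg.model.share_token_embeddings = False
cfg.model.share_decoder_tokens_head_embeddings = False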
2 changes: 2 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_t5_config.yaml
@@ -86,6 +86,8 @@ model:
transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

tokenizer:
library: 'megatron'
2 changes: 2 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_ul2_config.yaml
@@ -82,6 +82,8 @@ model:
transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

tokenizer:
library: 'megatron'
2 changes: 2 additions & 0 deletions examples/nlp/machine_translation/conf/aayn_base_megatron.yaml
@@ -93,6 +93,8 @@ model:
transformer_block_type: 'pre_ln' # Options ['pre_ln', 'post_ln', 'normformer']
hidden_steps: 32 # Number of latent vectors to use for perceiver encoders
num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
share_token_embeddings: True # If True share encoder/decoder embeddings
share_decoder_tokens_head_embeddings: True # If True share decoder embeddings and decoder projection to logits

# precision
native_amp_init_scale: 4294967296 # 2 ** 32
112 changes: 74 additions & 38 deletions nemo/collections/common/tokenizers/regex_tokenizer.py
@@ -68,8 +68,9 @@ def __init__(
self.sep_token = sep_token
self.unk_token = unk_token

# holds base name of .model/.vocab files
self.base_fname = None
# holds names of .model/.vocab files
self.regex_file = None
self.vocab_file = None

# initialize with default vocab
self.vocab = {
@@ -96,12 +97,12 @@ def _compile_regex(self):
regex_string += r".)"
self._compiled_regex = re.compile(regex_string)

@property
def vocab_size(self):
return len(self.vocab)

def text_to_tokens(self, text):
# Begin token
tokens = [self.bos_token]
tokens.extend(self._compiled_regex.findall(text))
# End token
tokens.append(self.eos_token)
tokens = self._compiled_regex.findall(text)

return tokens

@@ -137,18 +138,28 @@ def tokens_to_ids(self, token_data):
ids_list.append(ids)
return ids_list

def ids_to_tokens(self, ids):
def ids_to_tokens(self, ids_list):
if len(ids_list) and not isinstance(ids_list[0], list):
ids_list = [ids_list]
added_list = True
else:
added_list = False

tokens_list = []
for ids in ids:
for ids in ids_list:
tokens = []
for token_id in ids:
token = self._decode_vocab.get(token_id)
if token is None:
raise ValueError(f"Token id {token_id} is not recognised")
tokens.append(token)

tokens = [self._decode_vocab.get(token_id) for token_id in ids]
tokens_list.append(tokens)

return tokens_list
if added_list:
return tokens_list[0]
else:
return tokens_list

def text_to_ids(self, text):
tokens = self.text_to_tokens(text)
@@ -159,51 +170,73 @@ def ids_to_text(self, ids):
tokens = self.ids_to_tokens(ids)
return self.tokens_to_text(tokens)

def save_tokenizer(self, base_fname=None):
@property
def pad_id(self):
return 0

@property
def unk_id(self):
return 1

@property
def bos_id(self):
return 2

@property
def eos_id(self):
return 3

@property
def mask_id(self):
return 4

@property
def sep_id(self):
return 5

def _get_regex_vocab_files(self, regex_file=None, vocab_file=None):
"""
Saves tokenizer's regex (base_fname.model) and vocab (base_fname.vocab) files
Infers file paths, or updates them if given.
"""
if base_fname.endswith(".model"):
base_fname = os.path.splitext(base_fname)[0]
regex_file = regex_file or self.regex_file
if not regex_file:
raise ValueError(f"regex_file must be specified")

if base_fname:
self.base_fname = base_fname
vocab_file = vocab_file or self.vocab_file
# try to infer vocab_file from regex_file
if not vocab_file:
vocab_file = os.path.splitext(regex_file)[0] + '.vocab'

if not self.base_fname:
raise ValueError(f"base_fname must be specified")
self.regex_file = regex_file
self.vocab_file = vocab_file

vocab_file = self.base_fname + '.vocab'
regex_file = self.base_fname + '.model'
return regex_file, vocab_file

logging.debug(f"Saving vocabulary to file = {vocab_file}")
def save_tokenizer(self, regex_file=None, vocab_file=None):
"""
Saves tokenizer's regex and vocab files
"""
regex_file, vocab_file = self._get_regex_vocab_files(regex_file=regex_file, vocab_file=vocab_file)

logging.info(f"Saving vocabulary to file = {vocab_file}")
with open(vocab_file, 'w') as fp:
for token in self.vocab:
fp.write(f"{token[0]}\n")

logging.debug(f"Saving regex to file = {regex_file}")
logging.info(f"Saving regex to file = {regex_file}")
open(regex_file, 'w').write(self.regex)

def load_tokenizer(self, base_fname):
def load_tokenizer(self, regex_file=None, vocab_file=None):
"""
Loads tokenizer's regex (base_fname.model) and vocab (base_fname.vocab) files
Loads tokenizer's regex and vocab files
"""
if base_fname.endswith(".model"):
base_fname = os.path.splitext(base_fname)[0]

if base_fname:
self.base_fname = base_fname

if not self.base_fname:
raise ValueError(f"base_fname must be specified")

vocab_file = self.base_fname + '.vocab'
regex_file = self.base_fname + '.model'
regex_file, vocab_file = self._get_regex_vocab_files(regex_file=regex_file, vocab_file=vocab_file)

# load vocab file
# vocab_file: path to file with vocabulary which consists
# of characters separated by \n (None/"" for empty vocab)

logging.debug(f"Loading vocabulary from file = {vocab_file}")
logging.info(f"Loading vocabulary from file = {vocab_file}")
if os.path.exists(vocab_file):
vocab = {}
with open(vocab_file, "r") as f:
Expand All @@ -217,11 +250,14 @@ def load_tokenizer(self, base_fname):

# load regex from a file
if os.path.exists(regex_file):
logging.debug(f"Loading regex from file = {regex_file}")
logging.info(f"Loading regex from file = {regex_file}")
self.regex = open(regex_file, encoding="utf-8").read().strip()
else:
raise RuntimeError(f"Missing regex_file = {regex_file}")

self._update_cache()
self._compile_regex()

return self

def build_vocab_from_csv(self, data_csv_file, col="smiles"):
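Taken together, the tokenizer changes replace the single base_fname argument with explicit regex_file/vocab_file paths, add a vocab_size property and fixed special-token id properties, stop text_to_tokens from inserting BOS/EOS, and let ids_to_tokens accept either a flat id list or a list of id lists. A minimal usage sketch under those assumptions (file path and SMILES string are placeholders; the constructor defaults are not shown in this diff):

from nemo.collections.common.tokenizers.regex_tokenizer import RegExTokenizer

tokenizer = RegExTokenizer()  # assumes a default-constructible tokenizer
# Explicit file arguments; the vocab file is inferred from the regex file name when omitted.
tokenizer.load_tokenizer(regex_file="/path/to/molecule_tokenizer.model")

print(tokenizer.vocab_size)              # new property: size of the loaded vocab
print(tokenizer.pad_id, tokenizer.unk_id, tokenizer.bos_id, tokenizer.eos_id)

ids = tokenizer.text_to_ids("CC(=O)O")   # text_to_tokens no longer adds BOS/EOS
print(tokenizer.ids_to_tokens(ids))      # flat list in -> flat list out
print(tokenizer.ids_to_tokens([ids, ids]))  # a list of lists is also accepted
print(tokenizer.ids_to_text(ids))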
@@ -74,9 +74,9 @@ def __init__(
"""
# Sanity checks.
if total_samples <= 0:
raise RuntimeError("no sample to consume: {}".format(self.total_samples))
raise RuntimeError("no sample to consume: {}".format(total_samples))
if consumed_samples >= total_samples:
raise RuntimeError("no samples left to consume: {}, {}".format(self.consumed_samples, self.total_samples))
raise RuntimeError("no samples left to consume: {}, {}".format(consumed_samples, total_samples))
if micro_batch_size <= 0:
raise RuntimeError(f"micro_batch_size size must be greater than 0, but {micro_batch_size}")
if data_parallel_size <= 0:
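For context, the two RuntimeError fixes above swap self.total_samples/self.consumed_samples for the constructor arguments: these sanity checks run before the corresponding attributes are assigned in __init__, so formatting the old messages would itself fail with an AttributeError that masked the intended error.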
@@ -40,6 +40,7 @@ def __init__(
self, dataset_paths, newline_int=10, header_lines=0, workers=None, tokenizer=None, sort_dataset_paths=True,
):
super().__init__()
self.mdata_midx_list = []

if len(dataset_paths) < 1:
raise ValueError("files_list must contain at leat one file name")
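The TextMemMapDataset change initialises mdata_midx_list before any validation can raise, so __del__ no longer fails on a partially built dataset. A generic sketch of the pattern (class and attribute names are illustrative, not the NeMo implementation):

# Initialise attributes referenced by __del__ before any check that can raise,
# so cleanup is safe even when __init__ exits part-way through.
class MemMapHolder:
    def __init__(self, paths):
        self._handles = []  # exists even if the next check fails
        if not paths:
            raise ValueError("paths must contain at least one file name")
        self._handles = [open(p, "rb") for p in paths]

    def __del__(self):
        for h in self._handles:  # never an AttributeError on failed init
            h.close()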
@@ -352,7 +352,7 @@ def allreduce_first_last_embeddings(self):
if parallel_state.get_pipeline_model_parallel_world_size() > 1 and (
parallel_state.is_pipeline_first_stage() or parallel_state.is_pipeline_last_stage()
):
if self.model.share_word_embeddings:
if self.model.share_token_embeddings:
word_embeddings_weight = self.model.word_embeddings_weight()
if self.megatron_amp_o2:
# O2 recipe stores a "main" copy of weights and grads
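The last hunk tracks the rename from share_word_embeddings to share_token_embeddings in the pipeline-parallel embedding-gradient all-reduce. A hedged sketch of that pattern, gated on the renamed flag (the standalone function signature, gradient attribute, and process-group argument below are assumptions, not this file's actual code):

import torch
from apex.transformer import parallel_state  # assumed, consistent with the snippet above

def allreduce_first_last_embeddings(model, embedding_group, megatron_amp_o2=False):
    # Only the first and last pipeline stages hold (tied) token embeddings.
    if parallel_state.get_pipeline_model_parallel_world_size() > 1 and (
        parallel_state.is_pipeline_first_stage() or parallel_state.is_pipeline_last_stage()
    ):
        if model.share_token_embeddings:
            weight = model.word_embeddings_weight()
            # The O2 recipe stores a "main" copy of weights and grads; otherwise use .grad.
            grad = weight.main_grad if megatron_amp_o2 else weight.grad
            torch.distributed.all_reduce(grad, group=embedding_group)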