diff --git a/README.rst b/README.rst index 0236acfcb982..d9b5b914ef85 100644 --- a/README.rst +++ b/README.rst @@ -45,11 +45,11 @@ Key Features * Speech processing * `Automatic Speech Recognition (ASR) `_ - * Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, ContextNet, ... + * Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, ... * Supports CTC and Transducer/RNNT losses/decoders * Beam Search decoding * `Language Modelling for ASR `_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer - * Streaming and Buffered ASR (CTC/Transdcer) - `Chunked Inference Examples `_ + * Streaming and Buffered ASR (CTC/Transducer) - `Chunked Inference Examples `_ * `Speech Classification and Speech Command Recognition `_: MatchboxNet (Command Recognition) * `Voice activity Detection (VAD) `_: MarbleNet * `Speaker Recognition `_: TitaNet, ECAPA_TDNN, SpeakerNet diff --git a/docs/source/asr/asr_all.bib b/docs/source/asr/asr_all.bib index c9d6f993fe03..cb6aea03e36f 100644 --- a/docs/source/asr/asr_all.bib +++ b/docs/source/asr/asr_all.bib @@ -997,3 +997,12 @@ @article{Dawalatabad_2021 month={Aug} } + +@inproceedings{he2019streaming, + title={Streaming end-to-end speech recognition for mobile devices}, + author={He, Yanzhang and Sainath, Tara N and Prabhavalkar, Rohit and McGraw, Ian and Alvarez, Raziel and Zhao, Ding and Rybach, David and Kannan, Anjuli and Wu, Yonghui and Pang, Ruoming and others}, + booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, + pages={6381--6385}, + year={2019}, + organization={IEEE} +} diff --git a/docs/source/asr/configs.rst b/docs/source/asr/configs.rst index 72eb377ad252..995674bc06e6 100644 --- a/docs/source/asr/configs.rst +++ b/docs/source/asr/configs.rst @@ -511,13 +511,22 @@ Conformer-Transducer Please refer to the model page of `Conformer-Transducer <./models.html#Conformer-Transducer>`__ for more information on this model. +LSTM-Transducer and LSTM-CTC +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The config files for the LSTM-Transducer and LSTM-CTC models can be found at ``/examples/asr/conf/lstm/lstm_transducer_bpe.yaml`` and ``/examples/asr/conf/lstm/lstm_ctc_bpe.yaml``, respectively. +Most of the configs are similar to those of other CTC or Transducer models. The main difference is the encoder part. +The encoder section includes the details of the RNN-based encoder architecture. You may find more information in the +config files and also in `nemo.collections.asr.modules.RNNEncoder <./api.html#nemo.collections.asr.modules.RNNEncoder>`__. + + Transducer Configurations ------------------------- All CTC-based ASR model configs can be modified to support Transducer loss training. Below, we discuss the modifications required in the config to enable Transducer training. All modifications are made to the ``model`` config. Model Defaults -~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~ It is a subsection to the model config representing the default values shared across the entire model represented as ``model.model_defaults``.
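The LSTM configs added above can be exercised with the usual NeMo example workflow. A minimal sketch (the manifest and tokenizer paths are placeholders, and using ``EncDecRNNTBPEModel`` with this config is an assumption based on the standard BPE Transducer setup, not something this patch prescribes):

.. code-block:: python

    import pytorch_lightning as pl
    from omegaconf import OmegaConf
    from nemo.collections.asr.models import EncDecRNNTBPEModel

    # Load the new LSTM-Transducer config and fill in the required (???) fields.
    cfg = OmegaConf.load("examples/asr/conf/lstm/lstm_transducer_bpe.yaml")
    cfg.model.train_ds.manifest_filepath = "train_manifest.json"     # placeholder path
    cfg.model.validation_ds.manifest_filepath = "dev_manifest.json"  # placeholder path
    cfg.model.tokenizer.dir = "tokenizer_dir"                        # placeholder path

    trainer = pl.Trainer(**cfg.trainer)
    model = EncDecRNNTBPEModel(cfg=cfg.model, trainer=trainer)
    trainer.fit(model)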
diff --git a/docs/source/asr/models.rst b/docs/source/asr/models.rst index 5efc9d9327de..d0c679832ee9 100644 --- a/docs/source/asr/models.rst +++ b/docs/source/asr/models.rst @@ -127,6 +127,22 @@ You may find the example config files of Conformer-Transducer model with charact ``/examples/asr/conf/conformer/conformer_transducer_char.yaml`` and with sub-word encoding at ``/examples/asr/conf/conformer/conformer_transducer_bpe.yaml``. +LSTM-Transducer +--------------- + +LSTM-Transducer is a model which uses RNNs (e.g. LSTM) in the encoder. The architecture of this model follows the suggestions in :cite:`asr-models-he2019streaming`. +It uses the RNNT/Transducer loss/decoder. The encoder consists of RNN layers (LSTM by default) with a lower projection size to increase efficiency. +Layer norm is added between the layers to stabilize training. +It can be trained/used in unidirectional or bidirectional mode. The unidirectional mode is fully causal and can easily be used for simple and efficient frame-wise streaming. However, the accuracy of this model is generally lower than that of models like Conformer and Citrinet. + +This model supports both sub-word level and character level encodings. You may find the example config file of the LSTM-Transducer model with wordpiece encoding at ``/examples/asr/conf/lstm/lstm_transducer_bpe.yaml``. +You can find more details on the config files for LSTM-Transducer models at `LSTM-Transducer <./configs.html#lstm-transducer>`__. + +LSTM-CTC +-------- + +The LSTM-CTC model is a CTC variant of the LSTM-Transducer model which uses CTC loss/decoding instead of Transducer loss/decoding. +You may find the example config file of the LSTM-CTC model with wordpiece encoding at ``/examples/asr/conf/lstm/lstm_ctc_bpe.yaml``. References ---------- diff --git a/examples/asr/conf/conformer/conformer_ctc_bpe.yaml b/examples/asr/conf/conformer/conformer_ctc_bpe.yaml index 153aeb5c067c..bf04362d9d02 100644 --- a/examples/asr/conf/conformer/conformer_ctc_bpe.yaml +++ b/examples/asr/conf/conformer/conformer_ctc_bpe.yaml @@ -30,6 +30,7 @@ model: sample_rate: 16000 log_prediction: true # enables logging sample predictions in the output during training ctc_reduction: 'mean_batch' + skip_nan_grad: false train_ds: manifest_filepath: ??? @@ -161,7 +162,7 @@ trainer: max_epochs: 1000 max_steps: null # computed at runtime if not set val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations - accelerator: gpu + accelerator: auto strategy: ddp accumulate_grad_batches: 1 gradient_clip_val: 0.0 diff --git a/examples/asr/conf/conformer/conformer_ctc_bpe_multilang.yaml b/examples/asr/conf/conformer/conformer_ctc_bpe_multilang.yaml index 3886bcc78bb7..d3913f1a5b4d 100644 --- a/examples/asr/conf/conformer/conformer_ctc_bpe_multilang.yaml +++ b/examples/asr/conf/conformer/conformer_ctc_bpe_multilang.yaml @@ -168,7 +168,8 @@ trainer: max_epochs: 1000 max_steps: null # computed at runtime if not set val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations - accelerator: ddp + accelerator: auto + strategy: ddp accumulate_grad_batches: 1 gradient_clip_val: 0.0 precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
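The encoder described in the models.rst text above is exposed as ``nemo.collections.asr.modules.RNNEncoder``. A minimal standalone sketch of pushing features through it (the sizes are illustrative, not the shipped defaults of ``n_layers: 8``, ``d_model: 2048``, ``proj_size: 640``):

.. code-block:: python

    import torch
    from nemo.collections.asr.modules import RNNEncoder

    # Small unidirectional (causal) LSTM encoder over 80-dim mel features.
    encoder = RNNEncoder(
        feat_in=80,
        n_layers=2,
        d_model=256,
        proj_size=160,            # per-layer output size after the LSTM projection
        rnn_type="lstm",
        bidirectional=False,      # False keeps the encoder causal / streamable
        subsampling="stacking",
        subsampling_factor=4,
    )
    feats = torch.randn(4, 80, 399)                   # (batch, features, frames)
    lens = torch.full((4,), 399, dtype=torch.int32)
    encoded, encoded_lens = encoder(audio_signal=feats, length=lens)
    # encoded: (4, 160, 100) -- proj_size channels, frames reduced 4x by stacking
    # encoded_lens: all 100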
diff --git a/examples/asr/conf/conformer/conformer_ctc_char.yaml b/examples/asr/conf/conformer/conformer_ctc_char.yaml index 79c23b6fc16e..cdcf733d435e 100644 --- a/examples/asr/conf/conformer/conformer_ctc_char.yaml +++ b/examples/asr/conf/conformer/conformer_ctc_char.yaml @@ -11,6 +11,7 @@ model: "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"] log_prediction: true # enables logging sample predictions in the output during training ctc_reduction: 'mean_batch' + skip_nan_grad: false train_ds: manifest_filepath: ??? @@ -136,7 +137,7 @@ trainer: max_epochs: 1000 max_steps: null # computed at runtime if not set val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations - accelerator: gpu + accelerator: auto strategy: ddp accumulate_grad_batches: 1 gradient_clip_val: 0.0 diff --git a/examples/asr/conf/conformer/conformer_transducer_bpe.yaml b/examples/asr/conf/conformer/conformer_transducer_bpe.yaml index c617401a6352..4699726c0d3f 100644 --- a/examples/asr/conf/conformer/conformer_transducer_bpe.yaml +++ b/examples/asr/conf/conformer/conformer_transducer_bpe.yaml @@ -26,6 +26,7 @@ model: sample_rate: &sample_rate 16000 compute_eval_loss: false # eval samples can be very long and exhaust memory. Disable computation of transducer loss during validation/testing with this flag. log_prediction: true # enables logging sample predictions in the output during training + skip_nan_grad: false model_defaults: enc_hidden: ${model.encoder.d_model} @@ -38,7 +39,7 @@ model: batch_size: 16 # you may increase batch_size if your memory allows shuffle: true num_workers: 8 - pin_memory: false + pin_memory: true use_start_end_token: false trim_silence: false max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset @@ -57,7 +58,7 @@ model: batch_size: 16 shuffle: false num_workers: 8 - pin_memory: false + pin_memory: true use_start_end_token: false test_ds: @@ -66,7 +67,7 @@ model: batch_size: 16 shuffle: false num_workers: 8 - pin_memory: false + pin_memory: true use_start_end_token: false # You may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py @@ -208,10 +209,10 @@ model: trainer: devices: -1 # number of GPUs, -1 would use all available GPUs num_nodes: 1 - max_epochs: 1000 + max_epochs: 500 max_steps: null # computed at runtime if not set val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations - accelerator: gpu + accelerator: auto strategy: ddp accumulate_grad_batches: 1 gradient_clip_val: 0.0 diff --git a/examples/asr/conf/conformer/conformer_transducer_bpe_multilang.yaml b/examples/asr/conf/conformer/conformer_transducer_bpe_multilang.yaml index d889b01a9131..5c2cd498e153 100644 --- a/examples/asr/conf/conformer/conformer_transducer_bpe_multilang.yaml +++ b/examples/asr/conf/conformer/conformer_transducer_bpe_multilang.yaml @@ -39,7 +39,7 @@ model: batch_size: 16 # you may increase batch_size if your memory allows shuffle: true num_workers: 8 - pin_memory: false + pin_memory: true use_start_end_token: false trim_silence: false max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset @@ -58,7 +58,7 @@ model: batch_size: 16 shuffle: false num_workers: 8 - pin_memory: false + pin_memory: true use_start_end_token: false test_ds: @@ -67,7 +67,7 @@ model: batch_size: 16 shuffle: false num_workers: 8 - pin_memory: false + pin_memory: true use_start_end_token: false # You may find more detail 
on how to train a monolingual tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py @@ -218,7 +218,8 @@ trainer: max_epochs: 1000 max_steps: null # computed at runtime if not set val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations - accelerator: ddp + accelerator: auto + strategy: ddp accumulate_grad_batches: 1 gradient_clip_val: 0.0 precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP. diff --git a/examples/asr/conf/conformer/conformer_transducer_char.yaml b/examples/asr/conf/conformer/conformer_transducer_char.yaml index 5ff5a46044eb..1beccce28312 100644 --- a/examples/asr/conf/conformer/conformer_transducer_char.yaml +++ b/examples/asr/conf/conformer/conformer_transducer_char.yaml @@ -26,6 +26,7 @@ model: sample_rate: &sample_rate 16000 compute_eval_loss: false # eval samples can be very long and exhaust memory. Disable computation of transducer loss during validation/testing with this flag. log_prediction: true # enables logging sample predictions in the output during training + skip_nan_grad: false labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"] @@ -41,7 +42,7 @@ model: batch_size: 16 # you may increase batch_size if your memory allows shuffle: true num_workers: 8 - pin_memory: false + pin_memory: true trim_silence: false max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset min_duration: 0.1 @@ -59,7 +60,7 @@ model: batch_size: 16 shuffle: false num_workers: 8 - pin_memory: false + pin_memory: true test_ds: manifest_filepath: null @@ -67,7 +68,7 @@ model: batch_size: 16 shuffle: false num_workers: 8 - pin_memory: false + pin_memory: true preprocessor: _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor @@ -203,10 +204,10 @@ model: trainer: devices: -1 # number of GPUs, -1 would use all available GPUs num_nodes: 1 - max_epochs: 1000 + max_epochs: 500 max_steps: null # computed at runtime if not set val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations - accelerator: gpu + accelerator: auto strategy: ddp accumulate_grad_batches: 1 gradient_clip_val: 0.0 diff --git a/examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml b/examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml index 4069e9318976..a13b4aa3d92b 100644 --- a/examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml +++ b/examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml @@ -187,7 +187,7 @@ model: # greedy strategy config greedy: - max_symbols: 30 + max_symbols: 10 # beam strategy config beam: diff --git a/examples/asr/conf/lstm/lstm_ctc_bpe.yaml b/examples/asr/conf/lstm/lstm_ctc_bpe.yaml new file mode 100644 index 000000000000..b011c495d9e9 --- /dev/null +++ b/examples/asr/conf/lstm/lstm_ctc_bpe.yaml @@ -0,0 +1,163 @@ +# It contains the default values for training an LSTM-CTC ASR model, large size (~170M for bidirectional and ~130M for unidirectional) with CTC loss and sub-word encoding. + +# Architecture and training config: +# Default learning parameters in this config are set for effective batch size of 2K. To train it with smaller effective +# batch sizes, you may need to re-tune the learning parameters or use higher accumulate_grad_batches. + +# Followed the architecture suggested in the following paper: +# 'STREAMING END-TO-END SPEECH RECOGNITION FOR MOBILE DEVICES' by Yanzhang He et al. 
(https://arxiv.org/pdf/1811.06621.pdf) + +# You may find more info about LSTM-CTC here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#lstm-transducer +# Pre-trained models of LSTM-CTC can be found here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html + +name: "LSTM-CTC-BPE" + +model: + sample_rate: 16000 + log_prediction: true # enables logging sample predictions in the output during training + ctc_reduction: 'mean_batch' + skip_nan_grad: false + + train_ds: + manifest_filepath: ??? + sample_rate: ${model.sample_rate} + batch_size: 16 # you may increase batch_size if your memory allows + shuffle: true + num_workers: 4 + pin_memory: true + use_start_end_token: false + trim_silence: false + max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset + min_duration: 0.1 + # tarred datasets + is_tarred: false + tarred_audio_filepaths: null + shuffle_n: 2048 + # bucketing params + bucketing_strategy: "synced_randomized" + bucketing_batch_size: null + + validation_ds: + manifest_filepath: ??? + sample_rate: ${model.sample_rate} + batch_size: 16 # you may increase batch_size if your memory allows + shuffle: false + num_workers: 4 + pin_memory: true + use_start_end_token: false + + test_ds: + manifest_filepath: null + sample_rate: ${model.sample_rate} + batch_size: 16 # you may increase batch_size if your memory allows + shuffle: false + num_workers: 4 + pin_memory: true + use_start_end_token: false + + # You may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py + tokenizer: + dir: ??? # path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe) + type: bpe # Can be either bpe (SentencePiece tokenizer) or wpe (WordPiece tokenizer) + + preprocessor: + _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor + sample_rate: ${model.sample_rate} + normalize: "per_feature" + window_size: 0.025 + window_stride: 0.01 + window: "hann" + features: 80 + n_fft: 512 + frame_splicing: 1 + dither: 0.00001 + pad_to: 0 + + spec_augment: + _target_: nemo.collections.asr.modules.SpectrogramAugmentation + freq_masks: 2 # set to zero to disable it + time_masks: 10 # set to zero to disable it + freq_width: 27 + time_width: 0.05 + + encoder: + _target_: nemo.collections.asr.modules.RNNEncoder + feat_in: ${model.preprocessor.features} + n_layers: 8 + d_model: 2048 + proj_size: 640 # you may set it if you need different output size other than the default d_model + rnn_type: "lstm" # it can be lstm, gru or rnn + bidirectional: true # need to set it to false if you want to make the model causal + + # Sub-sampling params + subsampling: stacking # stacking, vggnet or striding + subsampling_factor: 4 + subsampling_conv_channels: -1 # set to -1 to make it equal to the d_model + + ### regularization + dropout: 0.2 # The dropout used in most of the Conformer Modules + + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoder + feat_in: null + num_classes: -1 + vocabulary: [] + + optim: + name: adamw + lr: 5.0 + # optimizer arguments + betas: [0.9, 0.98] + weight_decay: 1e-2 + + # scheduler setup + sched: + name: NoamAnnealing + d_model: ${model.encoder.d_model} + # scheduler config override + warmup_steps: 10000 + warmup_ratio: null + min_lr: 1e-6 + +trainer: + devices: -1 # number of GPUs, -1 would use all available GPUs + num_nodes: 1 + max_epochs: 500 + max_steps: null # computed at runtime if not set + val_check_interval: 
1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations + accelerator: gpu + strategy: ddp + accumulate_grad_batches: 1 + gradient_clip_val: 0.3 + precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP. + log_every_n_steps: 10 # Interval of logging. + progress_bar_refresh_rate: 10 + resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc. + num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it + check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs + sync_batchnorm: true + enable_checkpointing: False # Provided by exp_manager + logger: false # Provided by exp_manager + + +exp_manager: + exp_dir: null + name: ${name} + create_tensorboard_logger: true + create_checkpoint_callback: true + checkpoint_callback_params: + # in case of multiple validation sets, first one is used + monitor: "val_wer" + mode: "min" + save_top_k: 5 + always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints + + # you need to set these two to True to continue the training + resume_if_exists: false + resume_ignore_no_checkpoint: false + + # You may use this section to create a W&B logger + create_wandb_logger: false + wandb_logger_kwargs: + name: null + project: null diff --git a/examples/asr/conf/lstm/lstm_transducer_bpe.yaml b/examples/asr/conf/lstm/lstm_transducer_bpe.yaml new file mode 100644 index 000000000000..6f30771e3e54 --- /dev/null +++ b/examples/asr/conf/lstm/lstm_transducer_bpe.yaml @@ -0,0 +1,227 @@ +# It contains the default values for training an LSTM-Transducer ASR model, large size (~170M for bidirectional and ~130M for unidirectional) with Transducer loss and sub-word encoding. + +# Architecture and training config: +# Default learning parameters in this config are set for effective batch size of 2K. To train it with smaller effective +# batch sizes, you may need to re-tune the learning parameters or use higher accumulate_grad_batches. + +# Followed the architecture suggested in the following paper: +# 'STREAMING END-TO-END SPEECH RECOGNITION FOR MOBILE DEVICES' by Yanzhang He et al. (https://arxiv.org/pdf/1811.06621.pdf) + +# You may find more info about LSTM-Transducer here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#lstm-transducer +# Pre-trained models of LSTM-Transducer can be found here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html + +name: "LSTM-Transducer-BPE" + +model: + sample_rate: 16000 + compute_eval_loss: false # eval samples can be very long and exhaust memory. Disable computation of transducer loss during validation/testing with this flag. + log_prediction: true # enables logging sample predictions in the output during training + skip_nan_grad: false + + model_defaults: + enc_hidden: 640 + pred_hidden: 640 + joint_hidden: 640 + rnn_hidden_size: 2048 + + train_ds: + manifest_filepath: ??? 
+ sample_rate: ${model.sample_rate} + batch_size: 16 # you may increase batch_size if your memory allows + shuffle: true + num_workers: 4 + pin_memory: true + use_start_end_token: false + trim_silence: false + max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset + min_duration: 0.1 + # tarred datasets + is_tarred: false + tarred_audio_filepaths: null + shuffle_n: 2048 + # bucketing params + bucketing_strategy: "synced_randomized" + bucketing_batch_size: null + + validation_ds: + manifest_filepath: ??? + sample_rate: ${model.sample_rate} + batch_size: 16 + shuffle: false + num_workers: 4 + pin_memory: true + use_start_end_token: false + + test_ds: + manifest_filepath: null + sample_rate: ${model.sample_rate} + batch_size: 16 + shuffle: false + num_workers: 4 + pin_memory: true + use_start_end_token: false + + # You may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py + tokenizer: + dir: ??? # path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe) + type: bpe # Can be either bpe (SentencePiece tokenizer) or wpe (WordPiece tokenizer) + + preprocessor: + _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor + sample_rate: ${model.sample_rate} + normalize: "per_feature" + window_size: 0.025 + window_stride: 0.01 + window: "hann" + features: 80 + n_fft: 512 + frame_splicing: 1 + dither: 0.00001 + pad_to: 0 + + spec_augment: + _target_: nemo.collections.asr.modules.SpectrogramAugmentation + freq_masks: 2 # set to zero to disable it + time_masks: 10 # set to zero to disable it + freq_width: 27 + time_width: 0.05 + + encoder: + _target_: nemo.collections.asr.modules.RNNEncoder + feat_in: ${model.preprocessor.features} + n_layers: 8 + d_model: 2048 + proj_size: ${model.model_defaults.pred_hidden} # you may set it if you need different output size other than the default d_model + rnn_type: "lstm" # it can be lstm, gru or rnn + bidirectional: true # need to set it to false if you want to make the model causal + + # Sub-sampling params + subsampling: stacking # stacking, vggnet or striding + subsampling_factor: 4 + subsampling_conv_channels: -1 # set to -1 to make it equal to the d_model + + ### regularization + dropout: 0.2 # The dropout used in most of the Conformer Modules + + decoder: + _target_: nemo.collections.asr.modules.RNNTDecoder + normalization_mode: null # Currently only null is supported for export. + random_state_sampling: false # Random state sampling: https://arxiv.org/pdf/1910.11455.pdf + blank_as_pad: true # This flag must be set in order to support exporting of RNNT models + efficient inference. + + prednet: + pred_hidden: ${model.model_defaults.pred_hidden} + pred_rnn_layers: 2 + t_max: null + dropout: 0.2 + rnn_hidden_size: 2048 + + joint: + _target_: nemo.collections.asr.modules.RNNTJoint + log_softmax: null # 'null' would set it automatically according to CPU/GPU device + preserve_memory: false # dramatically slows down training, but might preserve some memory + + # Fuses the computation of prediction net + joint net + loss + WER calculation + # to be run on sub-batches of size `fused_batch_size`. + # When this flag is set to true, consider the `batch_size` of *_ds to be just `encoder` batch size. + # `fused_batch_size` is the actual batch size of the prediction net, joint net and transducer loss. + # Using small values here will preserve a lot of memory during training, but will make training slower as well. 
+ # An optimal ratio of fused_batch_size : *_ds.batch_size is 1:1. + # However, to preserve memory, this ratio can be 1:8 or even 1:16. + # Extreme case of 1:B (i.e. fused_batch_size=1) should be avoided as training speed would be very slow. + fuse_loss_wer: true + fused_batch_size: 16 + + jointnet: + joint_hidden: ${model.model_defaults.joint_hidden} + activation: "relu" + dropout: 0.2 + + decoding: + strategy: "greedy_batch" # can be greedy, greedy_batch, beam, tsd, alsd. + + # greedy strategy config + greedy: + max_symbols: 10 + + # beam strategy config + beam: + beam_size: 2 + return_best_hypothesis: False + score_norm: true + tsd_max_sym_exp: 50 # for Time Synchronous Decoding + alsd_max_target_len: 2.0 # for Alignment-Length Synchronous Decoding + + loss: + loss_name: "default" + + warprnnt_numba_kwargs: + # FastEmit regularization: https://arxiv.org/abs/2010.11148 + # You may enable FastEmit to reduce the latency of the model for streaming + # using fastemit_lambda=1e-3 can help the accuracy of the model when it is unidirectional + fastemit_lambda: 0.0 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start. + + # Adds Gaussian noise to the gradients of the decoder to avoid overfitting + variational_noise: + start_step: 0 + std: 0.0 + + optim: + name: adamw + lr: 5.0 + # optimizer arguments + betas: [0.9, 0.98] + weight_decay: 1e-2 + + # scheduler setup + sched: + name: NoamAnnealing + d_model: ${model.encoder.d_model} + # scheduler config override + warmup_steps: 10000 + warmup_ratio: null + min_lr: 1e-6 + +trainer: + devices: -1 # number of GPUs, -1 would use all available GPUs + num_nodes: 1 + max_epochs: 500 + max_steps: null # computed at runtime if not set + val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations + accelerator: auto + strategy: ddp + accumulate_grad_batches: 1 + gradient_clip_val: 0.3 + precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP. + log_every_n_steps: 10 # Interval of logging. + progress_bar_refresh_rate: 10 + resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc. + num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it + check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs + sync_batchnorm: true + enable_checkpointing: False # Provided by exp_manager + logger: false # Provided by exp_manager + + +exp_manager: + exp_dir: null + name: ${name} + create_tensorboard_logger: true + create_checkpoint_callback: true + checkpoint_callback_params: + # in case of multiple validation sets, first one is used + monitor: "val_wer" + mode: "min" + save_top_k: 5 + always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints + + # you need to set these two to True to continue the training + resume_if_exists: false + resume_ignore_no_checkpoint: false + + # You may use this section to create a W&B logger + create_wandb_logger: false + wandb_logger_kwargs: + name: null + project: null + diff --git a/nemo/collections/asr/models/asr_model.py b/nemo/collections/asr/models/asr_model.py index 425ca8ff6417..6b72087002b4 100644 --- a/nemo/collections/asr/models/asr_model.py +++ b/nemo/collections/asr/models/asr_model.py @@ -11,6 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
# See the License for the specific language governing permissions and # limitations under the License. +import logging from abc import ABC, abstractmethod from typing import List, Optional, Union @@ -61,6 +62,23 @@ def list_available_models(cls) -> 'List[PretrainedModelInfo]': list_of_models = model_utils.resolve_subclass_pretrained_model_info(cls) return list_of_models + def on_after_backward(self): + """ + zero-out the gradients which any of them is NAN or INF + """ + super().on_after_backward() + if "skip_nan_grad" in self._cfg and self._cfg["skip_nan_grad"]: + valid_gradients = True + for param_name, param in self.named_parameters(): + if param.grad is not None: + valid_gradients = not (torch.isnan(param.grad).any() or torch.isinf(param.grad).any()) + if not valid_gradients: + break + + if not valid_gradients: + logging.warning(f'detected inf or nan values in gradients! Setting gradients to zero.') + self.zero_grad() + class ExportableEncDecModel(Exportable): """ diff --git a/nemo/collections/asr/modules/__init__.py b/nemo/collections/asr/modules/__init__.py index d0a098c798c7..3034afd4a0e3 100644 --- a/nemo/collections/asr/modules/__init__.py +++ b/nemo/collections/asr/modules/__init__.py @@ -32,4 +32,5 @@ ) from nemo.collections.asr.modules.graph_decoder import ViterbiDecoderWithGraph from nemo.collections.asr.modules.lstm_decoder import LSTMDecoder +from nemo.collections.asr.modules.rnn_encoder import RNNEncoder from nemo.collections.asr.modules.rnnt import RNNTDecoder, RNNTJoint diff --git a/nemo/collections/asr/modules/rnn_encoder.py b/nemo/collections/asr/modules/rnn_encoder.py new file mode 100644 index 000000000000..36c6e4e19f0b --- /dev/null +++ b/nemo/collections/asr/modules/rnn_encoder.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import OrderedDict + +import torch +import torch.distributed +import torch.nn as nn + +from nemo.collections.asr.parts.submodules.subsampling import ConvSubsampling, StackingSubsampling +from nemo.core.classes.common import typecheck +from nemo.core.classes.exportable import Exportable +from nemo.core.classes.module import NeuralModule +from nemo.core.neural_types import AcousticEncodedRepresentation, LengthsType, NeuralType, SpectrogramType + +__all__ = ['RNNEncoder'] + + +class RNNEncoder(NeuralModule, Exportable): + """ + The RNN-based encoder for ASR models. + Followed the architecture suggested in the following paper: + 'STREAMING END-TO-END SPEECH RECOGNITION FOR MOBILE DEVICES' by Yanzhang He et al. + https://arxiv.org/pdf/1811.06621.pdf + + + Args: + feat_in (int): the size of feature channels + n_layers (int): number of layers of RNN + d_model (int): the hidden size of the model + proj_size (int): the size of the output projection after each RNN layer + rnn_type (str): the type of the RNN layers, choices=['lstm, 'gru', 'rnn'] + bidirectional (float): specifies whether RNN layers should be bidirectional or not + Defaults to True. 
+ feat_out (int): the size of the output features + Defaults to -1 (means feat_out is d_model) + subsampling (str): the method of subsampling, choices=['stacking, 'vggnet', 'striding'] + Defaults to stacking. + subsampling_factor (int): the subsampling factor + Defaults to 4. + subsampling_conv_channels (int): the size of the convolutions in the subsampling module for vggnet and striding + Defaults to -1 which would set it to d_model. + dropout (float): the dropout rate used between all layers + Defaults to 0.2. + """ + + def input_example(self): + """ + Generates input examples for tracing etc. + Returns: + A tuple of input examples. + """ + input_example = torch.randn(16, self._feat_in, 256).to(next(self.parameters()).device) + input_example_length = torch.randint(0, 256, (16,)).to(next(self.parameters()).device) + return tuple([input_example, input_example_length]) + + @property + def input_types(self): + """Returns definitions of module input ports. + """ + return OrderedDict( + { + "audio_signal": NeuralType(('B', 'D', 'T'), SpectrogramType()), + "length": NeuralType(tuple('B'), LengthsType()), + } + ) + + @property + def output_types(self): + """Returns definitions of module output ports. + """ + return OrderedDict( + { + "outputs": NeuralType(('B', 'D', 'T'), AcousticEncodedRepresentation()), + "encoded_lengths": NeuralType(tuple('B'), LengthsType()), + } + ) + + def __init__( + self, + feat_in: int, + n_layers: int, + d_model: int, + proj_size: int = -1, + rnn_type: str = 'lstm', + bidirectional: bool = True, + subsampling: str = 'striding', + subsampling_factor: int = 4, + subsampling_conv_channels: int = -1, + dropout: float = 0.2, + ): + super().__init__() + + self.d_model = d_model + self._feat_in = feat_in + + if subsampling_conv_channels == -1: + subsampling_conv_channels = proj_size + if subsampling and subsampling_factor > 1: + if subsampling == 'stacking': + self.pre_encode = StackingSubsampling( + subsampling_factor=subsampling_factor, feat_in=feat_in, feat_out=proj_size + ) + else: + self.pre_encode = ConvSubsampling( + subsampling=subsampling, + subsampling_factor=subsampling_factor, + feat_in=feat_in, + feat_out=proj_size, + conv_channels=subsampling_conv_channels, + activation=nn.ReLU(), + ) + else: + self.pre_encode = nn.Linear(feat_in, proj_size) + + self._feat_out = proj_size + + self.layers = nn.ModuleList() + + SUPPORTED_RNN = {"lstm": nn.LSTM, "gru": nn.GRU, "rnn": nn.RNN} + if rnn_type not in SUPPORTED_RNN: + raise ValueError(f"rnn_type can be one from the following:{SUPPORTED_RNN.keys()}") + else: + rnn_module = SUPPORTED_RNN[rnn_type] + + for i in range(n_layers): + rnn_proj_size = proj_size // 2 if bidirectional else proj_size + if rnn_type == "lstm": + layer = rnn_module( + input_size=self._feat_out, + hidden_size=d_model, + num_layers=1, + batch_first=True, + bidirectional=bidirectional, + proj_size=rnn_proj_size, + ) + self.layers.append(layer) + self.layers.append(nn.LayerNorm(proj_size)) + self.layers.append(nn.Dropout(p=dropout)) + self._feat_out = proj_size + + @typecheck() + def forward(self, audio_signal, length=None): + max_audio_length: int = audio_signal.size(-1) + + if length is None: + length = audio_signal.new_full( + audio_signal.size(0), max_audio_length, dtype=torch.int32, device=self.seq_range.device + ) + + audio_signal = torch.transpose(audio_signal, 1, 2) + + if isinstance(self.pre_encode, ConvSubsampling) or isinstance(self.pre_encode, StackingSubsampling): + audio_signal, length = self.pre_encode(audio_signal, length) + else: + 
audio_signal = self.pre_encode(audio_signal) + + for lth, layer in enumerate(self.layers): + audio_signal = layer(audio_signal) + if isinstance(audio_signal, tuple): + audio_signal, _ = audio_signal + + audio_signal = torch.transpose(audio_signal, 1, 2) + return audio_signal, length diff --git a/nemo/collections/asr/modules/rnnt.py b/nemo/collections/asr/modules/rnnt.py index 6d6ba3b777ef..39d4e6d16875 100644 --- a/nemo/collections/asr/modules/rnnt.py +++ b/nemo/collections/asr/modules/rnnt.py @@ -166,6 +166,7 @@ def __init__( weights_init_scale=weights_init_scale, hidden_hidden_bias_scale=hidden_hidden_bias_scale, dropout=dropout, + rnn_hidden_size=prednet.get("rnn_hidden_size", -1), ) self._rnnt_export = False @@ -292,6 +293,7 @@ def _predict_modules( weights_init_scale, hidden_hidden_bias_scale, dropout, + rnn_hidden_size, ): """ Prepare the trainable parameters of the Prediction Network. @@ -308,6 +310,7 @@ def _predict_modules( hidden_hidden_bias_scale: Float scale for the hidden-to-hidden bias scale. Set to 0.0 for the default behaviour. dropout: Whether to apply dropout to RNN. + rnn_hidden_size: the hidden size of the RNN, if not specified, pred_n_hidden would be used """ if self.blank_as_pad: embed = torch.nn.Embedding(vocab_size + 1, pred_n_hidden, padding_idx=self.blank_idx) @@ -319,7 +322,7 @@ def _predict_modules( "embed": embed, "dec_rnn": rnn.rnn( input_size=pred_n_hidden, - hidden_size=pred_n_hidden, + hidden_size=rnn_hidden_size if rnn_hidden_size > 0 else pred_n_hidden, num_layers=pred_rnn_layers, norm=norm, forget_gate_bias=forget_gate_bias, @@ -327,6 +330,7 @@ def _predict_modules( dropout=dropout, weights_init_scale=weights_init_scale, hidden_hidden_bias_scale=hidden_hidden_bias_scale, + proj_size=pred_n_hidden if pred_n_hidden < rnn_hidden_size else 0, ), } ) diff --git a/nemo/collections/asr/parts/submodules/subsampling.py b/nemo/collections/asr/parts/submodules/subsampling.py index 3455905c82d4..6b27f8c2cfb4 100644 --- a/nemo/collections/asr/parts/submodules/subsampling.py +++ b/nemo/collections/asr/parts/submodules/subsampling.py @@ -18,11 +18,34 @@ import torch.nn as nn +class StackingSubsampling(torch.nn.Module): + """Stacking subsampling which simply stacks consecutive frames to reduce the sampling rate + Args: + subsampling_factor (int): The subsampling factor + feat_in (int): size of the input features + feat_out (int): size of the output features + """ + + def __init__(self, subsampling_factor, feat_in, feat_out): + super(StackingSubsampling, self).__init__() + self.subsampling_factor = subsampling_factor + self.proj_out = torch.nn.Linear(subsampling_factor * feat_in, feat_out) + + def forward(self, x, lengths): + b, t, h = x.size() + pad_size = self.subsampling_factor - (t % self.subsampling_factor) + x = torch.nn.functional.pad(x, (0, 0, 0, pad_size)) + _, t, _ = x.size() + x = torch.reshape(x, (b, t // self.subsampling_factor, h * self.subsampling_factor)) + x = self.proj_out(x) + lengths = torch.div(lengths + pad_size, self.subsampling_factor, rounding_mode='floor') + return x, lengths + + class ConvSubsampling(torch.nn.Module): """Convolutional subsampling which supports VGGNet and striding approach introduced in: - VGGNet Subsampling: https://arxiv.org/pdf/1910.12977.pdf - Striding Subsampling: - "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition" by Linhao Dong et al. 
+ VGGNet Subsampling: Transformer-transducer: end-to-end speech recognition with self-attention (https://arxiv.org/pdf/1910.12977.pdf) + Striding Subsampling: "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition" by Linhao Dong et al. (https://ieeexplore.ieee.org/document/8462506) Args: subsampling (str): The subsampling technique from {"vggnet", "striding"} subsampling_factor (int): The subsampling factor which should be a power of 2 diff --git a/nemo/collections/common/parts/rnn.py b/nemo/collections/common/parts/rnn.py index 81e30778d860..c248c17fc11f 100644 --- a/nemo/collections/common/parts/rnn.py +++ b/nemo/collections/common/parts/rnn.py @@ -33,6 +33,7 @@ def rnn( t_max: Optional[int] = None, weights_init_scale: float = 1.0, hidden_hidden_bias_scale: float = 0.0, + proj_size: int = 0, ) -> torch.nn.Module: """ Utility function to provide unified interface to common LSTM RNN modules. @@ -84,6 +85,7 @@ def rnn( t_max=t_max, weights_init_scale=weights_init_scale, hidden_hidden_bias_scale=hidden_hidden_bias_scale, + proj_size=proj_size, ) if norm == "batch": @@ -98,11 +100,12 @@ def rnn( norm_first_rnn=norm_first_rnn, weights_init_scale=weights_init_scale, hidden_hidden_bias_scale=hidden_hidden_bias_scale, + proj_size=proj_size, ) if norm == "layer": return torch.jit.script( - ln_lstm( # torch.jit.script( + ln_lstm( input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, @@ -156,6 +159,7 @@ def __init__( t_max: Optional[int] = None, weights_init_scale: float = 1.0, hidden_hidden_bias_scale: float = 0.0, + proj_size: int = 0, ): """Returns an LSTM with forget gate bias init to `forget_gate_bias`. Args: @@ -187,7 +191,7 @@ def __init__( super(LSTMDropout, self).__init__() self.lstm = torch.nn.LSTM( - input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, dropout=dropout, + input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, dropout=dropout, proj_size=proj_size ) if t_max is not None: @@ -244,6 +248,7 @@ def __init__( t_max: Optional[int] = None, weights_init_scale: float = 1.0, hidden_hidden_bias_scale: float = 0.0, + proj_size: int = 0, ): super().__init__() @@ -261,6 +266,7 @@ def __init__( t_max=t_max, weights_init_scale=weights_init_scale, hidden_hidden_bias_scale=hidden_hidden_bias_scale, + proj_size=proj_size, ) else: self.rnn = rnn_type(input_size=input_size, hidden_size=hidden_size, bias=not batch_norm) @@ -299,6 +305,7 @@ def __init__( t_max: Optional[int] = None, weights_init_scale: float = 1.0, hidden_hidden_bias_scale: float = 0.0, + proj_size: int = 0, ): super().__init__() self.rnn_layers = rnn_layers @@ -317,6 +324,7 @@ def __init__( t_max=t_max, weights_init_scale=weights_init_scale, hidden_hidden_bias_scale=hidden_hidden_bias_scale, + proj_size=proj_size, ) )
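For clarity on the new ``StackingSubsampling`` module used by the RNN encoder, a small standalone sketch (sizes chosen for illustration):

.. code-block:: python

    import torch
    from nemo.collections.asr.parts.submodules.subsampling import StackingSubsampling

    # Stack every 4 consecutive frames and project the concatenation back to 640 features.
    sub = StackingSubsampling(subsampling_factor=4, feat_in=80, feat_out=640)
    x = torch.randn(2, 397, 80)          # (batch, frames, features)
    lens = torch.tensor([397, 311])
    y, new_lens = sub(x, lens)
    # y: (2, 100, 640) -- frames are padded to a multiple of 4, then stacked 4-at-a-time
    # new_lens: tensor([100, 78])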