
Adding RNN encoder for LSTM-Transducer and LSTM-CTC models #3886

Merged (40 commits) on Apr 2, 2022
81ed609
added initial rnn encoder.
VahidooX Jan 6, 2022
f072143
added rnn encoder and decoder.
VahidooX Jan 8, 2022
a719cdf
added stackingdownsampling.
VahidooX Jan 8, 2022
948186a
added stackingdownsampling.
VahidooX Jan 8, 2022
d303b43
added stackingdownsampling.
VahidooX Jan 8, 2022
6781679
fixed the bug for bidirectional.
VahidooX Feb 13, 2022
267be30
fixed the bug for bidirectional.
VahidooX Feb 13, 2022
fd0da28
added skip_nan_grad.
VahidooX Mar 24, 2022
fdfac60
added skip_nan_grad.
VahidooX Mar 24, 2022
e3e9edf
added rnn tpype.
VahidooX Mar 25, 2022
f2df1d9
added rnn tpype.
VahidooX Mar 25, 2022
2d48b86
cleaned the configs.
VahidooX Mar 25, 2022
8f0f844
cleaned the configs.
VahidooX Mar 25, 2022
d29af7f
added docs.
VahidooX Mar 25, 2022
46cb7f7
added docs.
VahidooX Mar 25, 2022
d01ec23
added docs.
VahidooX Mar 25, 2022
9892388
changed proj_out to proj_size
VahidooX Mar 26, 2022
c38f70c
changed proj_out to proj_size
VahidooX Mar 26, 2022
a189aa0
changed proj_out to proj_size
VahidooX Mar 26, 2022
e9bb886
set default to bpe.
VahidooX Mar 26, 2022
3e7bd6b
cleaned.
VahidooX Mar 26, 2022
285d918
addressed comments.
VahidooX Mar 31, 2022
851324f
CHANGED names.
VahidooX Mar 31, 2022
75a9984
CHANGED names.
VahidooX Mar 31, 2022
9166d5b
added types.
VahidooX Mar 31, 2022
ac77eb7
fixed proj_size in configs.
VahidooX Mar 31, 2022
d1355a0
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Mar 31, 2022
d6e2db2
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Mar 31, 2022
6649533
fixed style.
VahidooX Mar 31, 2022
253d441
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Mar 31, 2022
4a9aec4
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Mar 31, 2022
7ff3717
pull from main.
VahidooX Mar 31, 2022
20d81ec
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Apr 1, 2022
6df0349
pulled from main.
VahidooX Apr 1, 2022
cee5e0e
replaced proj_size with pred_hidden.
VahidooX Apr 1, 2022
abbff06
replaced proj_size with pred_hidden.
VahidooX Apr 1, 2022
70472a2
Merge branch 'main' into add_rnn_encoder_main
titu1994 Apr 2, 2022
bceb362
replaced proj_size with pred_hidden.
VahidooX Apr 2, 2022
bdedaf5
Merge remote-tracking branch 'origin/add_rnn_encoder_main' into add_r…
VahidooX Apr 2, 2022
35a2a5e
replaced proj_size with pred_hidden.
VahidooX Apr 2, 2022
4 changes: 2 additions & 2 deletions README.rst
@@ -45,11 +45,11 @@ Key Features

* Speech processing
* `Automatic Speech Recognition (ASR) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html>`_
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, ContextNet, ...
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, ...
* Supports CTC and Transducer/RNNT losses/decoders
* Beam Search decoding
* `Language Modelling for ASR <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
* Streaming and Buffered ASR (CTC/Transdcer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/main/examples/asr/asr_chunked_inference>`_
* Streaming and Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/main/examples/asr/asr_chunked_inference>`_
* `Speech Classification and Speech Command Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html>`_: MatchboxNet (Command Recognition)
* `Voice activity Detection (VAD) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html#marblenet-vad>`_: MarbleNet
* `Speaker Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html>`_: TitaNet, ECAPA_TDNN, SpeakerNet
9 changes: 9 additions & 0 deletions docs/source/asr/asr_all.bib
@@ -997,3 +997,12 @@ @article{Dawalatabad_2021
month={Aug}
}


@inproceedings{he2019streaming,
title={Streaming end-to-end speech recognition for mobile devices},
author={He, Yanzhang and Sainath, Tara N and Prabhavalkar, Rohit and McGraw, Ian and Alvarez, Raziel and Zhao, Ding and Rybach, David and Kannan, Anjuli and Wu, Yonghui and Pang, Ruoming and others},
booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={6381--6385},
year={2019},
organization={IEEE}
}
11 changes: 10 additions & 1 deletion docs/source/asr/configs.rst
@@ -511,13 +511,22 @@ Conformer-Transducer

Please refer to the model page of `Conformer-Transducer <./models.html#Conformer-Transducer>`__ for more information on this model.

LSTM-Transducer and LSTM-CTC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The config files for LSTM-Transducer and LSTM-CTC models can be found at ``<NeMo_git_root>/examples/asr/conf/lstm/lstm_transducer_bpe.yaml`` and ``<NeMo_git_root>/examples/asr/conf/lstm/lstm_ctc_bpe.yaml`` respectively.
Most of the configs are similar to those of other CTC or Transducer models. The main difference is the encoder part.
The encoder section includes the details about the RNN-based encoder architecture. You may find more information in the
config files and also :doc:`nemo.collections.asr.modules.RNNEncoder<./api.html#nemo.collections.asr.modules.RNNEncoder>`.
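
A minimal sketch of that encoder section is shown below. The ``_target_`` class is the one referenced above, but the remaining keys and values are assumptions for illustration only; consult the shipped ``lstm_transducer_bpe.yaml`` for the authoritative field names:

```yaml
encoder:
  _target_: nemo.collections.asr.modules.RNNEncoder
  # Keys below are illustrative; see the example config for the exact names.
  rnn_type: lstm        # LSTM by default; this PR also adds an rnn type option
  bidirectional: false  # false keeps the encoder fully causal for streaming
  proj_size: 640        # reduced projection size between layers for efficiency
```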


Transducer Configurations
-------------------------

All CTC-based ASR model configs can be modified to support Transducer loss training. Below, we discuss the modifications required in the config to enable Transducer training. All modifications are made to the ``model`` config.

Model Defaults
~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~

This is a subsection of the model config, ``model.model_defaults``, which holds default values shared across the entire model.
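
For instance, the Conformer-Transducer example config in this PR defines its shared sizes roughly as follows (the ``enc_hidden`` interpolation appears in the diff below in this PR; the ``pred_hidden`` value here is illustrative):

```yaml
model:
  model_defaults:
    # Shared sizes referenced elsewhere in the config via interpolation,
    # e.g. ${model.model_defaults.enc_hidden}
    enc_hidden: ${model.encoder.d_model}  # encoder output dimension
    pred_hidden: 640                      # prediction network hidden size (illustrative)
```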

16 changes: 16 additions & 0 deletions docs/source/asr/models.rst
@@ -127,6 +127,22 @@ You may find the example config files of Conformer-Transducer model with character-based encoding at
``<NeMo_git_root>/examples/asr/conf/conformer/conformer_transducer_char.yaml`` and
with sub-word encoding at ``<NeMo_git_root>/examples/asr/conf/conformer/conformer_transducer_bpe.yaml``.

LSTM-Transducer
---------------

LSTM-Transducer is a model which uses RNNs (e.g., LSTM) in the encoder. The architecture of this model follows the suggestions in :cite:`asr-models-he2019streaming`.
It uses the RNNT/Transducer loss/decoder. The encoder consists of RNN layers (LSTM by default) with a reduced projection size to increase efficiency.
Layer norm is added between the layers to stabilize training.
It can be trained and used in unidirectional or bidirectional mode. The unidirectional mode is fully causal, so it can easily be used for simple and efficient frame-wise streaming. However, its accuracy is generally lower than that of models like Conformer and Citrinet.
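
The core idea described above, stacked LSTM layers with a reduced projection size and layer normalization between them, can be sketched in plain PyTorch. This is an illustrative toy, not NeMo's actual ``RNNEncoder`` implementation; the class name and sizes are assumptions:

```python
import torch
import torch.nn as nn

class TinyRNNEncoder(nn.Module):
    """Toy sketch: LSTM layers with projected outputs and inter-layer LayerNorm."""

    def __init__(self, feat_in: int = 80, hidden: int = 512, proj: int = 256, n_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList()
        self.norms = nn.ModuleList()
        in_dim = feat_in
        for _ in range(n_layers):
            # proj_size > 0 makes the LSTM emit proj-dimensional outputs
            # from hidden-dimensional cells, reducing downstream compute.
            self.layers.append(nn.LSTM(in_dim, hidden, proj_size=proj, batch_first=True))
            self.norms.append(nn.LayerNorm(proj))
            in_dim = proj

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_in); unidirectional layers keep this fully causal.
        for lstm, norm in zip(self.layers, self.norms):
            x, _ = lstm(x)
            x = norm(x)  # layer norm between layers stabilizes training
        return x

enc = TinyRNNEncoder()
out = enc(torch.randn(2, 50, 80))
print(out.shape)  # torch.Size([2, 50, 256])
```

Bidirectional mode would roughly double the output width and break causality, which is why the streaming variant described above stays unidirectional.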

This model supports both sub-word level and character level encodings. You may find the example config file of the RNNT model with wordpiece encoding at ``<NeMo_git_root>/examples/asr/conf/lstm/lstm_transducer_bpe.yaml``.
You can find more details on the config files for the RNNT models at `LSTM-Transducer <./configs.html#lstm-transducer>`__.

LSTM-CTC
--------

The LSTM-CTC model is a CTC variant of the LSTM-Transducer model, using CTC loss/decoding instead of Transducer.
You may find the example config file of the LSTM-CTC model with wordpiece encoding at ``<NeMo_git_root>/examples/asr/conf/lstm/lstm_ctc_bpe.yaml``.

References
----------
3 changes: 2 additions & 1 deletion examples/asr/conf/conformer/conformer_ctc_bpe.yaml
@@ -30,6 +30,7 @@ model:
sample_rate: 16000
log_prediction: true # enables logging sample predictions in the output during training
ctc_reduction: 'mean_batch'
skip_nan_grad: false
titu1994 marked this conversation as resolved.

train_ds:
manifest_filepath: ???
@@ -161,7 +162,7 @@ trainer:
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: gpu
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
3 changes: 2 additions & 1 deletion examples/asr/conf/conformer/conformer_ctc_bpe_multilang.yaml
@@ -168,7 +168,8 @@ trainer:
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: ddp
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
3 changes: 2 additions & 1 deletion examples/asr/conf/conformer/conformer_ctc_char.yaml
@@ -11,6 +11,7 @@ model:
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
log_prediction: true # enables logging sample predictions in the output during training
ctc_reduction: 'mean_batch'
skip_nan_grad: false

train_ds:
manifest_filepath: ???
@@ -136,7 +137,7 @@ trainer:
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: gpu
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
11 changes: 6 additions & 5 deletions examples/asr/conf/conformer/conformer_transducer_bpe.yaml
@@ -26,6 +26,7 @@ model:
sample_rate: &sample_rate 16000
compute_eval_loss: false # eval samples can be very long and exhaust memory. Disable computation of transducer loss during validation/testing with this flag.
log_prediction: true # enables logging sample predictions in the output during training
skip_nan_grad: false

model_defaults:
enc_hidden: ${model.encoder.d_model}
@@ -38,7 +39,7 @@
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: true
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
@@ -57,7 +58,7 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false

test_ds:
@@ -66,7 +67,7 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false

# You may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
@@ -208,10 +209,10 @@ model:
trainer:
devices: -1 # number of GPUs, -1 would use all available GPUs
num_nodes: 1
max_epochs: 1000
max_epochs: 500
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: gpu
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
@@ -39,7 +39,7 @@ model:
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: true
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
@@ -58,7 +58,7 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false

test_ds:
@@ -67,7 +67,7 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false

# You may find more detail on how to train a monolingual tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
@@ -218,7 +218,8 @@ trainer:
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: ddp
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
11 changes: 6 additions & 5 deletions examples/asr/conf/conformer/conformer_transducer_char.yaml
@@ -26,6 +26,7 @@ model:
sample_rate: &sample_rate 16000
compute_eval_loss: false # eval samples can be very long and exhaust memory. Disable computation of transducer loss during validation/testing with this flag.
log_prediction: true # enables logging sample predictions in the output during training
skip_nan_grad: false

labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
@@ -41,7 +42,7 @@ model:
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: true
num_workers: 8
pin_memory: false
pin_memory: true
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
min_duration: 0.1
@@ -59,15 +60,15 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true

test_ds:
manifest_filepath: null
sample_rate: ${model.sample_rate}
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true

preprocessor:
_target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
@@ -203,10 +204,10 @@ model:
trainer:
devices: -1 # number of GPUs, -1 would use all available GPUs
num_nodes: 1
max_epochs: 1000
max_epochs: 500
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: gpu
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
2 changes: 1 addition & 1 deletion examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml
@@ -187,7 +187,7 @@ model:

# greedy strategy config
greedy:
max_symbols: 30
max_symbols: 10

# beam strategy config
beam: