
Adding RNN encoder for LSTM-Transducer and LSTM-CTC models #3886

Merged (40 commits) on Apr 2, 2022
81ed609
added initial rnn encoder.
VahidooX Jan 6, 2022
f072143
added rnn encoder and decoder.
VahidooX Jan 8, 2022
a719cdf
added stackingdownsampling.
VahidooX Jan 8, 2022
948186a
added stackingdownsampling.
VahidooX Jan 8, 2022
d303b43
added stackingdownsampling.
VahidooX Jan 8, 2022
6781679
fixed the bug for bidirectional.
VahidooX Feb 13, 2022
267be30
fixed the bug for bidirectional.
VahidooX Feb 13, 2022
fd0da28
added skip_nan_grad.
VahidooX Mar 24, 2022
fdfac60
added skip_nan_grad.
VahidooX Mar 24, 2022
e3e9edf
added rnn tpype.
VahidooX Mar 25, 2022
f2df1d9
added rnn tpype.
VahidooX Mar 25, 2022
2d48b86
cleaned the configs.
VahidooX Mar 25, 2022
8f0f844
cleaned the configs.
VahidooX Mar 25, 2022
d29af7f
added docs.
VahidooX Mar 25, 2022
46cb7f7
added docs.
VahidooX Mar 25, 2022
d01ec23
added docs.
VahidooX Mar 25, 2022
9892388
changed proj_out to proj_size
VahidooX Mar 26, 2022
c38f70c
changed proj_out to proj_size
VahidooX Mar 26, 2022
a189aa0
changed proj_out to proj_size
VahidooX Mar 26, 2022
e9bb886
set default to bpe.
VahidooX Mar 26, 2022
3e7bd6b
cleaned.
VahidooX Mar 26, 2022
285d918
addressed comments.
VahidooX Mar 31, 2022
851324f
CHANGED names.
VahidooX Mar 31, 2022
75a9984
CHANGED names.
VahidooX Mar 31, 2022
9166d5b
added types.
VahidooX Mar 31, 2022
ac77eb7
fixed proj_size in configs.
VahidooX Mar 31, 2022
d1355a0
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Mar 31, 2022
d6e2db2
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Mar 31, 2022
6649533
fixed style.
VahidooX Mar 31, 2022
253d441
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Mar 31, 2022
4a9aec4
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Mar 31, 2022
7ff3717
pull from main.
VahidooX Mar 31, 2022
20d81ec
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_rnn_en…
VahidooX Apr 1, 2022
6df0349
pulled from main.
VahidooX Apr 1, 2022
cee5e0e
replaced proj_size with pred_hidden.
VahidooX Apr 1, 2022
abbff06
replaced proj_size with pred_hidden.
VahidooX Apr 1, 2022
70472a2
Merge branch 'main' into add_rnn_encoder_main
titu1994 Apr 2, 2022
bceb362
replaced proj_size with pred_hidden.
VahidooX Apr 2, 2022
bdedaf5
Merge remote-tracking branch 'origin/add_rnn_encoder_main' into add_r…
VahidooX Apr 2, 2022
35a2a5e
replaced proj_size with pred_hidden.
VahidooX Apr 2, 2022
4 changes: 2 additions & 2 deletions README.rst
@@ -45,11 +45,11 @@ Key Features

* Speech processing
* `Automatic Speech Recognition (ASR) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html>`_
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, ContextNet, ...
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, ...
* Supports CTC and Transducer/RNNT losses/decoders
* Beam Search decoding
* `Language Modelling for ASR <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
* Streaming and Buffered ASR (CTC/Transdcer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/main/examples/asr/asr_chunked_inference>`_
* Streaming and Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/main/examples/asr/asr_chunked_inference>`_
* `Speech Classification and Speech Command Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html>`_: MatchboxNet (Command Recognition)
* `Voice activity Detection (VAD) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html#marblenet-vad>`_: MarbleNet
* `Speaker Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html>`_: TitaNet, ECAPA_TDNN, SpeakerNet
9 changes: 9 additions & 0 deletions docs/source/asr/asr_all.bib
@@ -997,3 +997,12 @@ @article{Dawalatabad_2021
month={Aug}
}


@inproceedings{he2019streaming,
title={Streaming end-to-end speech recognition for mobile devices},
author={He, Yanzhang and Sainath, Tara N and Prabhavalkar, Rohit and McGraw, Ian and Alvarez, Raziel and Zhao, Ding and Rybach, David and Kannan, Anjuli and Wu, Yonghui and Pang, Ruoming and others},
booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={6381--6385},
year={2019},
organization={IEEE}
}
11 changes: 10 additions & 1 deletion docs/source/asr/configs.rst
@@ -511,13 +511,22 @@ Conformer-Transducer

Please refer to the model page of `Conformer-Transducer <./models.html#Conformer-Transducer>`__ for more information on this model.

LSTM-Transducer and LSTM-CTC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The config files for LSTM-Transducer and LSTM-CTC models can be found at ``<NeMo_git_root>/examples/asr/conf/lstm/lstm_transducer_bpe.yaml`` and ``<NeMo_git_root>/examples/asr/conf/lstm/lstm_ctc_bpe.yaml`` respectively.
Most of the configs are similar to those of other CTC or Transducer models. The main difference is the encoder part.
The encoder section includes the details about the RNN-based encoder architecture. You may find more information in the
config files and also :doc:`nemo.collections.asr.modules.RNNEncoder<./api.html#nemo.collections.asr.modules.RNNEncoder>`.
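
A minimal sketch of that encoder section is shown below. The ``_target_`` class is the one referenced above, but the remaining keys and values are assumptions for illustration only; consult the shipped ``lstm_transducer_bpe.yaml`` for the authoritative field names:

```yaml
encoder:
  _target_: nemo.collections.asr.modules.RNNEncoder
  # Keys below are illustrative; see the example config for the exact names.
  rnn_type: lstm        # LSTM by default; this PR also adds an rnn type option
  bidirectional: false  # false keeps the encoder fully causal for streaming
  proj_size: 640        # reduced projection size between layers for efficiency
```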


Transducer Configurations
-------------------------

All CTC-based ASR model configs can be modified to support Transducer loss training. Below, we discuss the modifications required in the config to enable Transducer training. All modifications are made to the ``model`` config.

Model Defaults
~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~

This is a subsection of the model config, ``model.model_defaults``, which holds default values shared across the entire model.
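
For instance, the Conformer-Transducer example config in this PR defines its shared sizes roughly as follows (the ``enc_hidden`` interpolation appears in the diff below in this PR; the ``pred_hidden`` value here is illustrative):

```yaml
model:
  model_defaults:
    # Shared sizes referenced elsewhere in the config via interpolation,
    # e.g. ${model.model_defaults.enc_hidden}
    enc_hidden: ${model.encoder.d_model}  # encoder output dimension
    pred_hidden: 640                      # prediction network hidden size (illustrative)
```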

16 changes: 16 additions & 0 deletions docs/source/asr/models.rst
@@ -127,6 +127,22 @@ You may find the example config files of Conformer-Transducer model with character-based encoding at
``<NeMo_git_root>/examples/asr/conf/conformer/conformer_transducer_char.yaml`` and
with sub-word encoding at ``<NeMo_git_root>/examples/asr/conf/conformer/conformer_transducer_bpe.yaml``.

LSTM-Transducer
---------------

LSTM-Transducer is a model which uses RNNs (e.g., LSTM) in the encoder. The architecture of this model follows the suggestions in :cite:`asr-models-he2019streaming`.
It uses the RNNT/Transducer loss/decoder. The encoder consists of RNN layers (LSTM by default) with a reduced projection size to increase efficiency.
Layer norm is added between the layers to stabilize training.
It can be trained and used in unidirectional or bidirectional mode. The unidirectional mode is fully causal, so it can easily be used for simple and efficient frame-wise streaming. However, its accuracy is generally lower than that of models like Conformer and Citrinet.
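
The core idea described above, stacked LSTM layers with a reduced projection size and layer normalization between them, can be sketched in plain PyTorch. This is an illustrative toy, not NeMo's actual ``RNNEncoder`` implementation; the class name and sizes are assumptions:

```python
import torch
import torch.nn as nn

class TinyRNNEncoder(nn.Module):
    """Toy sketch: LSTM layers with projected outputs and inter-layer LayerNorm."""

    def __init__(self, feat_in: int = 80, hidden: int = 512, proj: int = 256, n_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList()
        self.norms = nn.ModuleList()
        in_dim = feat_in
        for _ in range(n_layers):
            # proj_size > 0 makes the LSTM emit proj-dimensional outputs
            # from hidden-dimensional cells, reducing downstream compute.
            self.layers.append(nn.LSTM(in_dim, hidden, proj_size=proj, batch_first=True))
            self.norms.append(nn.LayerNorm(proj))
            in_dim = proj

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_in); unidirectional layers keep this fully causal.
        for lstm, norm in zip(self.layers, self.norms):
            x, _ = lstm(x)
            x = norm(x)  # layer norm between layers stabilizes training
        return x

enc = TinyRNNEncoder()
out = enc(torch.randn(2, 50, 80))
print(out.shape)  # torch.Size([2, 50, 256])
```

Bidirectional mode would roughly double the output width and break causality, which is why the streaming variant described above stays unidirectional.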

This model supports both sub-word level and character level encodings. You may find the example config file of the RNNT model with wordpiece encoding at ``<NeMo_git_root>/examples/asr/conf/lstm/lstm_transducer_bpe.yaml``.
You can find more details on the config files for the RNNT models at `LSTM-Transducer <./configs.html#lstm-transducer>`__.

LSTM-CTC
--------

The LSTM-CTC model is a CTC variant of the LSTM-Transducer model, using CTC loss/decoding instead of Transducer.
You may find the example config file of the LSTM-CTC model with wordpiece encoding at ``<NeMo_git_root>/examples/asr/conf/lstm/lstm_ctc_bpe.yaml``.

References
----------
3 changes: 2 additions & 1 deletion examples/asr/conf/conformer/conformer_ctc_bpe.yaml
@@ -30,6 +30,7 @@ model:
sample_rate: 16000
log_prediction: true # enables logging sample predictions in the output during training
ctc_reduction: 'mean_batch'
skip_nan_grad: false
titu1994 marked this conversation as resolved.

train_ds:
manifest_filepath: ???
@@ -161,7 +162,7 @@ trainer:
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: gpu
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
3 changes: 2 additions & 1 deletion examples/asr/conf/conformer/conformer_ctc_bpe_multilang.yaml
@@ -168,7 +168,8 @@ trainer:
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: ddp
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
3 changes: 2 additions & 1 deletion examples/asr/conf/conformer/conformer_ctc_char.yaml
@@ -11,6 +11,7 @@ model:
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
log_prediction: true # enables logging sample predictions in the output during training
ctc_reduction: 'mean_batch'
skip_nan_grad: false

train_ds:
manifest_filepath: ???
@@ -136,7 +137,7 @@ trainer:
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: gpu
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
11 changes: 6 additions & 5 deletions examples/asr/conf/conformer/conformer_transducer_bpe.yaml
@@ -26,6 +26,7 @@ model:
sample_rate: &sample_rate 16000
compute_eval_loss: false # eval samples can be very long and exhaust memory. Disable computation of transducer loss during validation/testing with this flag.
log_prediction: true # enables logging sample predictions in the output during training
skip_nan_grad: false

model_defaults:
enc_hidden: ${model.encoder.d_model}
@@ -38,7 +39,7 @@
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: true
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
@@ -57,7 +58,7 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false

test_ds:
@@ -66,7 +67,7 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false

# You may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
@@ -208,10 +209,10 @@ model:
trainer:
devices: -1 # number of GPUs, -1 would use all available GPUs
num_nodes: 1
max_epochs: 1000
max_epochs: 500
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: gpu
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
@@ -39,7 +39,7 @@ model:
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: true
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
@@ -58,7 +58,7 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false

test_ds:
@@ -67,7 +67,7 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true
use_start_end_token: false

# You may find more detail on how to train a monolingual tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
@@ -218,7 +218,8 @@ trainer:
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: ddp
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
11 changes: 6 additions & 5 deletions examples/asr/conf/conformer/conformer_transducer_char.yaml
@@ -26,6 +26,7 @@ model:
sample_rate: &sample_rate 16000
compute_eval_loss: false # eval samples can be very long and exhaust memory. Disable computation of transducer loss during validation/testing with this flag.
log_prediction: true # enables logging sample predictions in the output during training
skip_nan_grad: false

labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
@@ -41,7 +42,7 @@ model:
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: true
num_workers: 8
pin_memory: false
pin_memory: true
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
min_duration: 0.1
@@ -59,15 +60,15 @@ model:
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true

test_ds:
manifest_filepath: null
sample_rate: ${model.sample_rate}
batch_size: 16
shuffle: false
num_workers: 8
pin_memory: false
pin_memory: true

preprocessor:
_target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
@@ -203,10 +204,10 @@ model:
trainer:
devices: -1 # number of GPUs, -1 would use all available GPUs
num_nodes: 1
max_epochs: 1000
max_epochs: 500
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: gpu
accelerator: auto
strategy: ddp
accumulate_grad_batches: 1
gradient_clip_val: 0.0
2 changes: 1 addition & 1 deletion examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml
@@ -187,7 +187,7 @@ model:

# greedy strategy config
greedy:
max_symbols: 30
max_symbols: 10

# beam strategy config
beam: