Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge r1.11 to main #4920

Merged
merged 34 commits into from
Sep 13, 2022
Merged
Show file tree
Hide file tree
Changes from 33 commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
c92c486
update branch
ericharper Aug 4, 2022
0c5e980
update package info and dockerfile
ericharper Aug 4, 2022
8e8d822
[TTS] bugfix for missing configs. (#4725)
XuesongYang Aug 11, 2022
5fa9e13
Fix pynini install in TTS tutorials (#4729)
redoctopus Aug 12, 2022
d977f6b
[TTS] updated config with a German IPA phoneme tokenizer (#4756)
XuesongYang Aug 18, 2022
77025dd
Update r1.11 to new heteronyms list (#4745)
redoctopus Aug 18, 2022
c61769b
Fix tutorial formatting (#4778)
redoctopus Aug 19, 2022
26f7dac
update branch and typos (#4788)
ericharper Aug 22, 2022
23f1269
Adding support for models trained with full context for cache-aware s…
VahidooX Aug 22, 2022
fa2542e
Update megatron encoder decoder model to support py37 for colab (#4791)
Zhilin123 Aug 23, 2022
8674eb8
[ASR] Add pretrained ASR models for Croatian (#4682)
anteju Aug 24, 2022
5cdc912
added/fixed export for Megatron models (#4712)
Davood-M Aug 24, 2022
04431ea
update branch for qa notebook
ericharper Aug 24, 2022
212ede2
Fix initializing weights from ptl ckpt with exclude (#4807)
sam1373 Aug 25, 2022
d44bdd7
Fix index error from addition of voiced_mask and p_voiced (#4811)
redoctopus Aug 26, 2022
be60005
T5 prompt learning fixes (#4771)
MaximumEntropy Aug 26, 2022
ae686e1
G2P docs (#4841)
ekmb Aug 30, 2022
279a9be
Fix providing glue in seq2seq eval (#4843)
MaximumEntropy Aug 31, 2022
ae5dc6a
Updated inference code and squad scripts (#4835)
vadam5 Aug 31, 2022
f5080f7
Set the number of workers to 0 for validation and test sets in all en…
MaximumEntropy Sep 2, 2022
83e6e89
Fix Megatron NMT consumed samples and ckpt_to_nemo split rank (#4884)
MaximumEntropy Sep 6, 2022
7c286ed
added utf8 encoding (#4892)
vadam5 Sep 7, 2022
d6af5c8
update readme with apex commit
ericharper Sep 8, 2022
bbf8bc0
Add support for Apex distributed Adam optimizer with GPT-3 (#4487)
timmoon10 Aug 15, 2022
1bfd9cb
update readme
ericharper Sep 8, 2022
5955276
update readme
ericharper Sep 8, 2022
0fc0b82
fixed styles
XuesongYang Sep 12, 2022
2db64fa
removed unsued import.
XuesongYang Sep 12, 2022
e196620
removed duplicated func defintion.
XuesongYang Sep 12, 2022
5c21d46
replace 'r1.11.0' with 'main' in Jenkinsfile and all tutorials.
XuesongYang Sep 12, 2022
120a4c3
fix: PRE_RELEASE = 'rc0'
XuesongYang Sep 12, 2022
ceb4157
replace branch name to main for asr_with_adapters.ipynb.
XuesongYang Sep 12, 2022
db06e61
fix Fastpitch mixertts tutorial format to align with main to distings…
XuesongYang Sep 12, 2022
b679599
fix: correct path for tokenizers.
XuesongYang Sep 12, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 12 additions & 4 deletions Jenkinsfile
Original file line number Diff line number Diff line change
Expand Up @@ -3146,8 +3146,10 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
inference.add_BOS=False \
trainer.devices=2 \
tensor_model_parallel_size=2 \
pred_file_path=/home/TestData/nlp/prompt_learning/p_tuning_test_tp_preds.txt \
data_paths=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl']"
sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_tp.nemo"
sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_tp_preds.txt"
}
}
stage('GPT Prompt Learning TP=1 PP=2') {
Expand All @@ -3173,8 +3175,10 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
inference.add_BOS=False \
trainer.devices=2 \
pipeline_model_parallel_size=2 \
pred_file_path=/home/TestData/nlp/prompt_learning/p_tuning_test_pp_preds.txt \
data_paths=['/home/TestData/nlp/prompt_learning/boolq_CI_test.jsonl']"
sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_pp.nemo"
sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_pp_preds.txt"
}
}
}
Expand Down Expand Up @@ -3433,7 +3437,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
trainer.max_steps=6 \
trainer.max_epochs=null \
model.tensor_model_parallel_size=1 \
model.pretrained_language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m-refactor.nemo' \
model.language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m-refactor.nemo' \
model.existing_tasks=[] \
model.new_tasks=['squad'] \
model.data.train_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \
Expand All @@ -3443,11 +3447,13 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test"
sh "python examples/nlp/language_modeling/megatron_t5_prompt_learning_eval.py \
virtual_prompt_model_file='/home/TestData/nlp/prompt_learning/t5_p_tuning_test.nemo' \
pretrained_language_model_file='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m-refactor.nemo' \
language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m-refactor.nemo' \
data.test_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \
pred_file_path='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_preds.txt' \
data.global_batch_size=4 \
data.micro_batch_size=4"
sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test.nemo"
sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_preds.txt"
}
}
stage('T5 Prompt Learning TP=2 PP=1') {
Expand All @@ -3459,7 +3465,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
trainer.max_steps=6 \
trainer.max_epochs=null \
model.tensor_model_parallel_size=2 \
model.pretrained_language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \
model.language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \
model.existing_tasks=[] \
model.new_tasks=['squad'] \
model.data.train_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \
Expand All @@ -3469,13 +3475,15 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2"
sh "python examples/nlp/language_modeling/megatron_t5_prompt_learning_eval.py \
virtual_prompt_model_file='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2.nemo' \
pretrained_language_model_file='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \
language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \
data.test_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \
pred_file_path='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2_preds.txt' \
tensor_model_parallel_size=2 \
trainer.devices=2 \
data.global_batch_size=8 \
data.micro_batch_size=8"
sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2.nemo"
sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2_preds.txt"
}
}
}
Expand Down
12 changes: 3 additions & 9 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -200,16 +200,10 @@ Megatron GPT training requires NVIDIA Apex to be installed.

.. code-block:: bash

git clone https://github.com/NVIDIA/apex
git clone https://github.com/ericharper/apex.git
XuesongYang marked this conversation as resolved.
Show resolved Hide resolved
cd apex
git checkout 3c19f1061879394f28272a99a7ea26d58f72dace
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" ./

.. note::

You may need to modify [setup.py](https://github.com/NVIDIA/apex/blob/3c19f1061879394f28272a99a7ea26d58f72dace/setup.py) if
your version of CUDA does not match the version used to compile Pytorch binaries, comment lines 33-41 in the above link
before installing.
git checkout nm_v1.11.0
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./

Docker containers:
~~~~~~~~~~~~~~~~~~
Expand Down
3 changes: 3 additions & 0 deletions docs/source/asr/data/benchmark_hr.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
Model,Model Base Class,Model Card
stt_hr_conformer_ctc_large,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_hr_conformer_ctc_large"
stt_hr_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_hr_conformer_transducer_large"
3 changes: 3 additions & 0 deletions docs/source/asr/data/scores/hr/conformer_hr.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
Model Name,Language,ParlaSpeech-HR v1.0 (dev),ParlaSpeech-HR v1.0 (test)
stt_hr_conformer_ctc_large,hr,4.43,4.70
stt_hr_conformer_transducer_large,hr,4.56,4.69
2 changes: 1 addition & 1 deletion docs/source/asr/datasets.rst
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,7 @@ The audio files can be of any format supported by `Pydub <https://github.com/jia
WAV files as they are the default and have been most thoroughly tested.

There should be one manifest file per dataset that will be passed in, therefore, if the user wants separate training and validation
datasets, they should also have separate manifests. Otherwise, thay will be loading validation data with their training data and vice
datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice
versa.

Each line of the manifest should be in the following format:
Expand Down
18 changes: 12 additions & 6 deletions docs/source/asr/models.rst
Original file line number Diff line number Diff line change
Expand Up @@ -132,15 +132,17 @@ Cache-aware Streaming Conformer

Buffered streaming uses overlapping chunks to make an offline ASR model to be used for streaming with reasonable accuracy. However, it uses significant amount of duplication in computations due to the overlapping chunks.
Also there is a accuracy gep between the offline model and the streaming one as there is inconsistency between how we train the model and how we perform inference for streaming.
The Cache-aware Streaming Conformer models would tackle and address these disadvantages. They are variants of Conformer which are trained with limited right context and it would make it possible to match the training and inference.
The Cache-aware Streaming Conformer models would tackle and address these disadvantages. These streaming Conformers are trained with limited right context that it would make it possible to match how the model is being used in both the training and inference.
They also uses caching to store intermediate activations to avoid any duplication in compute.
The cache-aware approach is supported for both the Conformer-CTC and Conformer-Transducer and enables the model to be used very efficiently for streaming.

Three categories of layers in Conformer have access to right tokens: 1-depthwise convolutions 2-self-attention, and 3-convolutions in downsampling layers.
Three categories of layers in Conformer have access to right tokens: 1-depthwise convolutions 2-self-attention, and 3-convolutions in the downsampling layers.
Streaming Conformer models uses causal convolutions or convolutions with lower right context and also self-attention with limited right context to limit the effective right context for the input.
The model trained with such limitations can be used in streaming mode and give the exact same output and accuracy as when the whole audio is given to the model in offline mode.
The model trained with such limitations can be used in streaming mode and give the exact same outputs and accuracy as when the whole audio is given to the model in offline mode.
These model can use caching mechanism to store and reuse the activations during streaming inference to avoid any duplications in the computations as much as possible.

We support the following three right context modeling:

* fully causal model with zero look-ahead: tokens would not see any future tokens. convolution layers are all causal and right tokens are masked for self-attention.
It gives zero latency but with limited accuracy.
To train such a model, you need to set `encoder.att_context_size=[left_context, 0]` and `encoder.conv_context_size=causal` in the config.
Expand All @@ -155,9 +157,9 @@ This approach is more efficient than regular look-ahead in terms of computations
In terms of accuracy, this approach gives similar or even better results in term of accuracy than regular look-ahead as each token in each layer have access to more tokens on average. That is why we recommend to use this approach for streaming.


** Note: Latencies are based on the assumption that the forward time of the network is zero.
** Note: Latencies are based on the assumption that the forward time of the network is zero and it just estimates the time needed after a frame would be available until it is passed through the model.

Approaches with non-zero look-ahead can give significantly better accuracy by sacrificing latency. The latency can get controlled by the left context size.
Approaches with non-zero look-ahead can give significantly better accuracy by sacrificing latency. The latency can get controlled by the left context size. Increasing the right context would help the accuracy to a limit but would increase the compuation time.


In all modes, left context can be controlled by the number of tokens to be visible in the self-attention and the kernel size of the convolutions.
Expand All @@ -168,12 +170,16 @@ Left context of convolutions is dependent to the their kernel size while it can
Self-attention left context of around 6 secs would give close result to have unlimited left context. For a model with 4x downsampling and shift window of 10ms in the preprocessor, each token corresponds to 4*10=40ms.

If striding approach is used for downsampling, all the convolutions in downsampling would be fully causal and don't see future tokens.
It is recommended to use stacking for streaming model which is significantly faster and uses less memory.
You may use stacking for downsampling in the streaming models which is significantly faster and uses less memory.
It also does not some of the the limitations with striding and vggnet and you may use any downsampling rate.

You may find the example config files of cache-aware streaming Conformer models at
``<NeMo_git_root>/examples/asr/conf/conformer/streaming/conformer_transducer_bpe_streaming.yaml`` for Transducer variant and
at ``<NeMo_git_root>/examples/asr/conf/conformer/streaming/conformer_ctc_bpe.yaml`` for CTC variant.

To simulate cache-aware stremaing, you may use the script at ``<NeMo_git_root>/examples/asr/asr_streaming/speech_to_text_streaming_infer.py``. It can simulate streaming in single stream or multi-stream mode (in batches) for an ASR model.
This script can be used for models trained offline with full-context but the accuracy would not be great unless the chunk size is large enough which would result in high latency.
It is recommended to train a model in streaming model with limited context for this script. More info can be found in the script.

.. _LSTM-Transducer_model:

Expand Down
10 changes: 10 additions & 0 deletions docs/source/asr/scores.rst
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,16 @@ FR

--------------------

HR
^^

.. csv-table::
:header-rows: 1
:align: left
:file: data/scores/hr/conformer_hr.csv

--------------------

IT
^^

Expand Down
1 change: 1 addition & 0 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,7 @@
'nlp/text_normalization/tn_itn_all.bib',
'tools/tools_all.bib',
'tts_all.bib',
'text_processing/text_processing_all.bib',
'core/adapters/adapter_bib.bib',
]

Expand Down
9 changes: 9 additions & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ NVIDIA NeMo User Guide
nlp/machine_translation/machine_translation
nlp/text_normalization/intro
nlp/api
nlp/models


.. toctree::
Expand All @@ -60,6 +61,14 @@ NVIDIA NeMo User Guide
:caption: Common
:name: Common

text_processing/intro

.. toctree::
:maxdepth: 2
:caption: Text Processing
:name: Text Processing

text_processing/g2p/g2p
common/intro


Expand Down
Loading