From dba6d774f78426655d1bc38dd62bf8f43ab622a4 Mon Sep 17 00:00:00 2001
From: Matvei Novikov
Date: Tue, 4 Oct 2022 20:43:32 +0400
Subject: [PATCH] P&C Docs

Signed-off-by: Matvei Novikov
---
 docs/source/nlp/models.rst                    |   2 +-
 docs/source/nlp/nlp_all.bib                   |   7 +
 .../nlp/punctuation_and_capitalization.rst    |   8 -
 ...ation_and_capitalization_lexical_audio.rst | 390 ++++++++++++++++++
 .../punctuation_and_capitalization_models.rst |  31 ++
 5 files changed, 429 insertions(+), 9 deletions(-)
 create mode 100644 docs/source/nlp/punctuation_and_capitalization_lexical_audio.rst
 create mode 100644 docs/source/nlp/punctuation_and_capitalization_models.rst

diff --git a/docs/source/nlp/models.rst b/docs/source/nlp/models.rst
index 92ac0b7f596b..b153ef9bd04a 100755
--- a/docs/source/nlp/models.rst
+++ b/docs/source/nlp/models.rst
@@ -8,7 +8,7 @@ NeMo's NLP collection supports provides the following task-specific models:
 .. toctree::
    :maxdepth: 1
 
-   punctuation_and_capitalization
+   punctuation_and_capitalization_models
    token_classification
    joint_intent_slot
    text_classification
diff --git a/docs/source/nlp/nlp_all.bib b/docs/source/nlp/nlp_all.bib
index ed8bdce399c6..3f77b369c107 100644
--- a/docs/source/nlp/nlp_all.bib
+++ b/docs/source/nlp/nlp_all.bib
@@ -170,4 +170,11 @@ @inproceedings{koehnetal2007moses
     publisher = "Association for Computational Linguistics",
     url = "https://aclanthology.org/P07-2045",
     pages = "177--180",
+}
+
+@article{sunkara2020multimodal,
+  title={Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech},
+  author={Monica Sunkara and Srikanth Ronanki and Dhanush Bekal and Sravan Bodapati and Katrin Kirchhoff},
+  journal={arXiv preprint arXiv:2008.00702},
+  year={2020}
 }
\ No newline at end of file
diff --git a/docs/source/nlp/punctuation_and_capitalization.rst b/docs/source/nlp/punctuation_and_capitalization.rst
index 16a1e6856703..2c25c4f1277f 100755
--- a/docs/source/nlp/punctuation_and_capitalization.rst
+++ b/docs/source/nlp/punctuation_and_capitalization.rst
@@ -3,14 +3,6 @@
 Punctuation and Capitalization Model
 ====================================
 
-Automatic Speech Recognition (ASR) systems typically generate text with no punctuation and capitalization of the words.
-There are two issues with non-punctuated ASR output:
-
-- it could be difficult to read and understand
-- models for some downstream tasks, such as named entity recognition, machine translation, or text-to-speech, are
-  usually trained on punctuated datasets and using raw ASR output as the input to these models could deteriorate their
-  performance
-
 Quick Start Guide
 -----------------
 
diff --git a/docs/source/nlp/punctuation_and_capitalization_lexical_audio.rst b/docs/source/nlp/punctuation_and_capitalization_lexical_audio.rst
new file mode 100644
index 000000000000..5dbdb9cf8972
--- /dev/null
+++ b/docs/source/nlp/punctuation_and_capitalization_lexical_audio.rst
@@ -0,0 +1,390 @@
+.. _punctuation_and_capitalization_lexical_audio:
+
+Punctuation and Capitalization Lexical Audio Model
+==================================================
+
+Sometimes punctuation and capitalization cannot be restored based only on text. In such cases, audio can be used to improve the model's accuracy.
+
+Consider these examples:
+
+.. code::
+
+    Oh yeah? or Oh yeah.
+
+    We need to go? or We need to go.
+
+    Yeah, they make you work. Yeah, over there you walk a lot? or Yeah, they make you work. Yeah, over there you walk a lot.
+
+You can find more details on text-only punctuation and capitalization in `Punctuation And Capitalization's page `_. In this document, we focus on the model changes needed to use acoustic features.
+
+Quick Start Guide
+-----------------
+
+.. code-block:: python
+
+    from nemo.collections.nlp.models import PunctuationCapitalizationLexicalAudioModel
+
+    # to get the list of pre-trained models
+    PunctuationCapitalizationLexicalAudioModel.list_available_models()
+
+    # Download and load the pre-trained model
+    model = PunctuationCapitalizationLexicalAudioModel.from_pretrained("")
+
+    # try the model on a few examples
+    model.add_punctuation_capitalization(['how are you', 'great how about you'], audio_queries=['/path/to/1.wav', '/path/to/2.wav'], target_sr=16000)
+
+Model Description
+-----------------
+On top of the `Punctuation And Capitalization model `_, this model adds an audio encoder (e.g. Conformer's encoder) and attention-based fusion of lexical and audio features.
+This model architecture is based on `Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech `__ :cite:`nlp-punct-sunkara2020multimodal`.
+
+.. note::
+
+    An example script on how to train and evaluate the model can be found at: `NeMo/examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py `__.
+
+    The default configuration file for the model can be found at: `NeMo/examples/nlp/token_classification/conf/punctuation_capitalization_lexical_audio_config.yaml `__.
+
+    The script for inference can be found at: `NeMo/examples/nlp/token_classification/punctuate_capitalize_infer.py `__.
+
+.. _raw_data_format_punct:
+
+Raw Data Format
+---------------
+In addition to the data described in `Punctuation And Capitalization Raw Data Format `_, this model also requires audio data.
+You have to provide ``audio_train.txt`` and ``audio_dev.txt`` (and optionally ``audio_test.txt``), each containing one valid audio path per line.
+
+Example of the ``audio_train.txt``/``audio_dev.txt`` file:
+
+.. code::
+
+    /path/to/1.wav
+    /path/to/2.wav
+    ....
+In this case, the ``source_data_dir`` structure should look similar to the following:
+
+.. code::
+
+   .
+   |--source_data_dir
+     |-- dev.txt
+     |-- train.txt
+     |-- audio_train.txt
+     |-- audio_dev.txt
+
+.. _nemo-data-format-label:
+
+Tarred dataset
+--------------
+
+A tarred dataset is recommended when training on a large amount of data (>500 hours), because loading all of the audio into memory consumes a large amount of RAM and increases CPU usage.
+
+To create a tarred dataset with audio, you need data in NeMo format:
+
+.. code::
+
+    python examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py \
+        --text \
+        --labels \
+        --output_dir \
+        --num_batches_per_tarfile 100 \
+        --use_audio \
+        --audio_file \
+        --sample_rate 16000
+
+.. note::
+    You can change the sample rate to any positive integer. It is passed to the constructor of :class:`~nemo.collections.asr.parts.preprocessing.AudioSegment`. It is recommended to set ``sample_rate`` to the sample rate of the data used to train the ASR model.
+
+
+Training Punctuation and Capitalization Model
+---------------------------------------------
+
+The audio encoder is initialized from a pretrained ASR model. You can use any model returned by ``list_available_models()`` of ``EncDecCTCModel`` or your own checkpoint; either one should be provided in ``model.audio_encoder.pretrained_model``.
+To reduce compute, you can freeze the audio encoder during training and add additional ``ConformerLayer`` layers on top of the encoder with ``model.audio_encoder.freeze``. You can also add `Adapters `_ to reduce compute with ``model.audio_encoder.adapter``. Parameters of the fusion module are stored in ``model.audio_encoder.fusion``.
+An example of a model configuration file for training the model can be found at:
+`NeMo/examples/nlp/token_classification/conf/punctuation_capitalization_lexical_audio_config.yaml `__.
+
+Configs
+^^^^^^^^^^^^
+.. note::
+    This page contains only the parameters specific to the lexical and audio model. Other parameters can be found in `Punctuation And Capitalization's page `_.
+
+Model config
+^^^^^^^^^^^^
+
+A configuration of
+:class:`~nemo.collections.nlp.models.token_classification.punctuation_capitalization_lexical_audio_model.PunctuationCapitalizationLexicalAudioModel`
+model.
+
+.. list-table:: Model config
+   :widths: 5 5 10 25
+   :header-rows: 1
+
+   * - **Parameter**
+     - **Data type**
+     - **Default value**
+     - **Description**
+   * - **audio_encoder**
+     - :ref:`audio encoder config`
+     - :ref:`audio encoder config`
+     - A configuration for audio encoder.
+
+
+Data config
+^^^^^^^^^^^
+
+.. list-table:: Location of data configs in parent configs
+   :widths: 5 5
+   :header-rows: 1
+
+   * - **Parent config**
+     - **Keys in parent config**
+   * - :ref:`Run config`
+     - ``model.train_ds``, ``model.validation_ds``, ``model.test_ds``
+   * - :ref:`Model config`
+     - ``train_ds``, ``validation_ds``, ``test_ds``
+
+.. _regular-dataset-parameters-label:
+
+.. list-table:: Parameters for regular (:class:`~nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset.BertPunctuationCapitalizationDataset`) dataset
+   :widths: 5 5 5 30
+   :header-rows: 1
+
+   * - **Parameter**
+     - **Data type**
+     - **Default value**
+     - **Description**
+   * - **use_audio**
+     - bool
+     - ``false``
+     - If set to ``true``, the dataset will return audio as well as text.
+   * - **audio_file**
+     - string
+     - ``null``
+     - A path to a file with audio paths.
+   * - **sample_rate**
+     - int
+     - ``null``
+     - Target sample rate of the audio. Can be used for upsampling or downsampling of audio.
+   * - **use_bucketing**
+     - bool
+     - ``true``
+     - If set to ``true``, samples are sorted by audio length and batches are assembled more efficiently (less padding per batch). If set to ``false``, the dataset will return ``batch_size`` batches instead of ``number_of_tokens`` tokens.
+   * - **preload_audios**
+     - bool
+     - ``true``
+     - If set to ``true``, batches will include waveforms; if set to ``false``, ``audio_filepaths`` are stored instead and audio is loaded during the ``collate_fn`` call.
+
+
+.. _audio-encoder-config-label:
+
+Audio Encoder config
+^^^^^^^^^^^^^^^^^^^^
+
+.. list-table:: Audio Encoder Config
+   :widths: 5 5 10 25
+   :header-rows: 1
+
+   * - **Parameter**
+     - **Data type**
+     - **Default value**
+     - **Description**
+   * - **pretrained_model**
+     - string
+     - ``stt_en_conformer_ctc_medium``
+     - Pretrained model name or path to a ``.nemo`` file to take the audio encoder from.
+   * - **freeze**
+     - :ref:`freeze config`
+     - :ref:`freeze config`
+     - Configuration for freezing audio encoder's weights.
+   * - **adapter**
+     - :ref:`adapter config`
+     - :ref:`adapter config`
+     - Configuration for adapter.
+   * - **fusion**
+     - :ref:`fusion config`
+     - :ref:`fusion config`
+     - Configuration for fusion.
+
+
+.. _freeze-config-label:
+
+.. list-table:: Freeze Config
+   :widths: 5 5 10 25
+   :header-rows: 1
+
+   * - **Parameter**
+     - **Data type**
+     - **Default value**
+     - **Description**
+   * - **is_enabled**
+     - bool
+     - ``false``
+     - If set to ``true``, the encoder's weights will not be updated during training and additional ``ConformerLayer`` layers will be added.
+   * - **d_model**
+     - int
+     - ``256``
+     - Input dimension of ``MultiheadAttentionMechanism`` and ``PositionwiseFeedForward`` of additional ``ConformerLayer`` layers.
+   * - **d_ff**
+     - int
+     - ``1024``
+     - Hidden dimension of ``PositionwiseFeedForward`` of additional ``ConformerLayer`` layers.
+   * - **num_layers**
+     - int
+     - ``4``
+     - Number of additional ``ConformerLayer`` layers.
+
+
+.. _adapter-config-label:
+
+.. list-table:: Adapter Config
+   :widths: 5 5 10 25
+   :header-rows: 1
+
+   * - **Parameter**
+     - **Data type**
+     - **Default value**
+     - **Description**
+   * - **enable**
+     - bool
+     - ``false``
+     - If set to ``true``, adapters will be enabled for the audio encoder.
+   * - **config**
+     - ``LinearAdapterConfig``
+     - ``null``
+     - For more details, see the `nemo.collections.common.parts.LinearAdapterConfig `_ class.
+
+
+.. _fusion-config-label:
+
+.. list-table:: Fusion Config
+   :widths: 5 5 10 25
+   :header-rows: 1
+
+   * - **Parameter**
+     - **Data type**
+     - **Default value**
+     - **Description**
+   * - **num_layers**
+     - int
+     - ``4``
+     - Number of layers to use in fusion.
+   * - **num_attention_heads**
+     - int
+     - ``4``
+     - Number of attention heads to use in fusion.
+   * - **inner_size**
+     - int
+     - ``2048``
+     - Fusion inner size.
+
+
+
+Model training
+^^^^^^^^^^^^^^
+
+For more information, refer to the :ref:`nlp_model` section.
+
+To train the model from scratch, run:
+
+.. code::
+
+    python examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py \
+        model.train_ds.ds_item= \
+        model.train_ds.text_file= \
+        model.train_ds.labels_file= \
+        model.validation_ds.ds_item= \
+        model.validation_ds.text_file= \
+        model.validation_ds.labels_file= \
+        trainer.devices=[0,1] \
+        trainer.accelerator='gpu' \
+        optim.name=adam \
+        optim.lr=0.0001 \
+        model.train_ds.audio_file= \
+        model.validation_ds.audio_file=
+
+The above command will start model training on GPUs 0 and 1 with Adam optimizer and learning rate of 0.0001; and the
+trained model is stored in the ``nemo_experiments/Punctuation_and_Capitalization`` folder.
+
+To train from the pre-trained model, run:
+
+.. code::
+
+    python examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py \
+        model.train_ds.ds_item= \
+        model.train_ds.text_file= \
+        model.train_ds.labels_file= \
+        model.validation_ds.ds_item= \
+        model.validation_ds.text_file= \
+        model.validation_ds.labels_file= \
+        model.train_ds.audio_file= \
+        model.validation_ds.audio_file= \
+        pretrained_model=
+
+
+.. note::
+
+    All parameters defined in the configuration file can be changed with command arguments. For example, the sample
+    config file mentioned above has :code:`train_ds.tokens_in_batch` set to ``2048``. However, if you see that
+    the GPU utilization can be optimized further by using a larger batch size, you may override to the desired value
+    by adding the field :code:`train_ds.tokens_in_batch=4096` on the command line. You can repeat this with
+    any of the parameters defined in the sample configuration file.
+
+Inference
+---------
+
+Inference is performed by the script `examples/nlp/token_classification/punctuate_capitalize_infer.py `_
+
+.. code::
+
+    python punctuate_capitalize_infer.py \
+        --input_manifest \
+        --output_manifest \
+        --pretrained_name \
+        --max_seq_length 64 \
+        --margin 16 \
+        --step 8 \
+        --use_audio
+
+Long audio is split just like in the text-only case; audio sequences are treated the same as text sequences, except that :code:`max_seq_length` for audio equals :code:`max_seq_length*4000`.
+
+Model Evaluation
+----------------
+
+Model evaluation is performed by the same script
+`examples/nlp/token_classification/punctuation_capitalization_lexical_audio_train_evaluate.py
+`_
+as training.
+
+Use :ref:`config` parameter ``do_training=false`` to disable training and parameter ``do_testing=true``
+to enable testing. If both parameters ``do_training`` and ``do_testing`` are ``true``, then the model is trained and then
+tested.
+
+To start evaluation of the pre-trained model, run:
+
+.. code::
+
+    python punctuation_capitalization_lexical_audio_train_evaluate.py \
+        +model.do_training=false \
+        +model.do_testing=true \
+        model.test_ds.ds_item= \
+        pretrained_model= \
+        model.test_ds.text_file= \
+        model.test_ds.labels_file= \
+        model.test_ds.audio_file=
+
+
+Required Arguments
+^^^^^^^^^^^^^^^^^^
+
+- :code:`pretrained_model`: pretrained Punctuation and Capitalization Lexical Audio model from ``list_available_models()`` or path to a ``.nemo``
+  file. For example: ``your_model.nemo``.
+- :code:`model.test_ds.ds_item`: path to the directory that contains :code:`model.test_ds.text_file`, :code:`model.test_ds.labels_file` and :code:`model.test_ds.audio_file`
+
+References
+----------
+
+.. bibliography:: nlp_all.bib
+    :style: plain
+    :labelprefix: NLP-PUNCT
+    :keyprefix: nlp-punct-
+
diff --git a/docs/source/nlp/punctuation_and_capitalization_models.rst b/docs/source/nlp/punctuation_and_capitalization_models.rst
new file mode 100644
index 000000000000..6646e0895050
--- /dev/null
+++ b/docs/source/nlp/punctuation_and_capitalization_models.rst
@@ -0,0 +1,31 @@
+.. _punctuation_capitalization_models:
+
+Punctuation And Capitalization Models
+==============================================
+
+Automatic Speech Recognition (ASR) systems typically generate text with no punctuation and capitalization of the words.
+There are two issues with non-punctuated ASR output:
+
+- it could be difficult to read and understand
+- models for some downstream tasks, such as named entity recognition, machine translation, or text-to-speech, are
+  usually trained on punctuated datasets and using raw ASR output as the input to these models could deteriorate their
+  performance
+
+
+NeMo provides two types of Punctuation And Capitalization Models:
+
+Lexical only model:
+
+.. toctree::
+   :maxdepth: 1
+
+   punctuation_and_capitalization
+
+
+Lexical and audio model:
+
+.. toctree::
+   :maxdepth: 1
+
+   punctuation_and_capitalization_lexical_audio
+