Tutorials and Docs for Multi-scale Diarization Decoder (NVIDIA#4930)
Signed-off-by: Taejin Park <tango4j@gmail.com>
Co-authored-by: SeanNaren <snarenthiran@nvidia.com>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>
Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
3 people authored and jubick1337 committed Oct 4, 2022
1 parent 6f26229 commit 05ca052
Showing 43 changed files with 2,279 additions and 300 deletions.
41 changes: 35 additions & 6 deletions Jenkinsfile
@@ -359,6 +359,24 @@ pipeline {
sh 'rm -rf examples/speaker_tasks/recognition/speaker_recognition_results'
}
}

stage('Speaker Diarization') {
steps {
sh 'python examples/speaker_tasks/diarization/neural_diarizer/multiscale_diar_decoder.py \
model.diarizer.speaker_embeddings.model_path=titanet_large \
model.train_ds.batch_size=5 \
model.validation_ds.batch_size=5 \
model.train_ds.emb_dir=examples/speaker_tasks/diarization/speaker_diarization_results \
model.validation_ds.emb_dir=examples/speaker_tasks/diarization/speaker_diarization_results \
model.train_ds.manifest_filepath=/home/TestData/an4_diarizer/simulated_train/msdd_data.50step.json \
model.validation_ds.manifest_filepath=/home/TestData/an4_diarizer/simulated_valid/msdd_data.50step.json \
trainer.devices=[1] \
trainer.accelerator="gpu" \
+trainer.fast_dev_run=True \
exp_manager.exp_dir=examples/speaker_tasks/diarization/speaker_diarization_results'
sh 'rm -rf examples/speaker_tasks/diarization/speaker_diarization_results'
}
}

stage('Speech to Label') {
steps {
@@ -381,11 +399,10 @@
}
}


stage('Speaker Diarization with ASR Inference') {
steps {
sh 'python examples/speaker_tasks/diarization/clustering_diarizer/offline_diar_with_asr_infer.py \
diarizer.manifest_filepath=/home/TestData/an4_diarizer/an4_manifest.json \
diarizer.speaker_embeddings.model_path=/home/TestData/an4_diarizer/spkr.nemo \
diarizer.speaker_embeddings.parameters.save_embeddings=True \
diarizer.speaker_embeddings.parameters.window_length_in_sec=[1.5] \
Expand All @@ -398,18 +415,30 @@ pipeline {
}
}

-stage('Speaker Diarization Inference') {
stage('Clustering Diarizer Inference') {
steps {
sh 'python examples/speaker_tasks/diarization/clustering_diarizer/offline_diar_infer.py \
diarizer.manifest_filepath=/home/TestData/an4_diarizer/an4_manifest.json \
diarizer.speaker_embeddings.model_path=/home/TestData/an4_diarizer/spkr.nemo \
diarizer.speaker_embeddings.parameters.save_embeddings=True \
diarizer.speaker_embeddings.parameters.window_length_in_sec=1.5 \
diarizer.speaker_embeddings.parameters.shift_length_in_sec=0.75 \
diarizer.speaker_embeddings.parameters.multiscale_weights=null \
diarizer.vad.model_path=/home/TestData/an4_diarizer/MatchboxNet_VAD_3x2.nemo \
-diarizer.out_dir=examples/speaker_tasks/diarization/speaker_diarization_results'
-sh 'rm -rf examples/speaker_tasks/diarization/speaker_diarization_results'
diarizer.out_dir=examples/speaker_tasks/diarization/clustering_diarizer_results'
sh 'rm -rf examples/speaker_tasks/diarization/clustering_diarizer_results'
}
}

stage('Neural Diarizer Inference') {
steps {
sh 'python examples/speaker_tasks/diarization/neural_diarizer/multiscale_diar_decoder_infer.py \
diarizer.manifest_filepath=/home/TestData/an4_diarizer/an4_manifest.json \
diarizer.msdd_model.model_path=/home/TestData/an4_diarizer/diar_msdd_telephonic.nemo \
diarizer.speaker_embeddings.parameters.save_embeddings=True \
diarizer.vad.model_path=/home/TestData/an4_diarizer/MatchboxNet_VAD_3x2.nemo \
diarizer.out_dir=examples/speaker_tasks/diarization/neural_diarizer_results'
sh 'rm -rf examples/speaker_tasks/diarization/neural_diarizer_results'
}
}
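
For reference, the neural diarizer inference exercised by this stage can also be driven from Python. A minimal sketch: the config file and keys mirror the overrides above, while the manifest and output paths are placeholders.

    from omegaconf import OmegaConf
    from nemo.collections.asr.models.msdd_models import NeuralDiarizer

    # Load the inference config added in this PR; placeholder paths below.
    cfg = OmegaConf.load("examples/speaker_tasks/diarization/conf/inference/diar_infer_telephonic.yaml")
    cfg.diarizer.manifest_filepath = "an4_manifest.json"         # placeholder manifest
    cfg.diarizer.msdd_model.model_path = "diar_msdd_telephonic"  # pretrained name or local .nemo
    cfg.diarizer.out_dir = "neural_diarizer_results"

    diarizer_model = NeuralDiarizer(cfg=cfg)
    diarizer_model.diarize()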

4 changes: 3 additions & 1 deletion README.rst
@@ -53,7 +53,9 @@ Key Features
* `Speech Classification and Speech Command Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html>`_: MatchboxNet (Command Recognition)
* `Voice Activity Detection (VAD) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html#marblenet-vad>`_: MarbleNet
* `Speaker Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html>`_: TitaNet, ECAPA_TDNN, SpeakerNet
-* `Speaker Diarization <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html>`_: TitaNet, ECAPA_TDNN, SpeakerNet
* `Speaker Diarization <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html>`_
* Clustering Diarizer: TitaNet, ECAPA_TDNN, SpeakerNet
* Neural Diarizer: MSDD (Multi-scale Diarization Decoder)
* `Pretrained models on different languages. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_asr>`_: English, Spanish, German, Russian, Chinese, French, Italian, Polish, ...
* `NGC collection of pre-trained speech processing models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_asr>`_
* Natural Language Processing
12 changes: 12 additions & 0 deletions docs/source/asr/asr_all.bib
@@ -1057,3 +1057,15 @@ @misc{kim2022squeezeformer
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}

@misc{park2022multi,
doi = {10.48550/ARXIV.2203.15974},
url = {https://arxiv.org/abs/2203.15974},
author = {Park, Tae Jin and Koluguri, Nithin Rao and Balam, Jagadeesh and Ginsburg, Boris},
keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences},
title = {Multi-scale Speaker Diarization with Dynamic Scale Weighting},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}

6 changes: 5 additions & 1 deletion docs/source/asr/speaker_diarization/api.rst
@@ -6,7 +6,11 @@ Model Classes
-------------
.. autoclass:: nemo.collections.asr.models.ClusteringDiarizer
:show-inheritance:
:members:

.. autoclass:: nemo.collections.asr.models.EncDecDiarLabelModel
:show-inheritance:
:members: add_speaker_model_config, _init_segmentation_info, _init_speaker_model, setup_training_data, setup_validation_data, setup_test_data, get_ms_emb_seq, get_cluster_avg_embs_model, get_ms_mel_feat, forward, forward_infer, training_step, validation_step, compute_accuracies

Mixins
------
155 changes: 140 additions & 15 deletions docs/source/asr/speaker_diarization/configs.rst
@@ -1,26 +1,152 @@
NeMo Speaker Diarization Configuration Files
============================================

-Since speaker diarization model here is not a fully-trainable end-to-end model but an inference pipeline, we use **diarizer** instead of **model** which is used in other tasks.
-
-The diarizer section will generally require information about the dataset(s) being used, models used in this pipeline, as well as inference related parameters such as post processing of each models.
-The sections on this page cover each of these in more detail.
-
-Example configuration files for speaker diarization inference can be found in ``<NeMo_git_root>/examples/speaker_tasks/diarization/conf/inference/``. Choose a yaml file that fits your targeted domain. For example, if you want to diarize audio recordings of telephonic speech, choose ``diar_infer_telephonic.yaml``.
Both training and inference of speaker diarization are configured by ``.yaml`` files. The diarizer section generally requires information about the dataset(s) being used, the models used in the pipeline, and inference-related parameters such as post-processing for each model. The sections on this page cover each of these in more detail.

.. note::
-   For model details and deep understanding about configs, fine-tuning, tuning threshold, and evaluation,
-   please refer to ``<NeMo_git_root>/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb``;
   For model details and a deeper understanding of configs, training, fine-tuning, and evaluation,
   please refer to ``<NeMo_git_root>/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb`` and ``<NeMo_git_root>/tutorials/speaker_tasks/Speaker_Diarization_Training.ipynb``;
   for other applications such as possible integration with ASR, have a look at ``<NeMo_git_root>/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb``.


Hydra Configurations for Diarization Training
=============================================

Currently, NeMo supports the Multi-scale Diarization Decoder (MSDD) as its neural diarizer model. MSDD is a speaker diarization model that takes an initializing clustering result together with multi-scale segmentation inputs. Example configuration files for MSDD model training can be found in ``<NeMo_git_root>/examples/speaker_tasks/diarization/conf/neural_diarizer/``.

* Model name convention for MSDD: ``msdd_<number of scales>scl_<longest scale in decimal seconds (ds)>_<shortest scale in decimal seconds (ds)>_<overlap percentage of window shifting>Povl_<hidden layer size>x<number of LSTM layers>x<number of CNN output channels>x<repetition count of conv layer>``
* Example: ``msdd_5scl_15_05_50Povl_256x3x32x2.yaml`` has 5 scales, a longest scale of 1.5 sec, a shortest scale of 0.5 sec, 50 percent overlap, a hidden layer size of 256, 3 LSTM layers, 32 CNN output channels, and 2 repeated conv layers (unpacked in the sketch below).
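
As a quick illustration, the name can be unpacked mechanically. This is a hypothetical helper for this page, not part of NeMo:

.. code-block:: python

   import re

   # Decode an MSDD checkpoint name according to the convention above.
   name = "msdd_5scl_15_05_50Povl_256x3x32x2"
   m = re.match(r"msdd_(\d+)scl_(\d+)_(\d+)_(\d+)Povl_(\d+)x(\d+)x(\d+)x(\d+)", name)
   n_scales, longest_ds, shortest_ds, overlap_pct = (int(g) for g in m.groups()[:4])
   hidden_size, lstm_layers, cnn_channels, conv_repeat = (int(g) for g in m.groups()[4:])
   print(n_scales, longest_ds / 10.0, shortest_ds / 10.0, overlap_pct)  # 5 scales, 1.5 s, 0.5 s, 50% overlap
   print(hidden_size, lstm_layers, cnn_channels, conv_repeat)           # 256, 3, 32, 2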

The MSDD model checkpoint (``.ckpt``) and NeMo file (``.nemo``) contain the speaker embedding model (TitaNet), and the speaker model is loaded along with the standalone MSDD module. Note that MSDD models require more than one scale, so the parameters in ``diarizer.speaker_embeddings.parameters`` should contain more than one scale to function as an MSDD model.
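
For instance, a pretrained MSDD model can be loaded and its multiscale settings inspected. This is a minimal sketch: ``diar_msdd_telephonic`` is the pretrained model name listed in the checkpoint table of this documentation, and the config attribute path below is an assumption about the bundled config layout:

.. code-block:: python

   from nemo.collections.asr.models import EncDecDiarLabelModel

   # Download the pretrained MSDD model; the .nemo bundle also carries the TitaNet speaker model.
   msdd_model = EncDecDiarLabelModel.from_pretrained("diar_msdd_telephonic")
   # Assumed attribute path; an MSDD model should report more than one scale here.
   print(msdd_model.cfg.diarizer.speaker_embeddings.parameters.window_length_in_sec)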


General Diarizer Configuration
------------------------------

The items (OmegaConf keys) directly under the ``model`` key determine segmentation and clustering related parameters. The multi-scale parameters (``window_length_in_sec``, ``shift_length_in_sec``, and ``multiscale_weights``) are specified here. ``max_num_of_spks``, ``scale_n``, ``soft_label_thres``, and ``emb_batch_size`` are also set here and then assigned to the dataset configurations.

.. code-block:: yaml

   diarizer:
     out_dir: null
     oracle_vad: True # If True, uses RTTM files provided in manifest file to get speech activity (VAD) timestamps
     speaker_embeddings:
       model_path: ??? # .nemo local model path or pretrained model name (titanet_large is recommended)
       parameters:
         window_length_in_sec: [1.5,1.25,1.0,0.75,0.5] # Window length(s) in sec (floating-point number). either a number or a list. ex) 1.5 or [1.5,1.0,0.5]
         shift_length_in_sec: [0.75,0.625,0.5,0.375,0.25] # Shift length(s) in sec (floating-point number). either a number or a list. ex) 0.75 or [0.75,0.5,0.25]
         multiscale_weights: [1,1,1,1,1] # Weight for each scale. should be null (for single scale) or a list matched with window/shift scale count. ex) [0.33,0.33,0.33]
         save_embeddings: True # Save embeddings as pickle file for each audio input.

   num_workers: ${num_workers} # Number of workers used for data-loading.
   max_num_of_spks: 2 # Number of speakers per model. This is currently fixed at 2.
   scale_n: 5 # Number of scales for MSDD model and initializing clustering.
   soft_label_thres: 0.5 # Threshold for creating discretized speaker label from continuous speaker label in RTTM files.
   emb_batch_size: 0 # If this value is bigger than 0, corresponding number of embedding vectors are attached to torch graph and trained.

Dataset Configuration
---------------------

Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and
``test_ds`` sections in the configuration YAML file, respectively. Items such as ``num_spks``, ``soft_label_thres``, and ``emb_batch_size`` follow the settings under the ``model`` key. You may also leave fields such as ``manifest_filepath`` or ``emb_dir`` blank and specify them via the command-line interface. Note that ``test_ds`` is not used during training and is only used for speaker diarization inference.

.. code-block:: yaml

   train_ds:
     manifest_filepath: ???
     emb_dir: ???
     sample_rate: ${sample_rate}
     num_spks: ${model.max_num_of_spks}
     soft_label_thres: ${model.soft_label_thres}
     labels: null
     batch_size: ${batch_size}
     emb_batch_size: ${model.emb_batch_size}
     shuffle: True

   validation_ds:
     manifest_filepath: ???
     emb_dir: ???
     sample_rate: ${sample_rate}
     num_spks: ${model.max_num_of_spks}
     soft_label_thres: ${model.soft_label_thres}
     labels: null
     batch_size: 2
     emb_batch_size: ${model.emb_batch_size}
     shuffle: False

   test_ds:
     manifest_filepath: null
     emb_dir: null
     sample_rate: 16000
     num_spks: ${model.max_num_of_spks}
     soft_label_thres: ${model.soft_label_thres}
     labels: null
     batch_size: 2
     shuffle: False
     seq_eval_mode: False

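Each line of a ``manifest_filepath`` file is a JSON entry describing one recording and its RTTM annotation. A minimal sketch of writing such an entry (field names follow NeMo's diarization manifest convention; the paths are placeholders):

.. code-block:: python

   import json

   # Placeholder paths; substitute your own audio and RTTM files.
   entry = {
       "audio_filepath": "/data/session0.wav",
       "offset": 0,
       "duration": None,  # null means the whole file is used
       "label": "infer",
       "text": "-",
       "num_speakers": 2,
       "rttm_filepath": "/data/session0.rttm",
       "uem_filepath": None,
   }
   with open("msdd_train_manifest.json", "w") as f:
       f.write(json.dumps(entry) + "\n")  # one JSON object per line
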
Pre-processor Configuration
---------------------------

In the MSDD configuration, the pre-processor configuration follows that of the speaker embedding extractor model.

.. code-block:: yaml

   preprocessor:
     _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
     normalize: "per_feature"
     window_size: 0.025
     sample_rate: ${sample_rate}
     window_stride: 0.01
     window: "hann"
     features: 80
     n_fft: 512
     frame_splicing: 1
     dither: 0.00001

Model Architecture Configurations
---------------------------------

The hyper-parameters for the MSDD model are under the ``msdd_module`` key. The model architecture can be changed by setting ``weighting_scheme`` and ``context_vector_type``. A detailed explanation of the architecture can be found on the :doc:`Models <./models>` page.

.. code-block:: yaml

   msdd_module:
     _target_: nemo.collections.asr.modules.msdd_diarizer.MSDD_module
     num_spks: ${model.max_num_of_spks} # Number of speakers per model. This is currently fixed at 2.
     hidden_size: 256 # Hidden layer size for linear layers in MSDD module
     num_lstm_layers: 3 # Number of stacked LSTM layers
     dropout_rate: 0.5 # Dropout rate
     cnn_output_ch: 32 # Number of filters in a conv-net layer.
     conv_repeat: 2 # Determines the number of conv-net layers. Should be greater than or equal to 1.
     emb_dim: 192 # Dimension of the speaker embedding vectors
     scale_n: ${model.scale_n} # Number of scales for multiscale segmentation input
     weighting_scheme: 'conv_scale_weight' # Type of weighting algorithm. Options: ('conv_scale_weight', 'attn_scale_weight')
     context_vector_type: 'cos_sim' # Type of context vector. Options: ('cos_sim', 'elem_prod')

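Hydra materializes this section into a PyTorch module through the ``_target_`` path. A minimal sketch, with the interpolations above replaced by literal values:

.. code-block:: python

   from hydra.utils import instantiate
   from omegaconf import OmegaConf

   # Interpolations from the YAML above replaced with literal values.
   cfg = OmegaConf.create({
       "_target_": "nemo.collections.asr.modules.msdd_diarizer.MSDD_module",
       "num_spks": 2,
       "hidden_size": 256,
       "num_lstm_layers": 3,
       "dropout_rate": 0.5,
       "cnn_output_ch": 32,
       "conv_repeat": 2,
       "emb_dim": 192,
       "scale_n": 5,
       "weighting_scheme": "conv_scale_weight",
       "context_vector_type": "cos_sim",
   })
   msdd = instantiate(cfg)  # builds the MSDD_module, a torch.nn.Module
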
Loss Configurations
-------------------

The neural diarizer uses a binary cross-entropy (BCE) loss. A set of weights for the negative (absence of the speaker's speech) and positive (presence of the speaker's speech) classes can be provided to the loss function.

.. code-block:: yaml

   loss:
     _target_: nemo.collections.asr.losses.bce_loss.BCELoss
     weight: null # Weight for binary cross-entropy loss. Either `null` or list type input. (e.g. [0.5,0.5])

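Conceptually, these weights rebalance the positive and negative frame-level targets. The following illustrates weighted BCE in plain PyTorch; it is not NeMo's exact implementation:

.. code-block:: python

   import torch
   import torch.nn.functional as F

   probs = torch.tensor([0.9, 0.2, 0.7])    # predicted speaker-presence probabilities
   targets = torch.tensor([1.0, 0.0, 1.0])  # discretized labels derived from RTTM files
   neg_w, pos_w = 0.5, 0.5                  # class weights as in `weight: [0.5,0.5]`
   w = torch.where(targets > 0.5, torch.tensor(pos_w), torch.tensor(neg_w))
   loss = F.binary_cross_entropy(probs, targets, weight=w)
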
Hydra Configurations for Diarization Inference
==============================================

-In contrast to other ASR related tasks or models in NeMo, speaker diarization supported in NeMo is a modular inference pipeline and training is only required for speaker embedding extractor model. Therefore, the datasets provided in manifest format denote the data that you would like to perform speaker diarization on.
Example configuration files for speaker diarization inference can be found in ``<NeMo_git_root>/examples/speaker_tasks/diarization/conf/inference/``. Choose a yaml file that fits your targeted domain. For example, if you want to diarize audio recordings of telephonic speech, choose ``diar_infer_telephonic.yaml``.

The configurations for all the components of diarization inference are included in a single file named ``diar_infer_<domain>.yaml``. Each ``.yaml`` file has a few different sections for the following modules: VAD, Speaker Embedding, Clustering and ASR.

In speaker diarization inference, the datasets provided in manifest format denote the data that you would like to perform speaker diarization on.
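
Once a config file is chosen and the manifest is prepared, inference can also be launched from Python. A minimal sketch for the clustering diarizer (paths are placeholders; ``ClusteringDiarizer`` is the model class documented in the API section):

.. code-block:: python

   from omegaconf import OmegaConf
   from nemo.collections.asr.models import ClusteringDiarizer

   cfg = OmegaConf.load("examples/speaker_tasks/diarization/conf/inference/diar_infer_telephonic.yaml")
   cfg.diarizer.manifest_filepath = "input_manifest.json"  # placeholder: your manifest
   cfg.diarizer.out_dir = "diarization_results"            # predicted RTTM files are written here

   sd_model = ClusteringDiarizer(cfg=cfg)
   sd_model.diarize()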

Diarizer Configurations
-----------------------

@@ -38,7 +164,7 @@ An example ``diarizer`` Hydra configuration could look like:
Under the ``diarizer`` key, the ``vad``, ``speaker_embeddings``, ``clustering``, and ``asr`` keys contain configurations for the inference of the corresponding modules.

Configurations for Voice Activity Detector
------------------------------------------

Parameters for the VAD model are provided in the following Hydra config example.

@@ -62,7 +188,7 @@ Parameters for VAD model are provided as in the following Hydra config example.
filter_speech_first: True
Configurations for Speaker Embedding in Diarization
---------------------------------------------------

Parameters for the speaker embedding model are provided in the following Hydra config example. Note that the multiscale parameters accept either a list or a single floating-point number.

@@ -77,7 +203,7 @@ Parameters for speaker embedding model are provided in the following Hydra config example.
save_embeddings: False # Save embeddings as pickle file for each audio input.
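
The three multiscale lists must line up one-to-one. A quick consistency check that restates the rule above (a sketch, not NeMo code):

.. code-block:: python

   window_length_in_sec = [1.5, 1.25, 1.0, 0.75, 0.5]
   shift_length_in_sec = [0.75, 0.625, 0.5, 0.375, 0.25]
   multiscale_weights = [1, 1, 1, 1, 1]

   # Each scale needs a window length, a shift length, and a weight.
   assert len(window_length_in_sec) == len(shift_length_in_sec) == len(multiscale_weights)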
Configurations for Clustering in Diarization
--------------------------------------------

Parameters for the clustering algorithm are provided in the following Hydra config example.

@@ -92,7 +218,7 @@ Parameters for clustering algorithm are provided in the following Hydra config example.
sparse_search_volume: 30 # The higher the number, the more values are examined, at the cost of longer runtime.
Configurations for Diarization with ASR
---------------------------------------

The following configuration needs to be appended under ``diarizer`` to run ASR together with diarization and obtain a transcription with speaker labels.

@@ -124,4 +250,3 @@ The following configuration needs to be appended under ``diarizer`` to run ASR with diarization to get a transcription with speaker labels.
min_number_of_words: 3 # Min number of words for the left context.
max_number_of_words: 10 # Max number of words for the right context.
logprob_diff_threshold: 1.2 # The threshold for the difference between two log probability values from two hypotheses.
@@ -4,3 +4,4 @@ vad_marblenet,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/n
vad_telephony_marblenet,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:vad_telephony_marblenet"
titanet_large,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:titanet_large"
ecapa_tdnn,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn"
diar_msdd_telephonic,EncDecDiarLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:diar_msdd_telephonic"