update speaker docs #4164

Merged (3 commits) on May 13, 2022
10 changes: 5 additions & 5 deletions docs/source/asr/speaker_diarization/datasets.rst
@@ -14,11 +14,11 @@ Diarization inference is based on Hydra configurations which are fulfilled by ``

{"audio_filepath": "/path/to/abcd.wav", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": null, "rttm_filepath": "/path/to/rttm/abcd.rttm", "uem_filepath": "/path/to/uem/abcd.uem"}
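
Such a manifest is plain JSON Lines and can be generated programmatically. A minimal sketch, assuming hypothetical file paths:

```python
import json

# Hypothetical inputs: one diarization session with known audio/RTTM/UEM paths.
sessions = [
    {
        "audio_filepath": "/path/to/abcd.wav",
        "offset": 0,
        "duration": None,      # null -> use the whole file
        "label": "infer",
        "text": "-",
        "num_speakers": None,  # null -> let the diarizer estimate the count
        "rttm_filepath": "/path/to/rttm/abcd.rttm",
        "uem_filepath": "/path/to/uem/abcd.uem",
    }
]

with open("input_manifest.json", "w") as f:
    for entry in sessions:
        f.write(json.dumps(entry) + "\n")
```

Each key mirrors the fields shown above; `None` is serialized as JSON `null` for fields left to the diarizer to infer.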

In each line of the input manifest file, ``audio_filepath`` item is mandatory while the rest of the items are optional and can be passed for desired diarization setting. We refer to this file as a manifest file. This manifest file can be created by using the script in ``<NeMo_git_root>/scripts/speaker_tasks/pathsfiles_to_manifest.py``. The following example shows how to run ``pathsfiles_to_manifest.py`` by providing path list files.
In each line of the input manifest file, ``audio_filepath`` item is mandatory while the rest of the items are optional and can be passed for desired diarization setting. We refer to this file as a manifest file. This manifest file can be created by using the script in ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``. The following example shows how to run ``pathfiles_to_diarize_manifest.py`` by providing path list files.

.. code-block:: bash

python pathsfiles_to_manifest.py --paths2audio_files /path/to/audio_file_path_list.txt \
python pathfiles_to_diarize_manifest.py --paths2audio_files /path/to/audio_file_path_list.txt \
--paths2txt_files /path/to/transcript_file_path_list.txt \
--paths2rttm_files /path/to/rttm_file_path_list.txt \
--paths2uem_files /path/to/uem_file_path_list.txt \
@@ -40,7 +40,7 @@ The ``--paths2audio_files`` and ``--manifest_filepath`` are required arguments.
/path/to/abcd02.rttm


The path list files containing the absolute paths to these WAV, RTTM, TXT, CTM and UEM files should be provided as in the above example. ``pathsfiles_to_manifest.py`` script will match each file using the unique filename (e.g. ``abcd``). Finally, the absolute path of the created manifest file should be provided through Hydra configuration as shown below:
The path list files containing the absolute paths to these WAV, RTTM, TXT, CTM and UEM files should be provided as in the above example. The ``pathfiles_to_diarize_manifest.py`` script will match each file using the unique filename (e.g. ``abcd``). Finally, the absolute path of the created manifest file should be provided through Hydra configuration as shown below:

.. code-block:: yaml

@@ -127,7 +127,7 @@ To evaluate the performance on AMI Meeting Corpus, the following instructions ca
- Download AMI Meeting Corpus from `AMI website <https://groups.inf.ed.ac.uk/ami/corpus/>`_. Choose ``Headset mix`` which has a mono wav file for each session.
- Download the test set (whitelist) from `Pyannotate AMI test set whitelist <https://raw.githubusercontent.com/pyannote/pyannote-audio/master/tutorials/data_preparation/AMI/MixHeadset.test.lst>`_.
- The merged RTTM file for AMI test set can be downloaded from `Pyannotate AMI test set RTTM file <https://raw.githubusercontent.com/pyannote/pyannote-audio/master/tutorials/data_preparation/AMI/MixHeadset.test.rttm>`_. Note that this file should be split into individual rttm files. Download split rttm files for AMI test set from `AMI test set split RTTM files <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/AMI_corpus/test/split_rttms.tar.gz>`_.
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathsfiles_to_manifest.py``
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``
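
The filename-based matching the script relies on can be sketched as below; this illustrates the idea with hypothetical paths rather than reproducing the script's actual code:

```python
import os

# Hypothetical path lists, as read from the *_file_path_list.txt files.
wav_paths = ["/data/abcd01.wav", "/data/abcd02.wav"]
rttm_paths = ["/labels/abcd02.rttm", "/labels/abcd01.rttm"]

def by_stem(paths):
    # Key each file by its unique base filename, e.g. "abcd01".
    return {os.path.splitext(os.path.basename(p))[0]: p for p in paths}

wavs, rttms = by_stem(wav_paths), by_stem(rttm_paths)
# Pair each audio file with its RTTM by shared stem; missing RTTMs become None.
pairs = {stem: (wavs[stem], rttms.get(stem)) for stem in wavs}
```

Ordering in the list files does not matter, since matching is done on the shared stem.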


CallHome American English Speech (CHAES), LDC97S42
@@ -154,5 +154,5 @@ To evaluate the performance on AMI Meeting Corpus, the following instructions ca
- Download CHAES Meeting Corpus at LDC website `LDC97S42 <https://catalog.ldc.upenn.edu/LDC97S42>`_ (CHAES is not publicly available).
- Download the CH109 filename list (whitelist) from `CH109 whitelist <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/CH109/ch109_whitelist.txt>`_.
- Download RTTM files for CH109 set from `CH109 RTTM files <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/CH109/split_rttms.tar.gz>`_.
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathsfiles_to_manifest.py``
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``

34 changes: 17 additions & 17 deletions docs/source/asr/speaker_recognition/datasets.rst
@@ -24,35 +24,35 @@ After download and conversion, your `data` folder should contain directories wit
All-other Datasets
------------------

These methods can be applied to any dataset to get similar training manifest files.
These methods can be applied to any dataset to get similar training or inference manifest files.

First we prepare scp file(s) containing absolute paths to all the wav files required for each of the train, dev, and test set. This can be easily prepared by using ``find`` bash command as follows:
The `filelist_to_manifest.py` script in the `$<NeMo_root>/scripts/speaker_tasks/` folder generates a manifest file from a text file containing paths to audio files.

.. code-block:: bash

!find {data_dir}/{train_dir} -iname "*.wav" > data/train_all.scp
!head -n 3 data/train_all.scp
sample `filelist.txt` file contents:

.. code-block:: bash

Based on the created scp file, we use `scp_to_manifest.py` script to convert it to a manifest file. This script takes three optional arguments:
/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav
/data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav
/data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav

* id: This value is used to assign speaker label to each audio file. This is the field number separated by `/` from the audio file path. For example if all audio file paths follow the convention of path/to/speaker_folder/unique_speaker_label/file_name.wav, by picking `id=3 or id=-2` script picks unique_speaker_label as label for that utterance.
* split: Optional argument to split the manifest in to train and dev json files
* create_chunks: Optional argument to randomly spit each audio file in to chunks of 1.5 sec, 2 sec and 3 sec for robust training of speaker embedding extractor model.
This list file is used to generate the manifest file. The script has optional arguments to split the whole manifest file into train and dev sets, and also to segment audio files into smaller segments for robust training (for testing, we don't need to create segments for each utterance).

sample usage:

After the download and conversion, your data folder should contain directories with manifest files as:

* `data/<path>/train.json`
* `data/<path>/dev.json`
* `data/<path>/train_all.json`
.. code-block:: bash

Each line in the manifest file describes a training sample - audio_filepath contains the path to the wav file, duration it's duration in seconds, and label is the speaker class label:
python filelist_to_manifest.py --filelist=filelist.txt --id=-3 --out=speaker_manifest.json

This would create a manifest containing file contents as shown below:
.. code-block:: json

{"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": 3.9, "label": "speaker_id"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav", "offset": 0, "duration": 4.16, "label": "id00179"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav", "offset": 0, "duration": 12.288, "label": "id00806"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav", "offset": 0, "duration": 4.608, "label": "id01510"}
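
The effect of `--id=-3` on these VoxCeleb-style paths can be checked directly; the index counts `/`-separated path components, here from the end:

```python
# --id=-3 picks the third field from the end of each path,
# which in the VoxCeleb layout is the speaker ID directory.
path = "/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav"
label = path.split("/")[-3]
```

For a different directory layout, pick the index whose component is the speaker label.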

For other optional arguments, like splitting the manifest file into train and dev sets or creating segments from each utterance, refer to the arguments described in the script.
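
What `--split` does can be approximated by a shuffle-and-slice over the manifest entries; a simplified sketch (the 90/10 ratio, seed, and entries here are illustrative, not the script's exact behaviour):

```python
import json
import random

# Hypothetical manifest entries; in practice these come from the generated manifest.
lines = [
    {"audio_filepath": f"/data/spk{i % 3}/utt{i}.wav", "offset": 0,
     "duration": 3.0, "label": f"spk{i % 3}"}
    for i in range(20)
]

random.seed(42)
random.shuffle(lines)
cut = int(0.9 * len(lines))          # 90% train, 10% dev
train, dev = lines[:cut], lines[cut:]

for name, subset in [("train.json", train), ("dev.json", dev)]:
    with open(name, "w") as f:
        for entry in subset:
            f.write(json.dumps(entry) + "\n")
```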

Tarred Datasets
---------------
14 changes: 7 additions & 7 deletions examples/speaker_tasks/recognition/README.md
@@ -48,8 +48,8 @@ We first generate manifest file to get embeddings. The embeddings are then used

```bash
# create list of files from voxceleb1 test folder (40 speaker test set)
find <path/to/voxceleb1_test/directory/> -iname '*.wav' > voxceleb1_test_files.scp
python <NeMo_root>/scripts/speaker_tasks/scp_to_manifest.py --scp voxceleb1_test_files.scp --id -3 --out voxceleb1_test_manifest.json
find <path/to/voxceleb1_test/directory/> -iname '*.wav' > voxceleb1_test_files.txt
python <NeMo_root>/scripts/speaker_tasks/filelist_to_manifest.py --filelist voxceleb1_test_files.txt --id -3 --out voxceleb1_test_manifest.json
```
### Embedding Extraction
Now using the manifest file created, we can extract embeddings to `data` folder using:
@@ -92,14 +92,14 @@ ffmpeg -v 8 -i </path/to/m4a/file> -f wav -acodec pcm_s16le <path/to/wav/file>

Generate a list file that contains paths to all the dev audio files from voxceleb1 and voxceleb2 using find command as shown below:
```bash
find <path/to/voxceleb1/dev/folder/> -iname '*.wav' > voxceleb1_dev.scp
find <path/to/voxceleb2/dev/folder/> -iname '*.wav' > voxceleb2_dev.scp
cat voxceleb1_dev.scp voxceleb2_dev.scp > voxceleb12.scp
find <path/to/voxceleb1/dev/folder/> -iname '*.wav' > voxceleb1_dev.txt
find <path/to/voxceleb2/dev/folder/> -iname '*.wav' > voxceleb2_dev.txt
cat voxceleb1_dev.txt voxceleb2_dev.txt > voxceleb12.txt
```

This list file is now used to generate training and validation manifest files using a script provided in `<NeMo_root>/scripts/speaker_tasks/`. This script has optional arguments to split the whole manifest file in to train and dev and also chunk audio files to smaller chunks for robust training (for testing, we don't need this).
This list file is now used to generate training and validation manifest files using a script provided in `<NeMo_root>/scripts/speaker_tasks/`. This script has optional arguments to split the whole manifest file into train and dev sets, and also to chunk audio files into smaller segments for robust training (for testing, we don't need this).

```bash
python <NeMo_root>/scripts/speaker_tasks/scp_to_manifest.py --scp voxceleb12.scp --id -3 --out voxceleb12_manifest.json --split --create_chunks
python <NeMo_root>/scripts/speaker_tasks/filelist_to_manifest.py --filelist voxceleb12.txt --id -3 --out voxceleb12_manifest.json --split --create_segments
```
This creates `train.json, dev.json` in the current working directory.
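
A quick sanity check on a generated manifest is to count utterances per speaker; a self-contained sketch with stand-in manifest lines:

```python
import json
from collections import Counter

# Sample JSON-lines entries standing in for the contents of train.json.
sample = [
    '{"audio_filepath": "/data/id00017/a/001.wav", "duration": 3.0, "label": "id00017"}',
    '{"audio_filepath": "/data/id00017/a/002.wav", "duration": 2.5, "label": "id00017"}',
    '{"audio_filepath": "/data/id00025/b/001.wav", "duration": 4.1, "label": "id00025"}',
]

# Count utterances per speaker label.
counts = Counter(json.loads(line)["label"] for line in sample)
print(counts)
```

Unexpectedly low per-speaker counts usually point at a wrong `--id` index.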
4 changes: 2 additions & 2 deletions scripts/dataset_processing/get_hi-mia_data.py
@@ -135,7 +135,7 @@ def __process_data(data_folder: str, data_set: str):

"""
fullpath = os.path.abspath(data_folder)
scp = glob(fullpath + "/**/*.wav", recursive=True)
filelist = glob(fullpath + "/**/*.wav", recursive=True)
out = os.path.join(fullpath, data_set + "_all.json")
utt2spk = os.path.join(fullpath, "utt2spk")
utt2spk_file = open(utt2spk, "w")
@@ -152,7 +152,7 @@ def __process_data(data_folder: str, data_set: str):
speakers = []
lines = []
with open(out, "w") as outfile:
for line in tqdm(scp):
for line in tqdm(filelist):
line = line.strip()
y, sr = l.load(line, sr=None)
if sr != 16000:
38 changes: 19 additions & 19 deletions scripts/speaker_tasks/filelist_to_manifest.py
@@ -30,21 +30,21 @@
This script converts a filelist file where each line contains
<absolute path of wav file> to a manifest json file.
Optionally post processes the manifest file to create dev and train split for speaker embedding
training, also optionally chunk an audio file in to segments of random DURATIONS and create those
training, also optionally segment an audio file into segments of random DURATIONS and create those
wav files in CWD.

While creating chunks, if audio is not sampled at 16Khz, it resamples to 16Khz and write the wav file.
While creating segments, if audio is not sampled at 16kHz, it resamples to 16kHz and writes the wav file.
Args:
--filelist: path to file containing list of audio files
--manifest(optional): if you already have manifest file, but would like to process it for creating chunks and splitting then use manifest ignoring filelist
--manifest(optional): if you already have manifest file, but would like to process it for creating
segments and splitting then use manifest ignoring filelist
--id: index of speaker label in filename present in filelist file that is separated by '/'
--out: output manifest file name
--split: if you would want to split the manifest file for training purposes
you may not need this for test set. output file names is <out>_<train/dev>.json
Defaults to False
--create_chunks:if you would want to chunk each manifest line to chunks of 4 sec or less
you may not need this for test set, Defaults to False
--min_spkrs_count: min number of samples per speaker to consider and ignore otherwise
you may not need this for the test set. Output file names are <out>_<train/dev>.json, defaults to False
--create_segments: if you would want to segment each manifest line to segments of [1,2,3,4] sec or less
you may not need this for test set, defaults to False
--min_spkrs_count: min number of samples per speaker to consider and ignore otherwise, defaults to 0 (all speakers)
"""

DURATIONS = sorted([1, 2, 3, 4], reverse=True)
@@ -60,7 +60,7 @@ def filter_manifest_line(manifest_line):
dur = manifest_line['duration']
label = manifest_line['label']
endname = os.path.splitext(audio_path.split(label, 1)[-1])[0]
to_path = os.path.join(CWD, 'chunks', label)
to_path = os.path.join(CWD, 'segments', label)
to_path = os.path.join(to_path, endname[1:])
os.makedirs(os.path.dirname(to_path), exist_ok=True)

@@ -87,8 +87,8 @@

c_start = int(float(start * sr))
c_end = c_start + int(float(temp_dur * sr))
chunk = signal[c_start:c_end]
sf.write(to_file, chunk, sr)
segment = signal[c_start:c_end]
sf.write(to_file, segment, sr)

meta = manifest_line.copy()
meta['audio_filepath'] = to_file
@@ -172,7 +172,7 @@ def get_labels(lines):
return labels


def main(filelist, manifest, id, out, split=False, create_chunks=False, min_count=10):
def main(filelist, manifest, id, out, split=False, create_segments=False, min_count=10):
if os.path.exists(out):
os.remove(out)
if filelist:
@@ -185,8 +185,8 @@ def main(filelist, manifest, id, out, split=False, create_chunks=False, min_coun

lines = process_map(get_duration, lines, chunksize=100)

if create_chunks:
print(f"creating and writing chunks to {CWD}")
if create_segments:
print(f"creating and writing segments to {CWD}")
lines = process_map(filter_manifest_line, lines, chunksize=100)
temp = []
for line in lines:
@@ -197,7 +197,7 @@ def main(filelist, manifest, id, out, split=False, create_chunks=False, min_coun
speakers = [x['label'] for x in lines]

if min_count:
speakers, lines = count_and_consider_only(speakers, lines, min_count)
speakers, lines = count_and_consider_only(speakers, lines, abs(min_count))

write_file(out, lines, range(len(lines)))
path = os.path.dirname(out)
@@ -232,20 +232,20 @@ def main(filelist, manifest, id, out, split=False, create_chunks=False, min_coun
action='store_true',
)
parser.add_argument(
"--create_chunks",
help="bool if you would want to chunk each manifest line to chunks of 4 sec or less",
"--create_segments",
help="bool if you would want to segment each manifest line to segments of 4 sec or less",
required=False,
action='store_true',
)
parser.add_argument(
"--min_spkrs_count",
default=10,
default=0,
type=int,
help="min number of samples per speaker to consider and ignore otherwise",
)

args = parser.parse_args()

main(
args.filelist, args.manifest, args.id, args.out, args.split, args.create_chunks, args.min_spkrs_count,
args.filelist, args.manifest, args.id, args.out, args.split, args.create_segments, args.min_spkrs_count,
)
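
The segment creation in `filter_manifest_line` draws segment lengths from the fixed set `DURATIONS = [4, 3, 2, 1]`. A simplified, deterministic sketch of carving an utterance into such segments (the real script randomizes durations and resamples audio; this only illustrates the offset arithmetic):

```python
def segment_offsets(total_dur, durations=(4, 3, 2, 1), min_dur=0.5):
    """Yield (start, duration) pairs covering an utterance of total_dur seconds."""
    start, remaining = 0.0, total_dur
    while remaining >= min_dur:
        for d in durations:
            if remaining >= d:
                # Carve off the largest listed duration that still fits.
                yield start, d
                start, remaining = start + d, remaining - d
                break
        else:
            # Remainder is shorter than every listed duration but still usable.
            yield start, remaining
            break

segs = list(segment_offsets(9.7))  # segments covering the full 9.7 s
```

Segments shorter than `min_dur` are dropped, which is why very short utterances can disappear from a segmented manifest.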
4 changes: 2 additions & 2 deletions tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb
@@ -235,7 +235,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets create a manifest file with the an4 audio and rttm available. If you have more than one file you may also use the script `NeMo/scripts/speaker_tasks/pathsfiles_to_manifest.py` to generate a manifest file from a list of audio files. In addition, you can optionally include rttm files to evaluate the diarization results."
"Let's create a manifest file with the an4 audio and rttm available. If you have more than one file you may also use the script `NeMo/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py` to generate a manifest file from a list of audio files. In addition, you can optionally include rttm files to evaluate the diarization results."
]
},
{
@@ -663,4 +663,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}
4 changes: 2 additions & 2 deletions tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb
@@ -169,7 +169,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets create manifest with the an4 audio and rttm available. If you have more than one files you may also use the script `pathsfiles_to_manifest.py` to generate manifest file from list of audio files and optionally rttm files "
"Let's create a manifest with the an4 audio and rttm available. If you have more than one file you may also use the script `pathfiles_to_diarize_manifest.py` to generate a manifest file from a list of audio files and optionally rttm files "
]
},
{
@@ -593,4 +593,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}