update speaker docs (#4164)
* update speaker docs

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* chunks -> segments

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* Khz -> kHz

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
nithinraok authored and ericharper committed May 18, 2022
1 parent 3cec6a6 commit ed1985a
Showing 9 changed files with 65 additions and 65 deletions.
10 changes: 5 additions & 5 deletions docs/source/asr/speaker_diarization/datasets.rst
@@ -14,11 +14,11 @@ Diarization inference is based on Hydra configurations which are fulfilled by ``
{"audio_filepath": "/path/to/abcd.wav", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": null, "rttm_filepath": "/path/to/rttm/abcd.rttm", "uem_filepath": "/path/to/uem/abcd.uem"}
In each line of the input manifest file, ``audio_filepath`` item is mandatory while the rest of the items are optional and can be passed for desired diarization setting. We refer to this file as a manifest file. This manifest file can be created by using the script in ``<NeMo_git_root>/scripts/speaker_tasks/pathsfiles_to_manifest.py``. The following example shows how to run ``pathsfiles_to_manifest.py`` by providing path list files.
In each line of the input manifest file, the ``audio_filepath`` item is mandatory, while the rest of the items are optional and can be passed for the desired diarization setting. We refer to this file as a manifest file. This manifest file can be created by using the script in ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``. The following example shows how to run ``pathfiles_to_diarize_manifest.py`` by providing path list files.

.. code-block:: bash
python pathsfiles_to_manifest.py --paths2audio_files /path/to/audio_file_path_list.txt \
python pathfiles_to_diarize_manifest.py --paths2audio_files /path/to/audio_file_path_list.txt \
--paths2txt_files /path/to/transcript_file_path_list.txt \
--paths2rttm_files /path/to/rttm_file_path_list.txt \
--paths2uem_files /path/to/uem_file_path_list.txt \
@@ -40,7 +40,7 @@ The ``--paths2audio_files`` and ``--manifest_filepath`` are required arguments.
/path/to/abcd02.rttm
The path list files containing the absolute paths to these WAV, RTTM, TXT, CTM and UEM files should be provided as in the above example. ``pathsfiles_to_manifest.py`` script will match each file using the unique filename (e.g. ``abcd``). Finally, the absolute path of the created manifest file should be provided through Hydra configuration as shown below:
The path list files containing the absolute paths to these WAV, RTTM, TXT, CTM and UEM files should be provided as in the above example. The ``pathfiles_to_diarize_manifest.py`` script will match each file using the unique filename (e.g. ``abcd``). Finally, the absolute path of the created manifest file should be provided through the Hydra configuration as shown below:

.. code-block:: yaml
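    # Hypothetical sketch (not from the original file): assuming the standard NeMo
    # diarizer config layout, the manifest created above is passed in like this.
    diarizer:
      manifest_filepath: /path/to/created/manifest/file.json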
@@ -127,7 +127,7 @@ To evaluate the performance on AMI Meeting Corpus, the following instructions ca
- Download AMI Meeting Corpus from `AMI website <https://groups.inf.ed.ac.uk/ami/corpus/>`_. Choose ``Headset mix`` which has a mono wav file for each session.
- Download the test set (whitelist) from `pyannote AMI test set whitelist <https://raw.githubusercontent.com/pyannote/pyannote-audio/master/tutorials/data_preparation/AMI/MixHeadset.test.lst>`_.
- The merged RTTM file for the AMI test set can be downloaded from `pyannote AMI test set RTTM file <https://raw.githubusercontent.com/pyannote/pyannote-audio/master/tutorials/data_preparation/AMI/MixHeadset.test.rttm>`_. Note that this file should be split into individual rttm files. Download split rttm files for the AMI test set from `AMI test set split RTTM files <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/AMI_corpus/test/split_rttms.tar.gz>`_.
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathsfiles_to_manifest.py``
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``
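As a concrete sketch of that last step, the command might look like the following; the list-file and output paths are placeholders, assuming the AMI audio and RTTM paths were written to text files as described earlier.

.. code-block:: bash

    python <NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py \
        --paths2audio_files /path/to/AMI_audio_path_list.txt \
        --paths2rttm_files /path/to/AMI_rttm_path_list.txt \
        --manifest_filepath /path/to/AMI_test_manifest.json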


CallHome American English Speech (CHAES), LDC97S42
@@ -154,5 +154,5 @@ To evaluate the performance on AMI Meeting Corpus, the following instructions ca
- Download CHAES Meeting Corpus at LDC website `LDC97S42 <https://catalog.ldc.upenn.edu/LDC97S42>`_ (CHAES is not publicly available).
- Download the CH109 filename list (whitelist) from `CH109 whitelist <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/CH109/ch109_whitelist.txt>`_.
- Download RTTM files for CH109 set from `CH109 RTTM files <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/CH109/split_rttms.tar.gz>`_.
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathsfiles_to_manifest.py``
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``

34 changes: 17 additions & 17 deletions docs/source/asr/speaker_recognition/datasets.rst
@@ -24,35 +24,35 @@ After download and conversion, your `data` folder should contain directories wit
All-other Datasets
------------------

These methods can be applied to any dataset to get similar training manifest files.
These methods can be applied to any dataset to get similar training or inference manifest files.

First we prepare scp file(s) containing absolute paths to all the wav files required for each of the train, dev, and test set. This can be easily prepared by using ``find`` bash command as follows:
The `filelist_to_manifest.py` script in the `<NeMo_root>/scripts/speaker_tasks/` folder generates a manifest file from a text file containing paths to audio files.

.. code-block:: bash
!find {data_dir}/{train_dir} -iname "*.wav" > data/train_all.scp
!head -n 3 data/train_all.scp
sample `filelist.txt` file contents:

.. code-block:: bash
Based on the created scp file, we use `scp_to_manifest.py` script to convert it to a manifest file. This script takes three optional arguments:
/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav
/data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav
/data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav
* id: This value is used to assign speaker label to each audio file. This is the field number separated by `/` from the audio file path. For example if all audio file paths follow the convention of path/to/speaker_folder/unique_speaker_label/file_name.wav, by picking `id=3 or id=-2` script picks unique_speaker_label as label for that utterance.
* split: Optional argument to split the manifest in to train and dev json files
* create_chunks: Optional argument to randomly spit each audio file in to chunks of 1.5 sec, 2 sec and 3 sec for robust training of speaker embedding extractor model.
This list file is used to generate the manifest file. The script has optional arguments to split the whole manifest file into train and dev sets, and also to segment audio files into smaller segments for robust training (for testing, we don't need to create segments for each utterance).

sample usage:

After the download and conversion, your data folder should contain directories with manifest files as:

* `data/<path>/train.json`
* `data/<path>/dev.json`
* `data/<path>/train_all.json`
.. code-block:: bash
Each line in the manifest file describes a training sample - audio_filepath contains the path to the wav file, duration it's duration in seconds, and label is the speaker class label:
python filelist_to_manifest.py --filelist=filelist.txt --id=-3 --out=speaker_manifest.json
This would create a manifest file with contents as shown below:
.. code-block:: json
{"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": 3.9, "label": "speaker_id"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav", "offset": 0, "duration": 4.16, "label": "id00179"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav", "offset": 0, "duration": 12.288, "label": "id00806"}
{"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav", "offset": 0, "duration": 4.608, "label": "id01510"}
For other optional arguments, such as splitting the manifest file into train and dev sets or creating segments from each utterance, refer to the arguments
described in the script.
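For example, a training-time invocation that both splits the manifest and creates shorter segments might look like the following sketch (file names are placeholders):

.. code-block:: bash

    python filelist_to_manifest.py --filelist=filelist.txt --id=-3 --out=speaker_manifest.json \
        --split --create_segments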

Tarred Datasets
---------------
14 changes: 7 additions & 7 deletions examples/speaker_tasks/recognition/README.md
@@ -48,8 +48,8 @@ We first generate manifest file to get embeddings. The embeddings are then used

```bash
# create list of files from voxceleb1 test folder (40 speaker test set)
find <path/to/voxceleb1_test/directory/> -iname '*.wav' > voxceleb1_test_files.scp
python <NeMo_root>/scripts/speaker_tasks/scp_to_manifest.py --scp voxceleb1_test_files.scp --id -3 --out voxceleb1_test_manifest.json
find <path/to/voxceleb1_test/directory/> -iname '*.wav' > voxceleb1_test_files.txt
python <NeMo_root>/scripts/speaker_tasks/filelist_to_manifest.py --filelist voxceleb1_test_files.txt --id -3 --out voxceleb1_test_manifest.json
```
### Embedding Extraction
Now using the manifest file created, we can extract embeddings to `data` folder using:
@@ -92,14 +92,14 @@ ffmpeg -v 8 -i </path/to/m4a/file> -f wav -acodec pcm_s16le <path/to/wav/file>

Generate a list file that contains paths to all the dev audio files from voxceleb1 and voxceleb2 using find command as shown below:
```bash
find <path/to/voxceleb1/dev/folder/> -iname '*.wav' > voxceleb1_dev.scp
find <path/to/voxceleb2/dev/folder/> -iname '*.wav' > voxceleb2_dev.scp
cat voxceleb1_dev.scp voxceleb2_dev.scp > voxceleb12.scp
find <path/to/voxceleb1/dev/folder/> -iname '*.wav' > voxceleb1_dev.txt
find <path/to/voxceleb2/dev/folder/> -iname '*.wav' > voxceleb2_dev.txt
cat voxceleb1_dev.txt voxceleb2_dev.txt > voxceleb12.txt
```

This list file is now used to generate training and validation manifest files using a script provided in `<NeMo_root>/scripts/speaker_tasks/`. This script has optional arguments to split the whole manifest file in to train and dev and also chunk audio files to smaller chunks for robust training (for testing, we don't need this).
This list file is now used to generate training and validation manifest files using a script provided in `<NeMo_root>/scripts/speaker_tasks/`. This script has optional arguments to split the whole manifest file into train and dev, and also to segment audio files into smaller segments for robust training (for testing, we don't need this).

```bash
python <NeMo_root>/scripts/speaker_tasks/scp_to_manifest.py --scp voxceleb12.scp --id -3 --out voxceleb12_manifest.json --split --create_chunks
python <NeMo_root>/scripts/speaker_tasks/filelist_to_manifest.py --filelist voxceleb12.txt --id -3 --out voxceleb12_manifest.json --split --create_segments
```
This creates `train.json, dev.json` in the current working directory.
4 changes: 2 additions & 2 deletions scripts/dataset_processing/get_hi-mia_data.py
@@ -135,7 +135,7 @@ def __process_data(data_folder: str, data_set: str):
"""
fullpath = os.path.abspath(data_folder)
scp = glob(fullpath + "/**/*.wav", recursive=True)
filelist = glob(fullpath + "/**/*.wav", recursive=True)
out = os.path.join(fullpath, data_set + "_all.json")
utt2spk = os.path.join(fullpath, "utt2spk")
utt2spk_file = open(utt2spk, "w")
@@ -152,7 +152,7 @@ def __process_data(data_folder: str, data_set: str):
speakers = []
lines = []
with open(out, "w") as outfile:
for line in tqdm(scp):
for line in tqdm(filelist):
line = line.strip()
y, sr = l.load(line, sr=None)
if sr != 16000:
38 changes: 19 additions & 19 deletions scripts/speaker_tasks/filelist_to_manifest.py
@@ -30,21 +30,21 @@
This script converts a filelist file where each line contains
<absolute path of wav file> to a manifest json file.
Optionally post processes the manifest file to create dev and train split for speaker embedding
training, also optionally chunk an audio file in to segments of random DURATIONS and create those
training, also optionally segment an audio file into segments of random DURATIONS and create those
wav files in CWD.
While creating chunks, if audio is not sampled at 16Khz, it resamples to 16Khz and write the wav file.
While creating segments, if audio is not sampled at 16kHz, it resamples to 16kHz and writes the wav file.
Args:
--filelist: path to file containing list of audio files
--manifest(optional): if you already have manifest file, but would like to process it for creating chunks and splitting then use manifest ignoring filelist
--manifest(optional): if you already have a manifest file, but would like to process it for creating
segments and splitting, then use the manifest, ignoring the filelist
--id: index of speaker label in filename present in filelist file that is separated by '/'
--out: output manifest file name
--split: if you would want to split the manifest file for training purposes
you may not need this for test set. output file names is <out>_<train/dev>.json
Defaults to False
--create_chunks:if you would want to chunk each manifest line to chunks of 4 sec or less
you may not need this for test set, Defaults to False
--min_spkrs_count: min number of samples per speaker to consider and ignore otherwise
you may not need this for the test set. Output file names are <out>_<train/dev>.json; defaults to False
--create_segments: if you want to segment each manifest line into segments of [1,2,3,4] sec or less;
you may not need this for the test set; defaults to False
--min_spkrs_count: min number of samples per speaker to consider; speakers with fewer samples are ignored; defaults to 0 (all speakers)
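Example usage (illustrative sketch; file names are placeholders):
    python filelist_to_manifest.py --filelist=filelist.txt --id=-3 --out=manifest.json --split --create_segments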
"""

DURATIONS = sorted([1, 2, 3, 4], reverse=True)
@@ -60,7 +60,7 @@ def filter_manifest_line(manifest_line):
dur = manifest_line['duration']
label = manifest_line['label']
endname = os.path.splitext(audio_path.split(label, 1)[-1])[0]
to_path = os.path.join(CWD, 'chunks', label)
to_path = os.path.join(CWD, 'segments', label)
to_path = os.path.join(to_path, endname[1:])
os.makedirs(os.path.dirname(to_path), exist_ok=True)

@@ -87,8 +87,8 @@

c_start = int(float(start * sr))
c_end = c_start + int(float(temp_dur * sr))
chunk = signal[c_start:c_end]
sf.write(to_file, chunk, sr)
segment = signal[c_start:c_end]
sf.write(to_file, segment, sr)

meta = manifest_line.copy()
meta['audio_filepath'] = to_file
@@ -172,7 +172,7 @@ def get_labels(lines):
return labels


def main(filelist, manifest, id, out, split=False, create_chunks=False, min_count=10):
def main(filelist, manifest, id, out, split=False, create_segments=False, min_count=10):
if os.path.exists(out):
os.remove(out)
if filelist:
@@ -185,8 +185,8 @@ def main(filelist, manifest, id, out, split=False, create_chunks=False, min_coun

lines = process_map(get_duration, lines, chunksize=100)

if create_chunks:
print(f"creating and writing chunks to {CWD}")
if create_segments:
print(f"creating and writing segments to {CWD}")
lines = process_map(filter_manifest_line, lines, chunksize=100)
temp = []
for line in lines:
@@ -197,7 +197,7 @@ def main(filelist, manifest, id, out, split=False, create_chunks=False, min_coun
speakers = [x['label'] for x in lines]

if min_count:
speakers, lines = count_and_consider_only(speakers, lines, min_count)
speakers, lines = count_and_consider_only(speakers, lines, abs(min_count))

write_file(out, lines, range(len(lines)))
path = os.path.dirname(out)
@@ -232,20 +232,20 @@ def main(filelist, manifest, id, out, split=False, create_chunks=False, min_coun
action='store_true',
)
parser.add_argument(
"--create_chunks",
help="bool if you would want to chunk each manifest line to chunks of 4 sec or less",
"--create_segments",
help="bool if you would want to segment each manifest line to segments of 4 sec or less",
required=False,
action='store_true',
)
parser.add_argument(
"--min_spkrs_count",
default=10,
default=0,
type=int,
help="min number of samples per speaker to consider and ignore otherwise",
)

args = parser.parse_args()

main(
args.filelist, args.manifest, args.id, args.out, args.split, args.create_chunks, args.min_spkrs_count,
args.filelist, args.manifest, args.id, args.out, args.split, args.create_segments, args.min_spkrs_count,
)
4 changes: 2 additions & 2 deletions tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb
@@ -235,7 +235,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets create a manifest file with the an4 audio and rttm available. If you have more than one file you may also use the script `NeMo/scripts/speaker_tasks/pathsfiles_to_manifest.py` to generate a manifest file from a list of audio files. In addition, you can optionally include rttm files to evaluate the diarization results."
"Lets create a manifest file with the an4 audio and rttm available. If you have more than one file you may also use the script `NeMo/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py` to generate a manifest file from a list of audio files. In addition, you can optionally include rttm files to evaluate the diarization results."
]
},
{
@@ -663,4 +663,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}
4 changes: 2 additions & 2 deletions tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb
@@ -169,7 +169,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets create manifest with the an4 audio and rttm available. If you have more than one files you may also use the script `pathsfiles_to_manifest.py` to generate manifest file from list of audio files and optionally rttm files "
"Lets create manifest with the an4 audio and rttm available. If you have more than one files you may also use the script `pathfiles_to_diarize_manifest.py` to generate manifest file from list of audio files and optionally rttm files "
]
},
{
@@ -593,4 +593,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}