diff --git a/tools/speech_dataset_processor/README.md b/tools/speech_dataset_processor/README.md
index 992c1da656e5..31f22f5d81bf 100644
--- a/tools/speech_dataset_processor/README.md
+++ b/tools/speech_dataset_processor/README.md
@@ -1,7 +1,69 @@
 # Speech Dataset Processor
 
-Toolkit to make it easy to write and share the steps for processing a speech dataset.
+Speech Dataset Processor (SDP) is a toolkit that makes it easy to:
+1. write code to process a new dataset, minimizing the amount of boilerplate code required.
+2. share the steps for processing a speech dataset. Sharing processing steps can be as easy as sharing a YAML file.
 
-This toolkit contains many of the most common speech dataset processing operations. To process a new dataset, you simply need to write a YAML file containing the parameters needed for dataset processing. It is also easy to add your own code for various speech dataset processing steps if needed.
+SDP's philosophy is to represent processing operations as 'processor' classes. Many common processing operations are provided, and it is easy to add your own. In some cases, all you need to do to process a new dataset is write a YAML file containing the parameters needed to process your dataset.
 
-TBD
+SDP is specifically intended for the use case where you have an existing dataset with the audio & text pairs already specified in some form, and you wish to create a JSON manifest suitable for use with NeMo. SDP allows for intermediate cleaning and filtering steps, which involve amending the 'ground truth' `"text"` or dropping utterances which are deemed too inaccurate to train on.
+
+## Quick intro to Speech Dataset Processor
+
+* The steps to process a dataset are specified by a YAML config file.
+* The YAML config file contains a list of processor classes & the args to pass into their constructors.
+* Each processor class takes an existing manifest as input (except for classes which create an 'initial' manifest from some external transcript file) & outputs a modified version of the manifest. It may change other files in the process, e.g. resample audio.
+* To process a manifest, you need to list the chain of processors you wish to use.
+* If a processor you need is not included, you can make your own (see the sketch at the end of this README).
+
+## YAML config file layout
+A simplified version of an SDP config file looks like this:
+
+```yaml
+processors:
+
+  # use existing classes for popular datasets or make your own class
+  - _target_: sdp.processors.CreateInitialManifestMLS
+    output_manifest_file: ...
+    download_dir: ...
+    ...
+
+  # use existing classes for common operations or write your own
+  - _target_: sdp.processors.SubSubstringToSubstring
+
+    substring_pairs: {
+      # specify the parameters needed for your use case
+      " mr ": " mister ",
+      " misteak ": " mistake ",
+      ...
+    }
+
+  - _target_: sdp.processors.DropNonAlphabet
+    alphabet: " abcdefghijklmnopqrstuvwxyz"
+    output_manifest_file: ...
+    ...
+```
+
+## Existing processor classes
+In addition to those mentioned in the example config file, many more classes are already included in Speech Dataset Processor, for example:
+* `sdp.processors.ASRInference` will run inference on the manifest using a specified `pretrained_model`.
+* `sdp.processors.DropHighWER` will compute the WER between `text` and `pred_text` of each utterance and remove the utterance if the WER is greater than the specified `wer_threshold`.
+* `sdp.processors.DropHighLowCharrate` will compute the character rate (characters in `text` divided by `duration`) of each utterance, and drop the utterance if it is outside the bounds of the specified `high_charrate_threshold` and `low_charrate_threshold`. Carefully chosen thresholds allow us to drop utterances with incorrect ground truth `text`.
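+
+For example, a cleaning & filtering stage built from the processors above might be configured as
+follows. This is only an illustrative sketch: the threshold values are placeholders rather than
+recommendations, and `...` marks values you need to fill in for your dataset.
+
+```yaml
+processors:
+  ...
+  # transcribe each utterance with a pretrained NeMo ASR model
+  - _target_: sdp.processors.ASRInference
+    pretrained_model: ...
+    output_manifest_file: ...
+
+  # drop utterances where `text` and `pred_text` disagree too much
+  - _target_: sdp.processors.DropHighWER
+    wer_threshold: 75
+
+  # drop utterances whose character rate suggests incorrect ground truth text
+  - _target_: sdp.processors.DropHighLowCharrate
+    high_charrate_threshold: 21
+    low_charrate_threshold: 1
+  ...
+```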
+
+## Processor test cases
+You can add test cases to verify that you have specified your desired changes correctly and to help document why you are making these changes.
+
+For example:
+```yaml
+processors:
+  ...
+  - _target_: sdp.processors.DropIfRegexInAttribute
+    attribute_to_regex:
+      "text": ["(\\D ){5,20}"] # looks for between 4 and 19 characters surrounded by spaces
+
+    test_cases:
+      - {input: {text: "some s p a c e d out letters"}, output: null}
+      - {input: {text: "normal words only"}, output: {text: "normal words only"}}
+      - {input: {text: "three a b c spaced out letters"}, output: {text: "three a b c spaced out letters"}}
+      - {input: {text: "four a b c d spaced out letters"}, output: null}
+  ...
+```
+
+Note that `{5,20}` repetitions of `(\D )` catch roughly 4 to 19 spaced-out single characters, because the first repetition usually consumes the final letter of the preceding word, so "three a b c" is kept while "four a b c d" is dropped.
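+
+## Creating your own processor class
+If you need a processing step that is not covered by the existing classes, you can subclass one of
+the base classes in `sdp.processors.base_processor`. The sketch below is hypothetical: the class
+`SubTabsToSpaces` is made up for illustration, and the exact signature and return type of
+`process_dataset_entry` may differ from what is shown here. For text-only modifications,
+`ModifyManifestTextProcessor` may be a more suitable base class.
+
+```python
+from sdp.processors.base_processor import BaseParallelProcessor
+
+
+class SubTabsToSpaces(BaseParallelProcessor):  # hypothetical example class
+    """Replaces tab characters in the "text" field with single spaces."""
+
+    def process_dataset_entry(self, data_entry):
+        # data_entry is one utterance's manifest entry; amend its "text" and return it.
+        # Returning the (possibly modified) entry keeps a one-to-one mapping.
+        data_entry["text"] = data_entry["text"].replace("\t", " ")
+        return data_entry
+```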
+ """ + def __init__(self, output_manifest_file, input_manifest_file=None): self.output_manifest_file = output_manifest_file self.input_manifest_file = input_manifest_file @@ -55,13 +66,15 @@ def test(self): class BaseParallelProcessor(BaseProcessor): """ - TBD + Processor class which allows operations on each utterance to be parallelized. Parallelization + is done using tqdm.contrib.concurrent.process_map. - input_manifest_file should always be specified unless it's the first - processor that reads from original dataset representation. + Args: + max_workers: maximum number of workers that will be spawned during parallel processing. + chunksize: the size of the chunks that will be sent to worker processes. """ - def __init__(self, max_workers=-1, chunksize=100, **kwargs): + def __init__(self, max_workers: int = -1, chunksize: int = 100, **kwargs): super().__init__(**kwargs) if max_workers == -1: max_workers = multiprocessing.cpu_count() diff --git a/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py b/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py index 1ff9e914fe1b..97f224cb69de 100644 --- a/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py +++ b/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py @@ -25,8 +25,27 @@ class CreateInitialManifestMLS(BaseParallelProcessor): + """ + Downloads and unzips raw MLS data for the specified language, and creates an initial manifest using + the transcripts provided in the raw data. + + Args: + language: the language of the data you wish to be downloaded. This will be used to format the + URL from which we attempt to download the data. + download_dir: the directory where the downloaded data will be saved. + data_split: the data split for which the initial manifest will be created. + resampled_audio_dir: the directory where the resampled (16kHz) wav files will be stored. + use_test_data: if `True`, will use the test data manifest located at `TEST_DATA_PATH` to carry out tests. + """ + def __init__( - self, language, download_dir, resampled_audio_dir, data_split, use_test_data=False, **kwargs, + self, + language: str, + download_dir: str, + resampled_audio_dir: str, + data_split: str, + use_test_data: bool = False, + **kwargs, ): super().__init__(**kwargs) self.language = language @@ -65,7 +84,7 @@ def read_manifest(self): return dataset_entries - def process_dataset_entry(self, data_entry): + def process_dataset_entry(self, data_entry: str): if len(data_entry.split("\t")) != 2: raise RuntimeError(f"have more than one tab in line {data_entry}") diff --git a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py index 5c1c0d808848..5c8ceefebe8e 100644 --- a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py +++ b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py @@ -23,12 +23,20 @@ class ModifyManifestTextProcessor(BaseParallelProcessor): """Base class useful for most "text-only" modifications of the manifest. - Will add the following functionality: - - Add space in the beginning and end of sentence for easier regex-based + This adds the following functionality on top of BaseParallelProcessor + - Adds space in the beginning and end of sentence for easier regex-based processing. 
     """
 
-    def __init__(self, max_workers=-1, chunksize=100, **kwargs):
+    def __init__(self, max_workers: int = -1, chunksize: int = 100, **kwargs):
         super().__init__(**kwargs)
         if max_workers == -1:
             max_workers = multiprocessing.cpu_count()
diff --git a/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py b/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py
index 1ff9e914fe1b..97f224cb69de 100644
--- a/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py
+++ b/tools/speech_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py
@@ -25,8 +25,27 @@
 
 class CreateInitialManifestMLS(BaseParallelProcessor):
+    """
+    Downloads and unzips raw MLS data for the specified language, and creates an initial manifest using
+    the transcripts provided in the raw data.
+
+    Args:
+        language: the language of the data to be downloaded. This will be used to format the
+            URL from which we attempt to download the data.
+        download_dir: the directory where the downloaded data will be saved.
+        data_split: the data split for which the initial manifest will be created.
+        resampled_audio_dir: the directory where the resampled (16 kHz) wav files will be stored.
+        use_test_data: if `True`, will use the test data manifest located at `TEST_DATA_PATH` to carry out tests.
+    """
+
     def __init__(
-        self, language, download_dir, resampled_audio_dir, data_split, use_test_data=False, **kwargs,
+        self,
+        language: str,
+        download_dir: str,
+        resampled_audio_dir: str,
+        data_split: str,
+        use_test_data: bool = False,
+        **kwargs,
     ):
         super().__init__(**kwargs)
         self.language = language
@@ -65,7 +84,7 @@ def read_manifest(self):
 
         return dataset_entries
 
-    def process_dataset_entry(self, data_entry):
+    def process_dataset_entry(self, data_entry: str):
         if len(data_entry.split("\t")) != 2:
             raise RuntimeError(f"have more than one tab in line {data_entry}")
diff --git a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py
index 5c1c0d808848..5c8ceefebe8e 100644
--- a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py
+++ b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py
@@ -23,12 +23,20 @@
 class ModifyManifestTextProcessor(BaseParallelProcessor):
     """Base class useful for most "text-only" modifications of the manifest.
 
-    Will add the following functionality:
-    - Add space in the beginning and end of sentence for easier regex-based
+    This adds the following functionality on top of BaseParallelProcessor:
+    - Adds a space at the beginning and end of each sentence for easier regex-based
       processing.
     - Automatically handles common test cases by comparing input to output values.
 
+    Args:
+        test_cases: an optional list of dicts containing test cases for checking
+            that the processor makes the changes that we are expecting.
+            The dicts must have a key 'input', the value of which is a dictionary
+            containing data which is our test input manifest line, and a key
+            'output', the value of which is a dictionary containing data which is
+            the expected output manifest line.
+
     .. note::
         This class only supports one-to-one or one-to-none mappings.
     """
diff --git a/tools/speech_dataset_processor/sdp/processors/write_manifest.py b/tools/speech_dataset_processor/sdp/processors/write_manifest.py
index 1f2d3ef12f2b..f601985a1647 100644
--- a/tools/speech_dataset_processor/sdp/processors/write_manifest.py
+++ b/tools/speech_dataset_processor/sdp/processors/write_manifest.py
@@ -13,13 +13,24 @@
 # limitations under the License.
 
 import json
+from typing import List
 
 from sdp.processors.base_processor import BaseProcessor
 from tqdm import tqdm
 
 
 class WriteManifest(BaseProcessor):
-    def __init__(self, output_manifest_file, input_manifest_file, fields_to_save):
+    """
+    Saves a copy of the input manifest, keeping only the fields specified in `fields_to_save`.
+
+    Args:
+        output_manifest_file: path where the output file will be saved.
+        input_manifest_file: path of the input file that will be copied.
+        fields_to_save: list of the fields in the input manifest that we want to copy over.
+            The output file will only contain these fields.
+    """
+
+    def __init__(self, output_manifest_file: str, input_manifest_file: str, fields_to_save: List[str]):
         self.output_manifest_file = output_manifest_file
         self.input_manifest_file = input_manifest_file
         self.fields_to_save = fields_to_save