Online Code Switching Dataset for ASR #6579
Signed-off-by: Daniel Egert <degert@nvidia.com>
for more information, see https://pre-commit.ci
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
Can someone please review this PR for merging? We really need this functionality for Riva work.
…into online_code_switch
Overall, the PR needs some documentation (inline as well as in the NeMo docs) and checks.
I'll leave the core logic of the dataset for others to review.
# import here to avoid circular import error
from nemo.collections.asr.parts.preprocessing import AudioSegment

mb = io.BytesIO()
This will be super slow, right? Would it be better to just not allow augmentation here?
This is the part where we apply the augmentor to the final synthetic utterance only, which is why we always pass Augmentor=None to the individual language sub-datasets. We want every sample from the individual monolingual datasets to be clean; we then build the clean synthetic utterance and apply the augmentation to that final result.
As for the logic, it is all in-memory operations with io.BytesIO; nothing is actually written to disk. I have done several training runs with online CS and augmentation, and it doesn't slow training down by much at all. It is still very much usable.
Sadly, there is no way around it: we need to augment the final, full synthetic sample because many augmentations depend on features of the full waveform to work correctly (for example, the new NormPerturbation).
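The flow described above (clean monolingual samples, one concatenated synthetic utterance, an in-memory round trip, then a single augmentation pass) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: it uses the standard-library `wave` module in place of NeMo's `AudioSegment`, and `peak_normalise` and `build_augmented_synthetic` are hypothetical names, with peak normalisation standing in for any perturbation that needs the full waveform.

```python
import io
import struct
import wave

SAMPLE_RATE = 16000  # assumed sample rate, for illustration only


def to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Encode 16-bit mono PCM samples as a WAV file held entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    buffer.seek(0)  # rewind so the buffer can be read like a file
    return buffer


def from_wav_bytes(buffer):
    """Decode an in-memory 16-bit mono WAV back into a list of samples."""
    with wave.open(buffer, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    return list(struct.unpack(f"<{len(frames) // 2}h", frames))


def peak_normalise(samples, target_peak=0.5):
    """Hypothetical stand-in perturbation: it needs the peak of the WHOLE waveform."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)
    scale = target_peak * 32767 / peak
    return [int(round(s * scale)) for s in samples]


def build_augmented_synthetic(clean_segments, augment=peak_normalise):
    # 1. Concatenate the clean, un-augmented monolingual segments.
    synthetic = [s for segment in clean_segments for s in segment]
    # 2. Round-trip through an in-memory buffer; nothing touches the disk.
    decoded = from_wav_bytes(to_wav_bytes(synthetic))
    # 3. Apply the augmentation once, to the full synthetic utterance.
    return augment(decoded)
```

Augmenting each segment separately would give every segment its own peak, so the result would differ from normalising the concatenated utterance, which is the point made in the reply above.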
for dataset_idx, (tarred_audio_filepath, manifest_filepath) in enumerate(
    zip(tarred_audio_filepaths, manifest_filepaths)
):
    lang = config['code_switch_languages'][dataset_idx]
What happens if a manifest contains LID information and we pass LIDs here as well?
It's not a problem, because LIDs in the manifest always take precedence; you can see that here:
if lang is not None:
The new language parameter is only used as a fallback, if and only if AggTokeniser is in use and the manifest has no LID.
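The precedence rule described in this reply can be sketched as below. This is a minimal illustration, not the code from the PR: `resolve_language` is a hypothetical helper, and the `"lang"` manifest key is an assumption about how the LID is stored in a manifest entry.

```python
from typing import Optional


def resolve_language(manifest_entry: dict, fallback_lang: Optional[str] = None) -> Optional[str]:
    """Return the manifest LID if present, otherwise the dataset-level fallback."""
    manifest_lang = manifest_entry.get("lang")  # assumed key name for the LID
    if manifest_lang is not None:
        # A LID in the manifest always wins over the passed-in language.
        return manifest_lang
    # Only reached when the manifest carries no LID.
    return fallback_lang
```

For example, `resolve_language({"lang": "es"}, "en")` yields the manifest value, while `resolve_language({}, "en")` falls back to the passed-in language.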
Signed-off-by: trias702 <25867060+trias702@users.noreply.github.com>
Thanks @trias702, LGTM!
What does this PR do?
Adds an IterableDataset class which generates code-switched utterances on the fly for ASR.
Collection: ASR
Changelog
nemo/collections/asr/data/audio_to_text.py: added language support to the BPE classes to support code-switching
nemo/collections/asr/data/audio_to_text_dataset.py: added the core create_cs_dataset function and various instantiation logic
nemo/collections/common/data/__init__.py: exposed new CodeSwitchedDataset class
nemo/collections/common/data/dataset.py: contains new CodeSwitchedDataset class
nemo/collections/common/parts/preprocessing/collections.py: added discrete language support of the TokeniserWrapper via a hasattr check
nemo/collections/asr/models/*: changed all of the models to check for IterableDataset instead of reading the is_tarred config item, as I think this makes it more robust
Usage
There are many different permutations to using this new dataset, but a basic example shows how to add it to an existing training config yaml. To understand what each parameter does, please look at the class doc for the CodeSwitchedDataset class in nemo.collections.common.data.dataset.
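As a rough picture of what "on-the-fly code-switched utterances" means, the sketch below stitches samples drawn from several monolingual sources into one synthetic utterance. It is a simplified illustration only, not the CodeSwitchedDataset implementation: `code_switched_stream`, its parameters, and the returned dictionary keys are all hypothetical, and the real class exposes its own parameters, described in its class doc.

```python
import random
from itertools import cycle
from typing import Dict, Iterable, Iterator, List, Tuple

Sample = Tuple[List[float], str]  # (waveform samples, transcript)


def code_switched_stream(
    monolingual: Dict[str, Iterable[Sample]],
    segments_per_utterance: int = 2,
    seed: int = 0,
) -> Iterator[dict]:
    """Endlessly yield synthetic utterances built from monolingual samples."""
    rng = random.Random(seed)
    # Loop over each monolingual source forever so the stream never runs dry.
    streams = {lang: cycle(samples) for lang, samples in monolingual.items()}
    langs = sorted(streams)
    while True:
        audio: List[float] = []
        text: List[str] = []
        used: List[str] = []
        for _ in range(segments_per_utterance):
            lang = rng.choice(langs)  # pick the language of the next segment
            waveform, transcript = next(streams[lang])
            audio.extend(waveform)  # concatenate the audio
            text.append(transcript)  # and the matching transcript
            used.append(lang)
        yield {"audio": audio, "text": " ".join(text), "langs": used}
```

Because the stream is an iterator with no fixed length, it maps naturally onto an iterable-style dataset rather than an indexable one.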