Add Frame-VAD to ASR+VAD pipeline #6464

Merged
merged 33 commits into from
Jun 13, 2023

stevehuang52
Collaborator

What does this PR do?

This is the third PR for Frame-VAD. Please merge the previous two before this: #6441, #6463

This PR adds Frame-VAD to the ASR+VAD pipeline and also adds a drop-frame mode to ASR+VAD, which previously supported only masking mode.
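
To make the difference between the two modes concrete, here is a minimal sketch in plain Python. It is not the code added in this PR: the helper names `mask_frames` and `drop_frames`, the `fill_value` of 0.0, and the toy feature list are all assumptions made for illustration.

```python
from typing import List, Sequence


def mask_frames(
    features: Sequence[Sequence[float]],
    is_speech: Sequence[bool],
    fill_value: float = 0.0,  # hypothetical fill value, not taken from the PR
) -> List[List[float]]:
    """Masking mode: keep the sequence length, overwrite non-speech frames."""
    return [
        list(frame) if keep else [fill_value] * len(frame)
        for frame, keep in zip(features, is_speech)
    ]


def drop_frames(
    features: Sequence[Sequence[float]],
    is_speech: Sequence[bool],
) -> List[List[float]]:
    """Drop-frame mode: remove non-speech frames, so the sequence gets shorter."""
    return [list(frame) for frame, keep in zip(features, is_speech) if keep]


# Toy example: 4 frames with 2 feature dimensions each.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
is_speech = [True, False, False, True]

masked = mask_frames(features, is_speech)
dropped = drop_frames(features, is_speech)
print(len(masked), len(dropped))  # 4 2
```

Masking preserves the time alignment between features and the original audio, while dropping shortens the input that the ASR model has to process.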

Collection: [ASR]

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

stevehuang52 and others added 8 commits April 18, 2023 15:19
Signed-off-by: stevehuang52 <heh@nvidia.com>
stevehuang52 and others added 10 commits April 20, 2023 21:45
Signed-off-by: stevehuang52 <heh@nvidia.com>
@github-actions
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label May 24, 2023
Signed-off-by: stevehuang52 <heh@nvidia.com>
@github-actions github-actions bot removed the common label May 24, 2023
@github-actions github-actions bot removed the stale label May 25, 2023
Signed-off-by: stevehuang52 <heh@nvidia.com>
tango4j
tango4j previously requested changes Jun 2, 2023
Collaborator

@tango4j tango4j left a comment

Some grammar issues and typos to be fixed.

@@ -40,13 +40,15 @@

To enable profiling, set `profiling=True`, but this will significantly slow down the program.

-To use or disable feature masking, set `use_rttm` to `True` or `False`.
+To use or disable feature masking/droping based on RTTM files, set `use_rttm` to `True` or `False`.
Collaborator

droping->dropping

Collaborator Author

fixed


window_length_in_sec (float): Window length in seconds.
shift_length_in_sec (float): Shift length in seconds.
is_regression_task (bool): if True, the labels are treated as regression task.
Collaborator

as regression task -> as a regression task

Collaborator Author

fixed

labels (Optional[list]): List of unique labels collected from all samples.
augmentor (Optional): feature augmentation
delimiter (str): delimiter to split the labels.
is_regression_task (bool): if True, the labels are treated as regression task.
Collaborator

as regression task -> as a regression task

Collaborator Author

fixed

# Output settings, no need to change
output_dir: Optional[str] = None # will be automatically set by the program
output_filename: Optional[str] = None # will be automatically set by the program
pred_name_postfix: Optional[str] = None # If you need to use another model name, rather than standard one.
Collaborator

LLM models are suggesting:

rather than standard one -> other than the standard one.

Collaborator Author

fixed

@gabitza-tech
Contributor

gabitza-tech commented Jun 5, 2023

Hello,

Will the frame level VAD work with diarization too? I tried using the vad_multilingual_frame_marblenet.nemo model instead of the normal vad_multilingual_marblenet.nemo model, and got the following error:

RuntimeError: Error(s) in loading state_dict for EncDecClassificationModel:
	Unexpected key(s) in state_dict: "loss.weight". 

Thank you in advance!

@stevehuang52
Collaborator Author

Will the frame level VAD work with diarization too? I tried using the vad_multilingual_frame_marblenet.nemo model instead of the normal vad_multilingual_marblenet.nemo model, and got the following error:

RuntimeError: Error(s) in loading state_dict for EncDecClassificationModel:
	Unexpected key(s) in state_dict: "loss.weight". 

Hi @gabitza-tech, yes, it works with diarization, but it needs some modifications in the inference pipeline. The error you're seeing occurs because Frame-VAD uses a different model class. You can try `from nemo.collections.asr.models.classification_models import EncDecFrameClassificationModel` and then `vad = EncDecFrameClassificationModel.restore_from("vad_multilingual_frame_marblenet.nemo")`.

Unlike Segment-VAD, which first splits the input audio into many small 0.63s segments and outputs one label per segment, Frame-VAD takes the whole audio as input without splitting it and outputs one label per 20ms frame.
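
As a rough illustration of the one-label-per-20ms-frame output, the following sketch maps frame indices to time spans and estimates how many labels an audio file would yield. The helper names are hypothetical, and the exact frame count produced by the real model may differ slightly at the edges of the audio.

```python
from typing import Tuple

FRAME_SHIFT_MS = 20  # one label per 20 ms frame, as described above


def num_frame_labels(duration_ms: int, shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Approximate number of frame-level labels for audio of the given length."""
    return duration_ms // shift_ms


def frame_span_ms(index: int, shift_ms: int = FRAME_SHIFT_MS) -> Tuple[int, int]:
    """Start and end time (in milliseconds) covered by the label at `index`."""
    return index * shift_ms, (index + 1) * shift_ms


one_hour_ms = 60 * 60 * 1000
print(num_frame_labels(one_hour_ms))  # 180000
print(frame_span_ms(3))               # (60, 80)
```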

Please let us know if you need any help~!

stevehuang52 and others added 2 commits June 5, 2023 10:45
Signed-off-by: stevehuang52 <heh@nvidia.com>
@stevehuang52 stevehuang52 requested a review from tango4j June 5, 2023 14:46
@gabitza-tech
Contributor

gabitza-tech commented Jun 5, 2023

Thank you a lot for your response @stevehuang52! I have a couple more questions:

  • From prior experience and testing, the vad parameters such as onset/offset/min_duration_on/min_duration_off have a great impact on the vad results. Should I leave them similar to the segment_vad parameters, or do I need to further fine-tune them?
  • What is the maximum audio size that frame level vad works on? I sometimes diarize audios that are longer than 1 hour, does frame level vad work as well as segment vad?
  • Should I leave window_length_in_sec=0.00 and shift_length_in_sec=0.02?

Thank you in advance and big kudos for the work! It is very helpful!

Collaborator

@fayejf fayejf left a comment

Had a quick look. Will discuss offline regarding reducing redundancy.

examples/asr/asr_vad/speech_to_text_with_frame_vad.py (two outdated review threads, resolved)
@stevehuang52
Collaborator Author

stevehuang52 commented Jun 6, 2023

Hi @gabitza-tech,

From prior experience and testing, the vad parameters such as onset/offset/min_duration_on/min_duration_off have a great impact on the vad results. Should I leave them similar to the segment_vad parameters, or do I need to further fine-tune them?

Yes, you'll need to tune those parameters. We've roughly tuned them on DIHARD3-dev; the following values generally work well in our cases, but you might need to tune them further:

onset: 0.3 # onset threshold for detecting the beginning of speech
offset: 0.3 # offset threshold for detecting the end of speech
pad_onset: 0.5 # duration added before each speech segment
pad_offset: 0.5 # duration added after each speech segment
min_duration_on: 0.0 # threshold for deleting short speech segments
min_duration_off: 0.6 # threshold for deleting short non-speech segments
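
To show how parameters like these typically interact, here is a simplified sketch of turning per-frame speech probabilities into speech segments. It is not NeMo's post-processing code: the function names `binarize` and `postprocess`, the order of the steps, and the synthetic probabilities are assumptions, and the durations are assumed to be in seconds.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def binarize(
    probs: Sequence[float],
    onset: float = 0.3,
    offset: float = 0.3,
    shift: float = 0.02,  # assumed frame shift in seconds (20 ms)
) -> List[Segment]:
    """Open a segment when prob >= onset, close it when prob < offset."""
    segments: List[Segment] = []
    start = None
    for i, p in enumerate(probs):
        if start is None and p >= onset:
            start = i
        elif start is not None and p < offset:
            segments.append((start * shift, i * shift))
            start = None
    if start is not None:  # speech runs until the end of the audio
        segments.append((start * shift, len(probs) * shift))
    return segments


def postprocess(
    segments: Sequence[Segment],
    pad_onset: float = 0.5,
    pad_offset: float = 0.5,
    min_duration_on: float = 0.0,
    min_duration_off: float = 0.6,
) -> List[Segment]:
    """Pad segments, merge across short gaps, then drop short segments."""
    # Padding is clamped at 0.0; the end is not clamped to the audio length here.
    padded = [(max(0.0, s - pad_onset), e + pad_offset) for s, e in segments]
    merged: List[Segment] = []
    for s, e in padded:
        gap = s - merged[-1][1] if merged else None
        if gap is not None and (gap <= 0 or gap < min_duration_off):
            # Overlapping segments or a non-speech gap that is too short: merge.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# Synthetic probabilities: two speech bursts separated by 2 s of silence.
probs = [0.1] * 10 + [0.9] * 50 + [0.1] * 100 + [0.8] * 50 + [0.1] * 10
raw = binarize(probs)
final = postprocess(raw)
print([(round(s, 2), round(e, 2)) for s, e in final])
```

With the values above, lowering `min_duration_off` keeps more short pauses as separate non-speech regions, while larger `pad_onset`/`pad_offset` values make segments more likely to merge.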

What is the maximum audio size that frame level vad works on? I sometimes diarize audios that are longer than 1 hour, does frame level vad work as well as segment vad?

We haven't run the model on very long audio. Given that Frame-VAD uses 1/8 of the memory of Segment-VAD during inference, it should work with much longer audio than Segment-VAD can handle.

Should I leave window_length_in_sec=0.00 and shift_length_in_sec=0.02?

Yes, please leave them at their default values.

stevehuang52 and others added 7 commits June 6, 2023 16:03
Signed-off-by: stevehuang52 <heh@nvidia.com>
Collaborator

@fayejf fayejf left a comment

LGTM! Thanks. Please remember to add a doc/tutorial on data preparation and training/fine-tuning later.

@stevehuang52 stevehuang52 dismissed tango4j’s stale review June 12, 2023 20:29

fixed as suggested

@stevehuang52 stevehuang52 merged commit 02c3068 into main Jun 13, 2023
@stevehuang52 stevehuang52 deleted the add_fvad_p3 branch June 13, 2023 17:28
4 participants