Add Frame-VAD to ASR+VAD pipeline #6464

stevehuang52 · 2023-04-20T16:17:15Z

What does this PR do ?

This is the third PR for Frame-VAD. Please merge the previous two before this: #6441, #6463

This PR adds Frame-VAD to the ASR+VAD pipeline and also adds a drop-frame mode to ASR+VAD, which previously supported only masking mode.
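To make the difference between the two modes concrete, here is a minimal sketch, not the code added in this PR. It assumes time-major features (one list per frame) and a per-frame boolean speech mask; the helper names `mask_frames` and `drop_frames` and the fill value of 0.0 are hypothetical choices made only for illustration.

```python
from typing import List

Frame = List[float]


def mask_frames(features: List[Frame], is_speech: List[bool], fill_value: float = 0.0) -> List[Frame]:
    """Masking mode: keep the sequence length, overwrite non-speech frames."""
    # fill_value=0.0 is an assumption; a real pipeline may use another value.
    return [
        frame if keep else [fill_value] * len(frame)
        for frame, keep in zip(features, is_speech)
    ]


def drop_frames(features: List[Frame], is_speech: List[bool]) -> List[Frame]:
    """Drop-frame mode: remove non-speech frames, so the sequence gets shorter."""
    return [frame for frame, keep in zip(features, is_speech) if keep]


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
is_speech = [True, False, False, True]

masked = mask_frames(features, is_speech)   # still 4 frames, 2 of them blanked
dropped = drop_frames(features, is_speech)  # only the 2 speech frames remain
```

Masking preserves the time alignment between the features and the original audio, while dropping shortens the input that the ASR model has to process.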

Collection: [ASR]

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

Signed-off-by: stevehuang52 <heh@nvidia.com>


github-actions · 2023-05-24T01:55:43Z

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

Signed-off-by: stevehuang52 <heh@nvidia.com>

tango4j

Some grammar issues and typos need to be fixed.

tango4j · 2023-06-02T23:14:30Z

examples/asr/asr_vad/speech_to_text_with_segment_vad.py

@@ -40,13 +40,15 @@

 To enable profiling, set `profiling=True`, but this will significantly slow down the program.

-To use or disable feature masking, set `use_rttm` to `True` or `False`.
+To use or disable feature masking/droping based on RTTM files, set `use_rttm` to `True` or `False`. 


droping->dropping

tango4j · 2023-06-02T23:20:01Z

nemo/collections/asr/data/feature_to_label.py

-
+        window_length_in_sec (float): Window length in seconds.
+        shift_length_in_sec (float): Shift length in seconds.
+        is_regression_task (bool): if True, the labels are treated as regression task.


as regression task -> as a regression task

tango4j · 2023-06-02T23:20:36Z

nemo/collections/asr/data/feature_to_label.py

+        labels (Optional[list]): List of unique labels collected from all samples.
+        augmentor (Optional): feature augmentation
+        delimiter (str): delimiter to split the labels.
+        is_regression_task (bool): if True, the labels are treated as regression task.


as regression task -> as a regression task

tango4j · 2023-06-02T23:25:10Z

examples/asr/asr_vad/speech_to_text_with_frame_vad.py

+    # Output settings, no need to change
+    output_dir: Optional[str] = None  # will be automatically set by the program
+    output_filename: Optional[str] = None  # will be automatically set by the program
+    pred_name_postfix: Optional[str] = None  # If you need to use another model name, rather than standard one.


LLM models are suggesting:

rather than standard one -> other than the standard one.

gabitza-tech · 2023-06-05T08:21:47Z

Hello,

Will the frame level VAD work with diarization too? I tried using the vad_multilingual_frame_marblenet.nemo model instead of the normal vad_multilingual_marblenet.nemo model, and got the following error:

RuntimeError: Error(s) in loading state_dict for EncDecClassificationModel:
	Unexpected key(s) in state_dict: "loss.weight".

Thank you in advance!

stevehuang52 · 2023-06-05T14:31:09Z

Will the frame level VAD work with diarization too? I tried using the vad_multilingual_frame_marblenet.nemo model instead of the normal vad_multilingual_marblenet.nemo model, and got the following error:
RuntimeError: Error(s) in loading state_dict for EncDecClassificationModel:
	Unexpected key(s) in state_dict: "loss.weight". 

Hi @gabitza-tech, yes, it works with diarization, but it needs some modifications in the inference pipeline. The error you're seeing occurs because Frame-VAD uses a different model class. You can try `from nemo.collections.asr.models.classification_models import EncDecFrameClassificationModel` and then `vad = EncDecFrameClassificationModel.restore_from("vad_multilingual_frame_marblenet.nemo")`.

Unlike segment-VAD, which first splits the input audio into many short 0.63 s segments and then outputs one label per segment, Frame-VAD takes the whole audio as input without any segmentation and outputs one label per 20 ms frame.
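As a rough illustration of what "one label per 20 ms frame" means downstream, the sketch below converts a list of per-frame binary labels into (start, end) times in seconds. It is a standalone example and does not call NeMo; the function name `frames_to_segments` is hypothetical, and the labels are made up.

```python
from typing import List, Tuple

FRAME_SEC = 0.02  # one label per 20 ms frame, as described above


def frames_to_segments(labels: List[int], frame_sec: float = FRAME_SEC) -> List[Tuple[float, float]]:
    """Group consecutive speech frames (label 1) into (start, end) times in seconds."""
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, label in enumerate(labels):
        if label and start is None:
            start = i
        elif not label and start is not None:
            segments.append((round(start * frame_sec, 2), round(i * frame_sec, 2)))
            start = None
    if start is not None:  # speech runs until the end of the audio
        segments.append((round(start * frame_sec, 2), round(len(labels) * frame_sec, 2)))
    return segments


labels = [0, 1, 1, 1, 0, 0, 1, 1]  # 8 frames = 0.16 s of audio
segments = frames_to_segments(labels)
```

Because every frame covers a fixed 20 ms, the frame index alone determines the timestamp, so no segment bookkeeping is needed at inference time.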

Please let us know if you need any help~!

Signed-off-by: stevehuang52 <heh@nvidia.com>

gabitza-tech · 2023-06-05T22:11:24Z


Thank you very much for your response, @stevehuang52! I have a couple more questions:

From prior experience and testing, the VAD parameters such as onset/offset/min_duration_on/min_duration_off have a great impact on the VAD results. Should I leave them similar to the segment_vad parameters, or do I need to fine-tune them further?
What is the maximum audio length that frame-level VAD works on? I sometimes diarize audio longer than 1 hour. Does frame-level VAD work as well as segment VAD in that case?
Should I leave window_length_in_sec=0.00 and shift_length_in_sec=0.02?

Thank you in advance and big kudos for the work! It is very helpful!

fayejf

had a quick look. will discuss offline regarding reducing redundant

examples/asr/asr_vad/speech_to_text_with_frame_vad.py

stevehuang52 · 2023-06-06T19:28:55Z

Hi @gabitza-tech,

from prior experience and testing, the vad parameters such as onset/offset/min_duration_on/min_duration_off have a great impact on the vad results. Should i leave them similar to the segment_vad parameters, or do I need to further fine-tune them?

Yes, you'll need to tune those parameters. We've roughly tuned them on DIHARD3-dev. The following values generally work well in our cases, but you might need to tune them further:

onset: 0.3 # onset threshold for detecting the beginning of a speech segment
offset: 0.3 # offset threshold for detecting the end of a speech segment
pad_onset: 0.5 # duration added before each speech segment
pad_offset: 0.5 # duration added after each speech segment
min_duration_on: 0.0 # threshold for deleting short speech segments
min_duration_off: 0.6 # threshold for deleting short non-speech segments
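To show how these six parameters interact, here is a simplified sketch of threshold-based post-processing. It is not NeMo's implementation: the function names `binarize` and `postprocess` are hypothetical, and the order of the steps (pad, then fill short gaps, then drop short speech) is an assumption made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]


def binarize(probs: List[float], onset: float = 0.3, offset: float = 0.3,
             frame_sec: float = 0.02) -> List[Segment]:
    """Open a segment when prob >= onset; close it when prob < offset."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= onset:
            start = i
        elif start is not None and p < offset:
            segments.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:  # speech continues to the end of the audio
        segments.append((start * frame_sec, len(probs) * frame_sec))
    return segments


def postprocess(segments: List[Segment], pad_onset: float = 0.5, pad_offset: float = 0.5,
                min_duration_on: float = 0.0, min_duration_off: float = 0.6) -> List[Segment]:
    # 1) Pad each segment on both sides (clamped at 0 s).
    padded = [(max(0.0, s - pad_onset), e + pad_offset) for s, e in segments]
    # 2) Fill non-speech gaps shorter than min_duration_off by merging neighbours.
    merged: List[Segment] = []
    for s, e in padded:
        if merged and s - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    # 3) Delete speech segments shorter than min_duration_on.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# 1 s frames are used here only to keep the numbers easy to read.
probs = [0.1, 0.4, 0.5, 0.2, 0.1, 0.9]
raw = binarize(probs, frame_sec=1.0)
final = postprocess(raw)
```

With equal onset and offset thresholds the binarization reduces to a single cutoff; setting offset lower than onset would make segments harder to close once opened.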

What is the maximum audio size that frame level vad works on? I sometimes diarize audios that are longer than 1 hour, does frame level vad work as well as segment vad?

We haven't run the model on very long audio. Given that Frame-VAD uses 1/8 of the memory that Segment-VAD uses during inference, Frame-VAD should be able to handle much longer audio than Segment-VAD.

Should i leave window_length_in_sec=0.00 and shift_length_in_sec=0.02?

Yes, please leave them at their default values.

Signed-off-by: stevehuang52 <heh@nvidia.com>

fayejf

LGTM! Thanks. please remember to add doc/tutorial regarding data preparation and train/finetune later

fixed as suggested

stevehuang52 and others added 8 commits April 18, 2023 15:19

add model, dataset, necessary utils and tests

1efbeb0

Signed-off-by: stevehuang52 <heh@nvidia.com>

Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_fvad_p1

640179d

fix tarred data

68abebd

Signed-off-by: stevehuang52 <heh@nvidia.com>

fix typo

f739598

Signed-off-by: stevehuang52 <heh@nvidia.com>

Merge branch 'main' into add_fvad_p1

426366f

add fvad examples and update utils

a704d43

Signed-off-by: stevehuang52 <heh@nvidia.com>

add copyright

9091b4d

Signed-off-by: stevehuang52 <heh@nvidia.com>

add frame-vad to ASR+VAD pipeline, add drop-frame mode

4019af9

Signed-off-by: stevehuang52 <heh@nvidia.com>

stevehuang52 requested review from tango4j and fayejf April 20, 2023 16:17

github-actions bot added ASR common labels Apr 20, 2023

stevehuang52 and others added 10 commits April 20, 2023 21:45

fix typo

5571f54

Signed-off-by: stevehuang52 <heh@nvidia.com>

Merge branch 'main' into add_fvad_p3

d617a60

Merge branch 'main' into add_fvad_p3

3f21784

update doc

e347f44

Signed-off-by: stevehuang52 <heh@nvidia.com>

Merge branch 'add_fvad_p3' of https://github.com/NVIDIA/NeMo into add…

5cd13eb

…_fvad_p3

fix masking

6be1b76

Signed-off-by: stevehuang52 <heh@nvidia.com>

update doc

8f74da3

Signed-off-by: stevehuang52 <heh@nvidia.com>

slight refactor

56ea200

Signed-off-by: stevehuang52 <heh@nvidia.com>

fix rnnt output

114e7c1

Signed-off-by: stevehuang52 <heh@nvidia.com>

add support for hybrid model

eed2db6

Signed-off-by: stevehuang52 <heh@nvidia.com>

github-actions bot added the stale label May 24, 2023

Merge branch 'main' into add_fvad_p3

33da1e3

Signed-off-by: stevehuang52 <heh@nvidia.com>

github-actions bot removed the common label May 24, 2023

Merge branch 'main' into add_fvad_p3

c3f58fe

github-actions bot removed the stale label May 25, 2023

stevehuang52 added 2 commits May 31, 2023 12:00

update tutorial

6e058d5

Signed-off-by: stevehuang52 <heh@nvidia.com>

merge main

89043b8

Signed-off-by: stevehuang52 <heh@nvidia.com>

update

5396a23

Signed-off-by: stevehuang52 <heh@nvidia.com>

tango4j previously requested changes Jun 2, 2023


stevehuang52 and others added 2 commits June 5, 2023 10:45

fix typo

c27835f

Signed-off-by: stevehuang52 <heh@nvidia.com>

Merge branch 'main' into add_fvad_p3

309a30d

stevehuang52 requested a review from tango4j June 5, 2023 14:46

Merge branch 'main' into add_fvad_p3

2f6268c

fayejf requested changes Jun 6, 2023


stevehuang52 and others added 7 commits June 6, 2023 16:03

merge frame- and segment-vad scripts

0b52dfa

Signed-off-by: stevehuang52 <heh@nvidia.com>

update tutorial

a700949

Signed-off-by: stevehuang52 <heh@nvidia.com>

update doc

39473bf

Signed-off-by: stevehuang52 <heh@nvidia.com>

update doc

b7129d0

Signed-off-by: stevehuang52 <heh@nvidia.com>

Merge branch 'main' into add_fvad_p3

cca93ba

Merge branch 'main' into add_fvad_p3

3f0cdb6

Merge branch 'main' into add_fvad_p3

b566ea3

fayejf approved these changes Jun 12, 2023


stevehuang52 merged commit 02c3068 into main Jun 13, 2023

stevehuang52 deleted the add_fvad_p3 branch June 13, 2023 17:28
