
Silero VAD for cleaning the dataset from silence #1166

Open · wants to merge 94 commits into master
Conversation

@rilshok (Contributor) commented Sep 29, 2023

I intend to add a new workflow to Lhotse for processing arbitrary audio datasets: it removes silence and preserves only speech using the Silero VAD, which can accurately detect speech in an audio stream. The workflow should help users quickly and efficiently convert arbitrary datasets by cutting out silence and retaining only speech. An important aspect of this process is preserving all supervisions for each segment while accounting for the changes made to the audio.

Before accepting this PR, I invite you to review my code. Currently it handles only the trivial case: MonoCut objects, with no support for other Cut types. I want to add that support but am not yet sure of the best approach. I would appreciate comments and suggestions for improving the code, and I would be glad if you could try running it and share your impressions.

Key Changes

  • Added the speech_only function, which processes audio files by removing silence and preserving only speech (see the illustrative sketch after this list).

  • Added the speech_only workflow, which enables processing datasets from the CLI.

  • The code is written so that it can be reused in scenarios conceptually similar to this task.
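
For illustration only, here is a hypothetical Python invocation of the new function; the import path, signature, and return type below are assumptions, not the PR's confirmed API:

from lhotse import CutSet
from lhotse.workflows import speech_only  # hypothetical import path

cuts = CutSet.from_file("data/cuts.jsonl.gz")
clean = speech_only(cuts)  # assumed to return cuts with the silence removed
clean.to_file("data/cuts_speech_only.jsonl.gz")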

Issues Requiring Discussion

There are several places in the code where I'm uncertain about the implementation choice. In those places I raise NotImplementedError to indicate that I need help selecting the best approach. This mainly concerns handling subclasses of Cut other than MonoCut, where I'm not sure of the best way to proceed.

Additionally, I have the _to_mono function, which should convert Recording objects to mono for speech analysis with the Silero VAD. I'm confident there is a more elegant way to do this, so please provide some guidance.

I would like to receive feedback on function naming, variable naming, and code architecture. If you have specific suggestions for improvement, I would be glad to hear them.

@rilshok (Contributor, Author) commented Oct 2, 2023

An example use case could be the need to refine a set of audio recordings where medical professionals discuss the results of patient medical examinations. These recordings contain medical text reflected in noisy supervision data, which was collected using a semi-automatic method. The peculiarity of this supervision data is that the annotation of one piece of text overlaps with another, and it requires refinement using automatic audio transcription algorithms. Prior to this refinement stage, it is necessary to prepare the audio recordings by removing all background noises and periods of silence, without losing context. The segments that remain after the removal of inactivity should include overlaps in the supervision data for subsequent refinement.

@rilshok (Contributor, Author) commented Oct 2, 2023

We need the ability to re-save a dataset with the silence sections cut out, so that when working with such a dataset in the future we can be sure it is clean enough and free of background noise. Background noise in a dataset slows down experimentation, wastes disk space, and introduces bias into hypothesis testing.

@pzelasko (Collaborator) commented Oct 2, 2023

I think I'm starting to understand what you are trying to achieve. Can you confirm the problem boils down to the following description: given a cut with N supervisions, modify the supervision start and end times according to new external information? Note that I need to understand the high-level goal before I start reviewing.

If the above statement is true, can this problem be solved using the following actions:

  1. Run the VAD on a cut and obtain a list of VAD-supervisions.
  2. Intersect the VAD-supervisions with the original supervisions. Intersection here means creating a new supervision list where the segments cover only the time intervals found in both of the inputs. The result copies all metadata from the original supervision list.
  3. Update the supervisions in the cut.

If the above interpretation is correct, the only thing we're missing in Lhotse is an implementation of the intersection of two supervision sets. This could be added as a new method on Cut/CutSet, e.g. def refine_supervision_times(self: Cut/CutSet, other: List[Supervision]) -> Cut/CutSet. I don't think it requires a separate workflow, though.
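
A minimal sketch of such an intersection, assuming interval-tree semantics; intersect_supervisions is an illustrative helper, not an existing Lhotse method:

from intervaltree import IntervalTree
from lhotse.utils import fastcopy

def intersect_supervisions(original, vad_segments):
    # Merge the VAD speech intervals into a single tree.
    tree = IntervalTree()
    for seg in vad_segments:
        tree.addi(seg.start, seg.end)
    tree.merge_overlaps()
    result = []
    for sup in original:
        # Clip each original supervision to the speech intervals it overlaps,
        # copying all metadata; note that ids would need de-duplication
        # whenever one supervision splits into several pieces.
        for iv in sorted(tree.overlap(sup.start, sup.end)):
            start = max(sup.start, iv.begin)
            end = min(sup.end, iv.end)
            result.append(fastcopy(sup, start=start, duration=end - start))
    return result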

@rilshok (Contributor, Author) commented Oct 3, 2023

No, unfortunately the problem does not boil down to the description you suggested, because it does not account for the need to refine the silence intervals inside the supervisions.

The main task is to clean up the audio recording. We want to obtain a new Cut whose Recording contains no silence, while correctly preserving all supervisions inside the Cut. Importantly, we do not want to split the original Cut into a CutSet where each element contains one SupervisionSegment. We want a single new Cut that contains all of the original SupervisionSegments (except those dropped by the deletion procedure).

I think that, in addition to the intersection procedure you suggest, AlignmentItem could be used to segment the inner speech/silence spans. Alternatively, some kind of Recording masking procedure could be applied. We would also need a procedure for loading audio that takes the AlignmentItems or the audio mask into account.

@desh2608 (Collaborator) commented Oct 5, 2023

> An example use case could be the need to refine a set of audio recordings where medical professionals discuss the results of patient medical examinations. These recordings contain medical text reflected in noisy supervision data, which was collected using a semi-automatic method. The peculiarity of this supervision data is that the annotation of one piece of text overlaps with another, and it requires refinement using automatic audio transcription algorithms. Prior to this refinement stage, it is necessary to prepare the audio recordings by removing all background noises and periods of silence, without losing context. The segments that remain after the removal of inactivity should include overlaps in the supervision data for subsequent refinement.

Again, why can this not be done by appending the cuts corresponding to the supervisions? Why does the "filtered" recording need to be saved beforehand, except perhaps for loading efficiency?

@rilshok (Contributor, Author) commented Oct 5, 2023

> Again, why can this not be done by appending the cuts corresponding to the supervisions?

Could you give a concrete example of how exactly we can override a single source SupervisionSegment, given the silence intervals, without splitting it into duplicates with different offsets and durations?

> Why does the "filtered" recording need to be saved beforehand, except perhaps for loading efficiency?

To simplify further work with a cleaner dataset, and to save disk space.

@desh2608 (Collaborator) commented Oct 5, 2023

> > Again, why can this not be done by appending the cuts corresponding to the supervisions?
>
> Could you give a concrete example of how exactly we can override a single source SupervisionSegment, given the silence intervals, without splitting it into duplicates with different offsets and durations?
>
> > Why does the "filtered" recording need to be saved beforehand, except perhaps for loading efficiency?
>
> To simplify further work with a cleaner dataset, and to save disk space.

┌─────────────────────────────────────────────────────────────────────┐
│             Original recording (with speech and noise)              │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
                    .─────────────────────────────.
                   (   Speech activity detection   )
                    `─────────────────────────────'
                                   │
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   Cut with 2 supervision segments                   │
│            ◁─────────────────▷            ◁─────────────────▷       │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   │
                                   ▼
               .───────────────────────────────────────.
              (   cut.trim_to_supervision_segments()    )
               `───────────────────────────────────────'
                                   │
                      ┌────────────┴───────────────────┐
                      │                                │
                      ▼                                ▼
             ┌────────────────┐              ┌──────────────────┐
             │  Speech cut 1  │              │   Speech cut 2   │
             └────────────────┘              └──────────────────┘
                      │                                │
                      └──────────────┬─────────────────┘
                                     │
                                     ▼
                        .─────────────────────────.
                       (         append()          )
                        `─────────────────────────'
                                     │
                                     ▼
                 ┌───────────────────────────────────────┐
                 │         Combined speech cuts          │
                 └───────────────────────────────────────┘
                                     │
                                     ▼
                        .─────────────────────────.
                       (     cut.save_audio()      )
                        `─────────────────────────'
                                     │
                                     │
                                     │
                                     ▼
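                  ┌───────────────────────────────────────┐
                  │     Saved speech-only recording       │
                  └───────────────────────────────────────┘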

@rilshok (Contributor, Author) commented Oct 5, 2023

      1               2       34      5 6     7     8          9     
┌─────────────────────────────────────────────────────────────────────┐
│                        Cut with Supervision                         │
│     ◁───────────────.───────.▷                                      │
|                     ◁───────.───────.───────.─────▷                 │─┐
│                     .       .       . ◁─────.─────────────────▷     | │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
                      .       .    │  .       .                         │
                      .       .    ▼  .       .                         │
                    .─────────────────────────────.                     │
                   (   Speech activity detection   )                    │
                    `─────────────────────────────'                     │
                      .       .    │  .       .                         │
                      .       .    │  .       .                         │
                      .       .    ▼  .       .                         │
┌─────────────────────.───────.───────.───────.───────────────────────┐ │
│                     .    Silence Supervision.                       │ │
│                     ◁───────▷       ◁───────▷                       │ │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
                      .       .    │  .  ┌──────────────────────────────┘
                      .       .    ▼  .  ▼    .
                      .──────────────────────────.
                     (   Some kind of procedure   )
                      `──────────────────────────'
                      .       .    │  .       .
                      .       .    ▼  .       .
┌─────────────────────.───────.───────.───────.───────────────────────┐
│                     .  Combined speech cuts .                       │
│     ◁───────────────/////////▷      /////////                       │
|                     /////////◁──────/////////─────▷                 │
│                     /////////       /////////◁────────────────▷     |
└─────────────────────────────────────────────────────────────────────┘

@rilshok (Contributor, Author) commented Oct 5, 2023

How can we describe this with the procedure you suggest?

@rilshok (Contributor, Author) commented Oct 5, 2023

How can such a resulting Cut be described? Is there any way to guarantee that, when the audio is loaded with load_audio, the numpy array will be shorter than the original and contain no silence segments, and that only three SupervisionSegments will remain in the cut.supervisions list?

@desh2608 (Collaborator) commented Oct 5, 2023

      1               2       34      5 6     7     8          9     
┌─────────────────────────────────────────────────────────────────────┐
│                        Cut with Supervision                         │
│     ◁───────────────.───────.▷                                      │
|                     ◁───────.───────.───────.─────▷                 │─┐
│                     .       .       . ◁─────.─────────────────▷     | │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
                      .       .    │  .       .                         │
                      .       .    ▼  .       .                         │
                    .─────────────────────────────.                     │
                   (   Speech activity detection   )                    │
                    `─────────────────────────────'                     │
                      .       .    │  .       .                         │
                      .       .    │  .       .                         │
                      .       .    ▼  .       .                         │
┌─────────────────────.───────.───────.───────.───────────────────────┐ │
│                     .    Silence Supervision.                       │ │
│                     ◁───────▷       ◁───────▷                       │ │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
                      .       .    │  .  ┌──────────────────────────────┘
                      .       .    ▼  .  ▼    .
                      .──────────────────────────.
                     (   Some kind of procedure   )
                      `──────────────────────────'
                      .       .    │  .       .
                      .       .    ▼  .       .
┌─────────────────────.───────.───────.───────.───────────────────────┐
│                     .  Combined speech cuts .                       │
│     ◁───────────────/////////▷      /////////                       │
|                     /////////◁──────/////////─────▷                 │
│                     /////////       /////////◁────────────────▷     |
└─────────────────────────────────────────────────────────────────────┘

What do the //// represent? Does this mean you are effectively removing the time segments corresponding to "silence" from your original supervision segments? If so, perhaps this can be achieved by adding interval-tree operations to the SupervisionSet class, as Piotr suggested. Once you have the refined segments, you can use cut.trim_to_supervision_groups() instead of cut.trim_to_supervisions() if you believe there may be overlapping segments that you want to keep together.
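
A hedged sketch of that flow, where remove_silence_intervals is a hypothetical helper (e.g. built on the interval-tree operation sketched earlier) and the cut is assumed to be a MonoCut; method names follow the mention above, exact signatures may differ:

from lhotse.utils import fastcopy

# Refine the supervisions by subtracting the detected silence intervals.
refined = remove_silence_intervals(cut.supervisions, silence_segments)  # hypothetical helper
cut = fastcopy(cut, supervisions=refined)
# Keep overlapping supervisions together, one cut per group of overlaps.
groups = cut.trim_to_supervision_groups()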

@rilshok (Contributor, Author) commented Oct 5, 2023

Yes, //// means that we trimmed the silence and refined the supervision intervals. In this PR, I implemented the required operations using IntervalTree to achieve the desired result. Since functionality like the refine_supervision_times proposed by Piotr is not yet part of the basic Cut methods, I may suggest modifying my proposed trim_inactivity workflow in the future, once the corresponding functionality is implemented.

@desh2608 (Collaborator) commented Oct 5, 2023

Since you already have the algorithm implemented, would you be willing to expose this functionality as a SupervisionSet method that this workflow can then simply use? That way, other users could also call the method directly.

@rilshok (Contributor, Author) commented Oct 5, 2023

Yes, of course, I am ready to implement such functionality in Cut, SupervisionSet, etc. But we need to agree precisely on how to test this functionality and where in the code to implement it. Personally, I think this functionality is quite exotic, and few people really need it directly when working with CutSet. But if you think it should be included in the backbone of the library, let's do it.

@desh2608 (Collaborator) commented Oct 5, 2023

Let's see what @pzelasko has to say about this.

@pzelasko (Collaborator) commented Oct 7, 2023

I'm still not sure. It looks like your example could be implemented with .truncate()/.split() to remove the detected non-speech segments and .append() to combine whatever cuts remain. The remaining issue is how to interpret an existing supervision segment being "masked out": once you truncate, it has to become two sub-segments, but unless you know the alignment, such a supervision is no longer meaningful for tasks such as ASR. You may, however, want to replace these sub-segments with a new, merged supervision in the resulting MixedCut. This could probably work and be implemented as part of the "refine" thingy. What do you think?

@rilshok (Contributor, Author) commented Oct 7, 2023

I think that the main purpose of the silence detector is to remove silence from a supervised segment of audio. All of the proposed alternatives to resaving the full track and its supervisions require splitting the supervised segment into parts. I believe that duplicating a supervision segment is disruptive in any task; a supervision is not divisible at all if it is represented only by an offset and a duration. I think the best way to natively implement the required functionality in lhotse is an AudioSource masking mechanism. The mask could be described, similarly to supervisions or alignments, using intervals, and be a serializable part of the Recording object.

But I would go further with this idea and say that a Recording could be described by a sequence of audio segments, each defined by an offset and a duration, such that when audio is loaded with load_audio, the segments are sequentially loaded from the AudioSource and concatenated. Such a description would allow not only cutting segments out of audio, but also making repeated, thinned, or truncated Recordings. This mechanism is already partially implemented in Recording, but currently there is effectively only one such segment.
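
A purely illustrative sketch of that idea; SegmentedRecording is a hypothetical class, not part of Lhotse:

from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

from lhotse import Recording

@dataclass
class SegmentedRecording:  # hypothetical, for illustration only
    recording: Recording
    # Spans to keep, as (offset_seconds, duration_seconds). Loading concatenates
    # them in order, which also permits repeated or reordered pieces of the source.
    spans: List[Tuple[float, float]]

    def load_audio(self) -> np.ndarray:
        chunks = [
            self.recording.load_audio(offset=offset, duration=duration)
            for offset, duration in self.spans
        ]
        return np.concatenate(chunks, axis=-1)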

@pzelasko (Collaborator) commented Oct 8, 2023

I appreciate the discussion, but the design you're suggesting is too complex and not necessary. You can already achieve sequential loading of various audio chunks using cuts. If you need to mask out some portions of the audio, you can do it post hoc by keeping the mask interval information either as overlapping supervisions (marked as special via ids or custom fields) or in the cut's custom fields. However, I don't really see why you would want to mask out silence; if you want to get rid of those segments of the recording instead, you can follow the procedure I suggested above.

To clarify, here's an example (which should be generalized to arbitrary lists of supervisions if you want to go this way):

# Sketch, not a full script: the Recording/SupervisionSegment fields are elided.
from lhotse import Recording, SupervisionSegment
from lhotse.utils import fastcopy

r = Recording(...)
sups = [
    SupervisionSegment(..., start=2, duration=5),
]

# Assume:
# silence_segments = [
#     SupervisionSegment(..., start=3, duration=2)
# ]
silence_segments = run_vad(r)  # placeholder for the VAD step

# Note: if we used the silence segments to cut supervisions, the original supervision
# would have been split into two sub-segments: start=2, duration=1 and start=5, duration=2.

# Instead of splitting, we create a cut that skips the silent segment in the recording
# and has a new supervision that omits the silence:
c = r.to_cut()
new = (
    c
    .truncate(offset=2, duration=1)  # Cut.truncate takes 'offset' rather than 'start'
    .append(
        c.truncate(offset=5, duration=2)
    )
)

# We will now add the updated supervision information. Note:
# - start=0 because we removed the initial silence
# - duration=3 because we removed the internal 2 s of silence that the original supervision over-spanned
new.supervisions = [fastcopy(sups[0], start=0, duration=3)]
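
Assuming the sketch above behaves as intended, the result can be checked roughly like this:

# The appended cut should span 3 seconds of speech, with the internal silence gone.
assert new.duration == 3
audio = new.load_audio()  # one channel, 3 * sampling_rate samples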

@rilshok (Contributor, Author) commented Oct 9, 2023

@pzelasko and @desh2608, thanks for the suggested solution; I appreciate your contributions to this discussion. I didn't realize there was an option to glue a Recording together piece by piece. The pseudocode you suggested looks like an applicable and lazy approach that doesn't require touching the audio file itself to produce the result. For my part, I am ready to change the trim_inactivity function to use this mechanism for producing the final Cuts. Do you think the inactivity-removal workflow will be in demand among lhotse users? Does it make sense to spend the effort to switch to your proposed approach in this PR?

@pzelasko (Collaborator) commented:

I think that'd work. May I ask what you are using it for? It seems like a pretty drastic modification; do you find it significantly helps with some task?

@rilshok (Contributor, Author) commented Oct 10, 2023

That is, would switching to your suggested approach using truncate and append help solve any problem in this PR? It's not a significant change overall, but it will speed up the generation of the resulting CutSets in the trim_inactivity pipeline when used interactively, such as in Jupyter, since you won't have to physically rearrange the audio when processing without saving to disk. If I misunderstood your question, let me know.

@pzelasko (Collaborator) commented:

What you're saying is clear. I meant: assuming this PR is finished and merged, how do you expect people to use it, in what situations, for which tasks, and what kind of result improvement would you expect? I'm asking because I've never encountered this technique used for any task (except maybe speaker ID recipes).

@rilshok (Contributor, Author) commented Oct 11, 2023

I think it can be useful to other community members for preparing datasets. At least my colleagues say they would like this Silero-VAD-based workflow to appear in lhotse. The task is formulated as follows: take any arbitrary dataset and re-save it with all silence sections deleted from the audio files, with the option of selecting one specific channel or converting to mono.

@pzelasko (Collaborator) commented:

OK, cool. If you're OK with that, let's move forward with the changes described above. I also suggest naming this workflow "remove_nonspeech", because "trim" implies only prefix and suffix modification.

@rilshok (Contributor, Author) commented Oct 11, 2023

How about the name remove_inactivity? The workflow allows choosing the activity detector, i.e. it does not have to be a VAD.

@rilshok (Contributor, Author) commented Oct 13, 2023

As a result of merging with the append method, I ran into an issue with MixedCut: it doesn't allow overriding supervisions. MixedCut assumes that all supervisions are stored within its child segments. However, in our task the typical situation is that an audio segment annotated by a supervision is divided into several smaller parts, so the original supervision should span several cuts rather than being confined within one of them. I suggest reconsidering how the supervisions property is handled in the MixedCut class, as it currently prevents users from attaching any supervision to it. This behavior seems to contradict the purpose of the supervisions field.

@desh2608 (Collaborator) commented:

> As a result of merging with the append method, I ran into an issue with MixedCut: it doesn't allow overriding supervisions. MixedCut assumes that all supervisions are stored within its child segments. However, in our task the typical situation is that an audio segment annotated by a supervision is divided into several smaller parts, so the original supervision should span several cuts rather than being confined within one of them. I suggest reconsidering how the supervisions property is handled in the MixedCut class, as it currently prevents users from attaching any supervision to it. This behavior seems to contradict the purpose of the supervisions field.

You can add a new track to the MixedCut that covers the full duration, if needed. I don't think we should change the whole implementation to benefit one minor use case.

@rilshok (Contributor, Author) commented Oct 13, 2023

As mentioned in my previous message, the current implementation of the supervisions property in the MixedCut class violates the Liskov substitution principle. I suggest reconsidering this implementation to adhere to the principle and enhance the functionality of the class.

@desh2608 (Collaborator) commented Oct 13, 2023

I get your point. This would require an extensive re-write of several modules. I don't have time for this at the moment, though.

Or perhaps we can get away with an easier change by allowing MixedCut to have "global" supervisions. However, by definition, a mixed cut is just a collection of cuts (as tracks), so I don't understand what it would mean for it to have such a "global" supervision.

@pzelasko (Collaborator) commented:

The easiest way to handle this might be to iterate over the tracks, remove all supervisions, and attach the new supervision to the first cut in the tracks list.
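
A hedged sketch of that workaround, assuming mixed is the MixedCut produced by append(), merged_sup is the new supervision, and every track holds a MonoCut:

from lhotse.utils import fastcopy

new_tracks = []
for i, track in enumerate(mixed.tracks):
    # Drop the per-track supervisions and attach the merged supervision
    # to the first track's cut only.
    inner = fastcopy(track.cut, supervisions=[merged_sup] if i == 0 else [])
    new_tracks.append(fastcopy(track, cut=inner))
mixed = fastcopy(mixed, tracks=new_tracks)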
