
Silero VAD for cleaning the dataset from silence #1166

Open · wants to merge 94 commits into master
Conversation

@rilshok (Contributor) commented Sep 29, 2023

I intend to add a new workflow to Lhotse for processing arbitrary audio datasets: it removes silence and preserves only speech using the Silero VAD, which can accurately detect speech in an audio stream. The workflow should help users quickly and efficiently convert arbitrary datasets by cutting out silence and retaining only speech. An important aspect of this process is preserving all supervisions for each segment while accounting for the changes made to the audio.

Before accepting this PR, I invite you to review my code. Currently it handles only the trivial case: MonoCut objects, with no support for other Cut types. I want to add that support but am not yet sure of the best approach. I would appreciate comments and suggestions for improving the code, and I would be glad if you could try running it and share your impressions.

Key Changes

  • Added the speech_only function, which processes audio files by removing silence and preserving only speech (see the illustrative sketch after this list).

  • Added the speech_only workflow, which enables processing datasets from the CLI.

  • The code is written so that it can be reused in scenarios conceptually similar to this task.
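
For illustration only, here is a hypothetical Python invocation of the new function; the import path, signature, and return type below are assumptions, not the PR's confirmed API:

from lhotse import CutSet
from lhotse.workflows import speech_only  # hypothetical import path

cuts = CutSet.from_file("data/cuts.jsonl.gz")
clean = speech_only(cuts)  # assumed to return cuts with the silence removed
clean.to_file("data/cuts_speech_only.jsonl.gz")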

Issues Requiring Discussion

There are several places in the code where I'm uncertain about the implementation choice. In those places I raise NotImplementedError to indicate that I need help selecting the best approach. This mainly concerns handling subclasses of Cut other than MonoCut, where I'm not sure of the best way to proceed.

Additionally, I have the _to_mono function, which should convert Recording objects to mono for speech analysis with the Silero VAD. I'm confident there is a more elegant way to do this, so please provide some guidance.

I would like to receive feedback on function naming, variable naming, and code architecture. If you have specific suggestions for improvement, I would be glad to hear them.

@rilshok (Contributor, Author) commented Oct 2, 2023

An example use case could be the need to refine a set of audio recordings where medical professionals discuss the results of patient medical examinations. These recordings contain medical text reflected in noisy supervision data, which was collected using a semi-automatic method. The peculiarity of this supervision data is that the annotation of one piece of text overlaps with another, and it requires refinement using automatic audio transcription algorithms. Prior to this refinement stage, it is necessary to prepare the audio recordings by removing all background noises and periods of silence, without losing context. The segments that remain after the removal of inactivity should include overlaps in the supervision data for subsequent refinement.

@rilshok (Contributor, Author) commented Oct 2, 2023

We need the ability to re-save a dataset with the silence sections cut out, so that when working with such a dataset in the future we can be sure it is clean enough and free of background noise. Background noise in a dataset slows down experimentation, wastes disk space, and introduces bias into hypothesis testing.

@pzelasko (Collaborator) commented Oct 2, 2023

I think I'm starting to understand what you are trying to achieve. Can you confirm the problem boils down to the following description: given a cut with N supervisions, modify the supervision start and end times according to new external information? Note that I need to understand the high-level goal before I start reviewing.

If the above statement is true, can this problem be solved using the following actions:

  1. Run the VAD on a cut and obtain a list of VAD-supervisions.
  2. Intersect the VAD-supervisions with the original supervisions. Intersection here means creating a new supervision list where the segments cover only the time intervals found in both of the inputs. The result copies all metadata from the original supervision list.
  3. Update the supervisions in the cut.

If the above interpretation is correct, the only thing we're missing in Lhotse is an implementation of the intersection of two supervision sets. This could be added as a new method on Cut/CutSet, e.g. def refine_supervision_times(self: Cut/CutSet, other: List[Supervision]) -> Cut/CutSet. I don't think it requires a separate workflow, though.
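
A minimal sketch of such an intersection, assuming interval-tree semantics; intersect_supervisions is an illustrative helper, not an existing Lhotse method:

from intervaltree import IntervalTree
from lhotse.utils import fastcopy

def intersect_supervisions(original, vad_segments):
    # Merge the VAD speech intervals into a single tree.
    tree = IntervalTree()
    for seg in vad_segments:
        tree.addi(seg.start, seg.end)
    tree.merge_overlaps()
    result = []
    for sup in original:
        # Clip each original supervision to the speech intervals it overlaps,
        # copying all metadata; note that ids would need de-duplication
        # whenever one supervision splits into several pieces.
        for iv in sorted(tree.overlap(sup.start, sup.end)):
            start = max(sup.start, iv.begin)
            end = min(sup.end, iv.end)
            result.append(fastcopy(sup, start=start, duration=end - start))
    return result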

@rilshok (Contributor, Author) commented Oct 3, 2023

No, unfortunately the problem does not boil down to the description you suggested, because it does not account for the need to refine the silence intervals inside the supervisions.

The main task is to clean up the audio recording. We want to obtain a new Cut whose Recording contains no silence, while correctly preserving all supervisions inside the Cut. Importantly, we do not want to split the original Cut into a CutSet where each element contains one SupervisionSegment. We want a single new Cut that contains all of the original SupervisionSegments (except those dropped by the deletion procedure).

I think that, in addition to the intersection procedure you suggest, AlignmentItem could be used to segment the inner speech/silence spans. Alternatively, some kind of Recording masking procedure could be applied. We would also need a procedure for loading audio that takes the AlignmentItems or the audio mask into account.

@desh2608 (Collaborator) commented Oct 5, 2023

> An example use case could be the need to refine a set of audio recordings where medical professionals discuss the results of patient medical examinations. These recordings contain medical text reflected in noisy supervision data, which was collected using a semi-automatic method. The peculiarity of this supervision data is that the annotation of one piece of text overlaps with another, and it requires refinement using automatic audio transcription algorithms. Prior to this refinement stage, it is necessary to prepare the audio recordings by removing all background noises and periods of silence, without losing context. The segments that remain after the removal of inactivity should include overlaps in the supervision data for subsequent refinement.

Again, why can this not be done by appending the cuts corresponding to the supervisions? Why does the "filtered" recording need to be saved beforehand, except perhaps for loading efficiency?

@rilshok (Contributor, Author) commented Oct 5, 2023

> Again, why can this not be done by appending the cuts corresponding to the supervisions?

Could you give a concrete example of how exactly we can override a single source SupervisionSegment, given the silence intervals, without splitting it into duplicates with different offsets and durations?

> Why does the "filtered" recording need to be saved beforehand, except perhaps for loading efficiency?

To simplify further work with a cleaner dataset, and to save disk space.

@desh2608 (Collaborator) commented Oct 5, 2023

> > Again, why can this not be done by appending the cuts corresponding to the supervisions?
>
> Could you give a concrete example of how exactly we can override a single source SupervisionSegment, given the silence intervals, without splitting it into duplicates with different offsets and durations?
>
> > Why does the "filtered" recording need to be saved beforehand, except perhaps for loading efficiency?
>
> To simplify further work with a cleaner dataset, and to save disk space.

┌─────────────────────────────────────────────────────────────────────┐
│             Original recording (with speech and noise)              │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
                    .─────────────────────────────.
                   (   Speech activity detection   )
                    `─────────────────────────────'
                                   │
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   Cut with 2 supervision segments                   │
│            ◁─────────────────▷            ◁─────────────────▷       │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   │
                                   ▼
               .───────────────────────────────────────.
              (   cut.trim_to_supervision_segments()    )
               `───────────────────────────────────────'
                                   │
                      ┌────────────┴───────────────────┐
                      │                                │
                      ▼                                ▼
             ┌────────────────┐              ┌──────────────────┐
             │  Speech cut 1  │              │   Speech cut 2   │
             └────────────────┘              └──────────────────┘
                      │                                │
                      └──────────────┬─────────────────┘
                                     │
                                     ▼
                        .─────────────────────────.
                       (         append()          )
                        `─────────────────────────'
                                     │
                                     ▼
                 ┌───────────────────────────────────────┐
                 │         Combined speech cuts          │
                 └───────────────────────────────────────┘
                                     │
                                     ▼
                        .─────────────────────────.
                       (     cut.save_audio()      )
                        `─────────────────────────'
                                     │
                                     │
                                     │
                                     ▼
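                  ┌───────────────────────────────────────┐
                  │     Saved speech-only recording       │
                  └───────────────────────────────────────┘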

@rilshok (Contributor, Author) commented Oct 5, 2023

      1               2       34      5 6     7     8          9     
┌─────────────────────────────────────────────────────────────────────┐
│                        Cut with Supervision                         │
│     ◁───────────────.───────.▷                                      │
|                     ◁───────.───────.───────.─────▷                 │─┐
│                     .       .       . ◁─────.─────────────────▷     | │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
                      .       .    │  .       .                         │
                      .       .    ▼  .       .                         │
                    .─────────────────────────────.                     │
                   (   Speech activity detection   )                    │
                    `─────────────────────────────'                     │
                      .       .    │  .       .                         │
                      .       .    │  .       .                         │
                      .       .    ▼  .       .                         │
┌─────────────────────.───────.───────.───────.───────────────────────┐ │
│                     .    Silence Supervision.                       │ │
│                     ◁───────▷       ◁───────▷                       │ │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
                      .       .    │  .  ┌──────────────────────────────┘
                      .       .    ▼  .  ▼    .
                      .──────────────────────────.
                     (   Some kind of procedure   )
                      `──────────────────────────'
                      .       .    │  .       .
                      .       .    ▼  .       .
┌─────────────────────.───────.───────.───────.───────────────────────┐
│                     .  Combined speech cuts .                       │
│     ◁───────────────/////////▷      /////////                       │
|                     /////////◁──────/////////─────▷                 │
│                     /////////       /////////◁────────────────▷     |
└─────────────────────────────────────────────────────────────────────┘

@rilshok (Contributor, Author) commented Oct 5, 2023

How can we describe this with the procedure you suggest?

@rilshok (Contributor, Author) commented Oct 5, 2023

How can such a resulting Cut be described? Is there any way to guarantee that, when the audio is loaded with load_audio, the numpy array will be shorter than the original and contain no silence segments, and that only three SupervisionSegments will remain in the cut.supervisions list?

@desh2608 (Collaborator) commented Oct 5, 2023

      1               2       34      5 6     7     8          9     
┌─────────────────────────────────────────────────────────────────────┐
│                        Cut with Supervision                         │
│     ◁───────────────.───────.▷                                      │
|                     ◁───────.───────.───────.─────▷                 │─┐
│                     .       .       . ◁─────.─────────────────▷     | │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
                      .       .    │  .       .                         │
                      .       .    ▼  .       .                         │
                    .─────────────────────────────.                     │
                   (   Speech activity detection   )                    │
                    `─────────────────────────────'                     │
                      .       .    │  .       .                         │
                      .       .    │  .       .                         │
                      .       .    ▼  .       .                         │
┌─────────────────────.───────.───────.───────.───────────────────────┐ │
│                     .    Silence Supervision.                       │ │
│                     ◁───────▷       ◁───────▷                       │ │
└─────────────────────.───────.───────.───────.───────────────────────┘ │
                      .       .    │  .  ┌──────────────────────────────┘
                      .       .    ▼  .  ▼    .
                      .──────────────────────────.
                     (   Some kind of procedure   )
                      `──────────────────────────'
                      .       .    │  .       .
                      .       .    ▼  .       .
┌─────────────────────.───────.───────.───────.───────────────────────┐
│                     .  Combined speech cuts .                       │
│     ◁───────────────/////////▷      /////////                       │
|                     /////////◁──────/////////─────▷                 │
│                     /////////       /////////◁────────────────▷     |
└─────────────────────────────────────────────────────────────────────┘

What do the //// represent? Does this mean you are effectively removing the time segments corresponding to "silence" from your original supervision segments? If so, perhaps this can be achieved by adding interval-tree operations to the SupervisionSet class, as Piotr suggested. Once you have the refined segments, you can use cut.trim_to_supervision_groups() instead of cut.trim_to_supervisions() if you believe there may be overlapping segments that you want to keep together.
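
A hedged sketch of that flow, where remove_silence_intervals is a hypothetical helper (e.g. built on the interval-tree operation sketched earlier) and the cut is assumed to be a MonoCut; method names follow the mention above, exact signatures may differ:

from lhotse.utils import fastcopy

# Refine the supervisions by subtracting the detected silence intervals.
refined = remove_silence_intervals(cut.supervisions, silence_segments)  # hypothetical helper
cut = fastcopy(cut, supervisions=refined)
# Keep overlapping supervisions together, one cut per group of overlaps.
groups = cut.trim_to_supervision_groups()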

@rilshok (Contributor, Author) commented Oct 5, 2023

Yes, //// means that we trimmed the silence and refined the supervision intervals. In this PR, I implemented the required operations using IntervalTree to achieve the desired result. Since functionality like the refine_supervision_times proposed by Piotr is not yet part of the basic Cut methods, I may suggest modifying my proposed trim_inactivity workflow in the future, once the corresponding functionality is implemented.

@desh2608 (Collaborator) commented Oct 5, 2023

Since you already have the algorithm implemented, would you be willing to expose this functionality as a SupervisionSet method that this workflow can then simply use? That way, other users could also call the method directly.

@rilshok (Contributor, Author) commented Oct 5, 2023

Yes, of course, I am ready to implement such functionality in Cut, SupervisionSet, etc. But we need to agree precisely on how to test this functionality and where in the code to implement it. Personally, I think this functionality is quite exotic, and few people really need it directly when working with CutSet. But if you think it should be included in the backbone of the library, let's do it.

@desh2608 (Collaborator) commented Oct 5, 2023

Let's see what @pzelasko has to say about this.

@pzelasko (Collaborator) commented Oct 7, 2023

I'm still not sure. It looks like your example could be implemented with .truncate()/.split() to remove the detected non-speech segments and .append() to combine whatever cuts remain. The remaining issue is how to interpret an existing supervision segment being "masked out": once you truncate, it has to become two sub-segments, but unless you know the alignment, such a supervision is no longer meaningful for tasks such as ASR. You may, however, want to replace these sub-segments with a new, merged supervision in the resulting MixedCut. This could probably work and be implemented as part of the "refine" thingy. What do you think?

@rilshok (Contributor, Author) commented Oct 7, 2023

I think that the main purpose of the silence detector is to remove silence from a supervised segment of audio. All of the proposed alternatives to resaving the full track and its supervisions require splitting the supervised segment into parts. I believe that duplicating a supervision segment is disruptive in any task; a supervision is not divisible at all if it is represented only by an offset and a duration. I think the best way to natively implement the required functionality in lhotse is an AudioSource masking mechanism. The mask could be described, similarly to supervisions or alignments, using intervals, and be a serializable part of the Recording object.

But I would go further with this idea and say that a Recording could be described by a sequence of audio segments, each defined by an offset and a duration, such that when audio is loaded with load_audio, the segments are sequentially loaded from the AudioSource and concatenated. Such a description would allow not only cutting segments out of audio, but also making repeated, thinned, or truncated Recordings. This mechanism is already partially implemented in Recording, but currently there is effectively only one such segment.
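
A purely illustrative sketch of that idea; SegmentedRecording is a hypothetical class, not part of Lhotse:

from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

from lhotse import Recording

@dataclass
class SegmentedRecording:  # hypothetical, for illustration only
    recording: Recording
    # Spans to keep, as (offset_seconds, duration_seconds). Loading concatenates
    # them in order, which also permits repeated or reordered pieces of the source.
    spans: List[Tuple[float, float]]

    def load_audio(self) -> np.ndarray:
        chunks = [
            self.recording.load_audio(offset=offset, duration=duration)
            for offset, duration in self.spans
        ]
        return np.concatenate(chunks, axis=-1)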

@pzelasko (Collaborator) commented Oct 8, 2023

I appreciate the discussion, but the design you're suggesting is too complex and not necessary. You can already achieve sequential loading of various audio chunks using cuts. If you need to mask out some portions of the audio, you can do it post hoc by keeping the mask interval information either as overlapping supervisions (marked as special via ids or custom fields) or in the cut's custom fields. However, I don't really see why you would want to mask out silence; if you want to get rid of those segments of the recording instead, you can follow the procedure I suggested above.

To clarify, here's an example (which should be generalized to arbitrary lists of supervisions if you want to go this way):

# Sketch, not a full script: the Recording/SupervisionSegment fields are elided.
from lhotse import Recording, SupervisionSegment
from lhotse.utils import fastcopy

r = Recording(...)
sups = [
    SupervisionSegment(..., start=2, duration=5),
]

# Assume:
# silence_segments = [
#     SupervisionSegment(..., start=3, duration=2)
# ]
silence_segments = run_vad(r)  # placeholder for the VAD step

# Note: if we used the silence segments to cut supervisions, the original supervision
# would have been split into two sub-segments: start=2, duration=1 and start=5, duration=2.

# Instead of splitting, we create a cut that skips the silent segment in the recording
# and has a new supervision that omits the silence:
c = r.to_cut()
new = (
    c
    .truncate(offset=2, duration=1)  # Cut.truncate takes 'offset' rather than 'start'
    .append(
        c.truncate(offset=5, duration=2)
    )
)

# We will now add the updated supervision information. Note:
# - start=0 because we removed the initial silence
# - duration=3 because we removed the internal 2 s of silence that the original supervision over-spanned
new.supervisions = [fastcopy(sups[0], start=0, duration=3)]
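
Assuming the sketch above behaves as intended, the result can be checked roughly like this:

# The appended cut should span 3 seconds of speech, with the internal silence gone.
assert new.duration == 3
audio = new.load_audio()  # one channel, 3 * sampling_rate samples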

@rilshok (Contributor, Author) commented Oct 9, 2023

@pzelasko and @desh2608, thanks for the suggested solution; I appreciate your contributions to this discussion. I didn't realize there was an option to glue a Recording together piece by piece. The pseudocode you suggested looks like an applicable and lazy approach that doesn't require touching the audio file itself to produce the result. For my part, I am ready to change the trim_inactivity function to use this mechanism for producing the final Cuts. Do you think the inactivity-removal workflow will be in demand among lhotse users? Does it make sense to spend the effort to switch to your proposed approach in this PR?

@pzelasko (Collaborator) commented:

I think that'd work. May I ask what you are using it for? It seems like a pretty drastic modification; do you find it significantly helps with some task?

@rilshok (Contributor, Author) commented Oct 10, 2023

That is, would switching to your suggested approach using truncate and append help solve any problem in this PR? It's not a significant change overall, but it will speed up the generation of the resulting CutSets in the trim_inactivity pipeline when used interactively, such as in Jupyter, since you won't have to physically rearrange the audio when processing without saving to disk. If I misunderstood your question, let me know.

@pzelasko (Collaborator) commented:

What you're saying is clear. I meant: assuming this PR is finished and merged, how do you expect people to use it, in what situations, for which tasks, and what kind of result improvement would you expect? I'm asking because I've never encountered this technique used for any task (except maybe speaker ID recipes).

@rilshok (Contributor, Author) commented Oct 11, 2023

I think it can be useful to other community members for preparing datasets. At least my colleagues say they would like this Silero-VAD-based workflow to appear in lhotse. The task is formulated as follows: take any arbitrary dataset and re-save it with all silence sections deleted from the audio files, with the option of selecting one specific channel or converting to mono.

@pzelasko (Collaborator) commented:

OK, cool. If you're OK with that, let's move forward with the changes described above. I also suggest naming this workflow "remove_nonspeech", because "trim" implies only prefix and suffix modification.

@rilshok (Contributor, Author) commented Oct 11, 2023

How about the name remove_inactivity? The workflow allows choosing the activity detector, i.e. it does not have to be a VAD.

@rilshok (Contributor, Author) commented Oct 13, 2023

As a result of merging with the append method, I ran into an issue with MixedCut: it doesn't allow overriding supervisions. MixedCut assumes that all supervisions are stored within its child segments. However, in our task the typical situation is that an audio segment annotated by a supervision is divided into several smaller parts, so the original supervision should span several cuts rather than being confined within one of them. I suggest reconsidering how the supervisions property is handled in the MixedCut class, as it currently prevents users from attaching any supervision to it. This behavior seems to contradict the purpose of the supervisions field.

@desh2608 (Collaborator) commented:

> As a result of merging with the append method, I ran into an issue with MixedCut: it doesn't allow overriding supervisions. MixedCut assumes that all supervisions are stored within its child segments. However, in our task the typical situation is that an audio segment annotated by a supervision is divided into several smaller parts, so the original supervision should span several cuts rather than being confined within one of them. I suggest reconsidering how the supervisions property is handled in the MixedCut class, as it currently prevents users from attaching any supervision to it. This behavior seems to contradict the purpose of the supervisions field.

You can add a new track to the MixedCut that covers the full duration, if needed. I don't think we should change the whole implementation to benefit one minor use case.

@rilshok (Contributor, Author) commented Oct 13, 2023

As mentioned in my previous message, the current implementation of the supervisions property in the MixedCut class violates the Liskov substitution principle. I suggest reconsidering this implementation to adhere to the principle and enhance the functionality of the class.

@desh2608 (Collaborator) commented Oct 13, 2023

I get your point. This would require an extensive re-write of several modules. I don't have time for this at the moment, though.

Or perhaps we can get away with an easier change by allowing MixedCut to have "global" supervisions. However, by definition, a mixed cut is just a collection of cuts (as tracks), so I don't understand what it would mean for it to have such a "global" supervision.

@pzelasko (Collaborator) commented:

The easiest way to handle this might be to iterate over the tracks, remove all supervisions, and attach the new supervision to the first cut in the tracks list.
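
A hedged sketch of that workaround, assuming mixed is the MixedCut produced by append(), merged_sup is the new supervision, and every track holds a MonoCut:

from lhotse.utils import fastcopy

new_tracks = []
for i, track in enumerate(mixed.tracks):
    # Drop the per-track supervisions and attach the merged supervision
    # to the first track's cut only.
    inner = fastcopy(track.cut, supervisions=[merged_sup] if i == 0 else [])
    new_tracks.append(fastcopy(track, cut=inner))
mixed = fastcopy(mixed, tracks=new_tracks)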
