Silero VAD for cleaning the dataset from silence #1166
base: master
Conversation
An example use case: a set of audio recordings in which medical professionals discuss the results of patient examinations needs to be refined. These recordings come with noisy supervision data collected using a semi-automatic method. The peculiarity of this supervision data is that the annotation of one piece of text overlaps with another, so it requires refinement using automatic audio transcription algorithms. Before that refinement stage, the audio recordings must be prepared by removing all background noise and periods of silence without losing context. The segments that remain after removing inactivity should still include the overlaps in the supervision data, for subsequent refinement.
We need to be able to re-save the dataset with the silence sections cut out, so that when working with such datasets in the future we can be sure they are clean enough and contain no background noise. The presence of background noise in a dataset slows down experimentation, wastes disk space, and introduces bias into hypothesis testing.
I think I'm starting to understand what you are trying to achieve. Can you confirm the problem boils down to the following description? If the above statement is true, can this problem be solved using the following actions?
If the above interpretation is correct, the only thing we're missing in Lhotse is an implementation of the intersection of two supervision sets. This could be added as a new method on Cut/CutSet (see the sketch below).
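The inline example appears to have been lost in extraction. Judging by a later comment in this thread that mentions refine_supervision_times, the proposal was roughly along these lines; the sketch below is hypothetical, not an existing Lhotse API:

```python
# Hypothetical sketch of a supervision-intersection helper; the name
# refine_supervision_times is inferred from a later comment in this thread.
from lhotse.utils import fastcopy


def refine_supervision_times(supervisions, keep_intervals):
    """Intersect each supervision with a list of (start, end) intervals to keep,
    yielding the overlapping sub-segments with adjusted start/duration."""
    for sup in supervisions:
        sup_end = sup.start + sup.duration
        for start, end in keep_intervals:
            ov_start = max(sup.start, start)
            ov_end = min(sup_end, end)
            if ov_end > ov_start:
                yield fastcopy(sup, start=ov_start, duration=ov_end - ov_start)
```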
No, unfortunately the problem does not boil down to the description you suggested, because it does not take into account the need to refine the silence intervals inside the supervisions. The main task is to clean up the audio recording: we want to get a new, silence-free recording. I think that, in addition to the intersection procedure you suggest, other operations could be used as well.
Again, why can this not be done by appending the cuts corresponding to the supervisions? Why does the "filtered" recording need to be saved beforehand, except perhaps for loading efficiency?
Could you give a concrete example of how exactly we can override one single source SupervisionSegment, given the silence intervals, without splitting it into duplicates with different offsets and durations?
To simplify further work with a cleaner dataset, and to save disk space.
(timeline diagram: //// marks the trimmed silence intervals)
How can we describe this with the procedure you suggest?
How can such a resulting Cut be described? Is there any way to guarantee that, when loading the audio with load_audio, the NumPy array will be shorter than the original and will not contain silence segments, and that only three SupervisionSegments will remain in the cut.supervisions list?
What do the //// marks mean?
Yes, //// means that we trimmed the silence and refined the supervision intervals. In this PR, I implemented the required operations using IntervalTree to achieve the desired result. Since functionality like refine_supervision_times proposed by Peter is not yet part of the basic Cut methods, I may suggest modifying my proposed trim_inactivity workflow in the future, once the corresponding functionality is implemented.
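As an illustration of the interval arithmetic involved (a sketch, not the PR's actual code), the third-party intervaltree package can chop silence out of supervision intervals like this:

```python
# Sketch: subtracting silence from supervision intervals with the
# third-party `intervaltree` package (not the PR's actual code).
from intervaltree import IntervalTree

tree = IntervalTree()
tree.addi(2.0, 7.0, data="sup-1")  # one supervision spanning [2.0, 7.0)

for sil_start, sil_end in [(3.0, 5.0)]:  # silence detected at [3.0, 5.0)
    tree.chop(sil_start, sil_end)  # trims/splits intervals overlapping the silence

# Remaining speech intervals: [2.0, 3.0) and [5.0, 7.0)
print(sorted((iv.begin, iv.end) for iv in tree))
```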
Since you already have the algorithm implemented, would you be willing to contribute this functionality as a method on Cut/SupervisionSet?
Yes, of course, I am ready to implement such functionality in Cut, SupervisionSet, etc. But we need to agree precisely on how to test this functionality and where in the code to implement it. Personally, I think this functionality is quite exotic, and few people really need it directly when working with CutSet. But if you think it should be included in the backbone of the library, let's do it.
Let's see what @pzelasko has to say about this.
I'm still not sure. It looks like your example may be implemented with existing Cut operations.
I think that the main purpose of the silence detector is to remove silence from the supervised segment of audio. All of the proposed alternatives to re-saving the full track and its supervisions require splitting the supervised segment into parts, and I believe that duplicating a supervision segment is disruptive in any task: a supervision cannot be divided at all if it is represented by an offset and a duration.

I think the best way to natively implement the required functionality in Lhotse is an AudioSource masking mechanism. The mask could be described, similarly to supervisions or alignments, using intervals, and be a serializable part of the Recording object. I would go further with this idea and say that a Recording could be described by a sequence of audio segments, each defined by an offset and a duration, such that when audio is loaded with load_audio from an AudioSource, the segments are loaded sequentially and concatenated. Such a description would allow not only cutting segments out of the audio, but also making repeated, thinned, and truncated Recordings. This mechanism is already partially implemented in Recording; in effect, there is currently just a single such segment.
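To make the proposal concrete, here is a rough illustration of the kind of segment-sequence description being suggested; this is not an existing Lhotse structure, and the class and field names are invented for the example:

```python
# Invented illustration of the proposed segment-sequence source description;
# not an existing Lhotse structure.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class SegmentedSource:
    # Ordered (offset_seconds, duration_seconds) windows that are loaded
    # and concatenated, enabling cut-out, repeated, or thinned audio.
    segments: List[Tuple[float, float]]

    def load_audio(self, recording) -> np.ndarray:
        chunks = [
            recording.load_audio(offset=offset, duration=duration)
            for offset, duration in self.segments
        ]
        return np.concatenate(chunks, axis=-1)
```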
I appreciate the discussion, but the design you're suggesting is too complex and not necessary. You can already achieve sequential loading of various audio chunks using cuts. If you need to mask out some portions of the audio, you can do it post hoc by keeping the mask interval information either as overlapping supervisions (marked as special somehow: ids, custom fields) or in the cut custom fields. However, I don't really see why you would want to mask out silence; if you want to get rid of those segments of the recording instead, you can follow the procedure I suggested above. To clarify, here's an example (which should be generalized to arbitrary lists of supervisions if you want to go this way):

```python
from lhotse import Recording, SupervisionSegment
from lhotse.utils import fastcopy

r = Recording(...)
sups = [
    SupervisionSegment(..., start=2, duration=5),
]
# Assume the VAD (run_vad is a placeholder) returns:
# silence_segments = [
#     SupervisionSegment(..., start=3, duration=2)
# ]
silence_segments = run_vad(r)
# Note: if we used the silence segments to cut the supervisions, the original
# supervision would have been split into two sub-segments:
# start=2, duration=1 and start=5, duration=2.
# Instead of splitting, we create a cut that skips the silent part of the
# recording and carries a single new supervision that omits the silence:
c = r.to_cut()
new = (
    c
    .truncate(offset=2, duration=1)
    .append(
        c.truncate(offset=5, duration=2)
    )
)
# We now add the updated supervision information. Note:
# - start=0 because we removed the initial silence;
# - duration=3 because we removed the internal 2 s of silence that the
#   original supervision over-spanned.
new.supervisions = [fastcopy(sups[0], start=0, duration=3)]
```
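Tying this back to the earlier question about load_audio guarantees, here is a quick sanity check on the example above (a sketch; it assumes the supervision assignment in the example works as written, which touches the MixedCut caveat discussed further below):

```python
# Sanity checks for the append-based example above (sketch; `new` and `r`
# come from that example).
assert new.duration == 3  # 1 s + 2 s of speech; the 2 s of silence are gone
audio = new.load_audio()  # MixedCut.load_audio returns a (1, num_samples) array
assert audio.shape[1] == round(3 * r.sampling_rate)
assert audio.shape[1] < r.num_samples  # shorter than the original recording
assert len(new.supervisions) == 1     # the single refined supervision remains
```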
@pzelasko and @desh2608, thanks for the suggested solution; I appreciate your contribution to this discussion. I really didn't realize there was an option to glue the Recording together piece by piece. The pseudocode you suggested looks like an applicable, lazy approach that doesn't require touching the audio file itself to produce the result. For my part, I am ready to make the required changes to the workflow function.
I think that'd work. May I ask what you are using it for? It seems like a pretty drastic modification; do you find it significantly helps with some task?
That is, would switching to the truncate/append approach you suggested change anything here?
What you're saying is clear. I meant: assuming this PR is finished and merged, how do you expect people to use it, in what situations, for which tasks, and what kind of improvement in results would you expect? I'm asking because I've never encountered this technique being used for any task (except maybe speaker ID recipes).
I think it can be useful to other community members for the task of preparing datasets. At least, my colleagues say they would like this Silero-VAD-based workflow to appear in Lhotse to solve this task. The task is formulated as follows: take any arbitrary dataset and re-save it, deleting all silence sections in the audio files, with the additional option of selecting one specific channel or converting to mono.
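For reference, the detection side of such a workflow typically follows the Silero VAD usage from its README; a minimal sketch (the file name and sampling rate are placeholders):

```python
# Minimal Silero VAD sketch following the silero-vad README; the file name
# and sampling rate are placeholders.
import torch

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, *_) = utils

SR = 16000
wav = read_audio('recording.wav', sampling_rate=SR)
speech = get_speech_timestamps(wav, model, sampling_rate=SR)
# `speech` holds {'start': ..., 'end': ...} dicts in samples; convert to seconds:
speech_secs = [(d['start'] / SR, d['end'] / SR) for d in speech]
print(speech_secs)
```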
OK, cool. If you're OK with that, let's move forward with the changes described above. I also suggest naming this workflow "remove_nonspeech", because "trim" implies modifying only the prefix and suffix.
How about the name speech_only?
As a result of merging using the append operation, we get a MixedCut that no longer covers the full duration of the original recording.
You can add a new track to the MixedCut which covers the full duration, if it is needed. I don't think we should change the whole implementation to benefit one minor use case.
As mentioned in my previous message, the current implementation of the MixedCut does not allow attaching the refined supervision to the mix as a whole; supervisions belong to the individual tracks.
I get your point. This would require an extensive rewrite of several modules, and I don't have time for this at the moment. Or perhaps we can get away with an easier change by allowing supervisions to be attached directly to the MixedCut.
The easiest way to handle this might be to iterate over the tracks, remove all supervisions, and attach the new supervision to the first cut in the tracks list.
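A sketch of that workaround; mixed and new_sup are hypothetical variables standing for the appended MixedCut and the refined supervision:

```python
# Sketch of the per-track workaround; `mixed` is a MixedCut and `new_sup`
# is the refined SupervisionSegment (both hypothetical here).
from lhotse.utils import fastcopy

tracks = []
for i, track in enumerate(mixed.tracks):
    cut = track.cut.drop_supervisions()  # strip per-track supervisions
    if i == 0:
        # Attach the single refined supervision to the first track's cut.
        cut = fastcopy(cut, supervisions=[new_sup])
    tracks.append(fastcopy(track, cut=cut))
mixed = fastcopy(mixed, tracks=tracks)
```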
I intend to add a new workflow to Lhotse for processing arbitrary audio datasets by removing silence and preserving only speech, using Silero VAD, which can accurately detect speech in an audio stream. The workflow should help users quickly and efficiently convert arbitrary datasets by cutting out silence and retaining only speech. An important aspect of this process is the ability to preserve all supervisions for each segment while accounting for the changes made to the audio file. Before accepting this PR, I invite you to review my code. Currently, the code handles the task only under trivial conditions, processing MonoCut objects and not supporting other Cut types. I want to add support for the other Cut types, but I'm not sure about the best approach at the moment. I would appreciate your comments and suggestions for improving the code. I would also be glad if you could try running the code and share your impressions. I'm confident that your feedback and suggestions will help make it even better.
Key Changes
- Added the speech_only function, which processes audio files by removing silence and preserving speech only.
- Added the speech_only workflow, which enables processing datasets from the CLI.
- The code is written with the intention of being usable in various scenarios similar in concept to the addressed task.
Issues Requiring Discussion
There are several places in the code where I'm uncertain about the choice of implementation. In those places I raise NotImplementedError to indicate that I need assistance in selecting the best approach. This mainly concerns handling subclasses of the Cut class other than MonoCut; I'm not sure about the best way to handle these cases.
Additionally, I have the _to_mono function, which should convert Recording objects to mono for speech analysis with Silero VAD. I'm confident there is an elegant way to do this, so please provide some guidance.
I would also like feedback on function naming, variable naming, and code architecture. If you have specific suggestions for improvement, I would be glad to hear them.