This is an unofficial implementation of the "Personal VAD" speaker-conditioned voice activity detection method introduced in the Personal VAD: Speaker-Conditioned Voice Activity Detection paper referenced below.
This project was done as part of my bachelor's thesis: SEDLÁČEK, Šimon. Personal Voice Activity Detection. Brno, 2021. Bachelor's thesis. Brno University of Technology, Faculty of Information Technology. Supervisor: Ing. Ján Švec.
@inproceedings{personalVAD,
author={Shaojin Ding and Quan Wang and Shuo-Yiin Chang and Li Wan and Ignacio {Lopez Moreno}},
title={Personal VAD: Speaker-Conditioned Voice Activity Detection},
year=2020,
booktitle={Proc. Odyssey 2020 The Speaker and Language Recognition Workshop},
pages={433--439},
doi={10.21437/Odyssey.2020-62},
url={https://arxiv.org/abs/1908.04284}}
Ding, S., Wang, Q., Chang, S.-Y., Wan, L. and Lopez Moreno, I. Personal VAD: Speaker-Conditioned Voice Activity Detection. In: Proc. Odyssey 2020 The Speaker and Language Recognition Workshop. 2020, p. 433–439. DOI: 10.21437/Odyssey.2020-62. Available at: https://arxiv.org/abs/1908.04284.
src/ -- contains all the personal VAD model definitions, training scripts, utterance concatenation, and feature extraction scripts.
src/kaldi/ -- contains the Kaldi-specific project files used for data augmentation. These will be copied to kaldi/egs/pvad/ once Kaldi is set up.
doc/ -- contains the personal VAD thumbnail image and text files with the evaluation results for all trained models, both for clean and augmented speech.
data/ -- contains the resources for training and evaluation. The included LibriSpeech dataset sample is located here along with its alignments. The extracted features will also be placed in this directory once the feature extraction process is done.
data/embeddings/ -- contains pre-computed d-vector speaker embeddings for all speakers in the LibriSpeech corpus, used by the feature extraction scripts.
data/eval_dir/ -- contains the necessary resources used by the model evaluation scripts.
data/eval_dir/models/ -- contains all pre-trained personal VAD models.
data/eval_dir/embeddings*/ -- contains pre-computed LibriSpeech speaker embeddings (d-vectors, x-vectors, i-vectors) for evaluation purposes.
data/eval_dir/data/ -- contains the sample evaluation dataset used by the evaluation scripts.
prepare_dataset_features.sh -- runs data preparation and feature extraction.
prepare_kaldi_dir.sh -- use this script to prepare the Kaldi project folder after Kaldi is set up and compiled.
train_pvad.sh -- use this script to launch demonstration model training.
xsedla1h_thesis.pdf/.zip -- the thesis .pdf file and the .zip file with the LaTeX sources.
requirements.txt -- a list of Python dependencies required by the project.
IMPORTANT! Before running any of the data preparation or evaluation scripts, please make sure that you have a CUDA-compatible GPU available on the current system. Otherwise, both feature extraction and model evaluation will take much longer to execute. All scripts are meant to be run from the repository root folder.
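To quickly verify that a usable GPU is visible, you can check the NVIDIA driver and, assuming the project's Python dependencies (including PyTorch) are already installed, query CUDA availability from Python:

$ nvidia-smi
$ python -c "import torch; print(torch.cuda.is_available())"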
You can download a ~800 MB version of this repository, which contains all the trained models and a small demo dataset, from this address: https://nextcloud.fit.vutbr.cz/s/xSRBxZmmAJJMHNe
Before using any of the included software, please make sure to install all the necessary Python dependencies.
To ensure correct dependency installation, one can use conda virtual environments. For this purpose, please make sure you have Anaconda installed. If you need to install conda, visit for example: https://conda.io/projects/conda/en/latest/user-guide/install/index.html.
After conda is installed and configured, we need to create and activate a new virtual environment:
conda create -n personal_vad python=3.7
conda activate personal_vad
From here, we can install the necessary dependencies:
pip install -r requirements.txt
Of course, it is possible to install the dependencies without conda. However, using a virtual environment is a neat way to ensure portability and avoid package conflicts.
Requires Python version 3.7 or newer.
This step is purely OPTIONAL. Augmentation will require an additional 16 GB of disk space, since the MUSAN and rirs_noises corpora have to be downloaded.
In order to fully utilize the feature extraction pipeline, please install the Kaldi Speech Recognition Toolkit in the root directory of this project. After downloading Kaldi, please compile the toolkit, as the augmentation process will need some of the Kaldi binaries. More information can be found at: https://kaldi-asr.org/
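A rough sketch of the typical steps, assuming you clone Kaldi into the repository root (the exact build procedure may differ on your system, so please refer to Kaldi's own INSTALL instructions):

$ git clone https://github.com/kaldi-asr/kaldi.git
$ cd kaldi/tools
$ extras/check_dependencies.sh
$ make -j 4
$ cd ../src
$ ./configure --shared
$ make -j 4
$ cd ../..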
Once Kaldi is compiled, please run the following script:
$ bash prepare_kaldi_dir.sh
This script will prepare the project folder in kaldi/egs/pvad, copy the required script files from src/kaldi/egs/pvad/, and create the necessary symlinks to Kaldi binaries and utilities.
If you now wish to run data augmentation along with feature extraction, change the AUGMENT variable in prepare_dataset_features.sh to true. The generated dataset will then be augmented before feature extraction. If you wish to alter the augmentation parameters, go to the kaldi/egs/pvad/ directory and see the reverberate_augment.sh script.
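For example, assuming the variable is defined as a plain shell assignment (e.g. AUGMENT=false) in the script's preamble, you can locate it and flip it like this, or simply edit the file in a text editor:

$ grep -n "AUGMENT" prepare_dataset_features.sh
$ sed -i 's/^AUGMENT=false/AUGMENT=true/' prepare_dataset_features.sh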
There is one LibriSpeech subset enclosed in the data/ directory: dev-clean, or rather its .tar.gz archive. Additionally, there is a .zip file containing the LibriSpeech alignment information.
After extracting the archives, other LibriSpeech subset directories will also be visible; however, they only contain the alignment files obtained from https://zenodo.org/record/2619474. If you wish to use the other LibriSpeech subsets, you need to download them from https://www.openslr.org/12 and extract them in the data/ directory.
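As an illustration, one of the additional subsets could be fetched and unpacked roughly like this (the URL follows the OpenSLR layout; check https://www.openslr.org/12 for the exact file names, and make sure the resulting directory structure matches the one produced by the included dev-clean archive):

$ cd data/
$ wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
$ tar -xzf train-clean-100.tar.gz
$ cd ..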
Before running the data preparation and feature extraction processes, please take a look at the prepare_dataset_features.sh script. In its preamble, this script contains a few configuration options, for example which LibriSpeech subsets are to be used, whether to perform data augmentation, or how many CPU workers to use for feature extraction multiprocessing.
By default, one can simply run:
$ bash prepare_dataset_features.sh 0
This will run a demonstration version of data preparation and feature extraction without augmentation. The generated demonstration dataset will consist of 500 unique concatenated utterances from the dev-clean subset and will be placed in data/features_demo once the feature extraction process is done.
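Once the script finishes, a quick sanity check of the output might look like this (the exact file layout inside the feature directory depends on the extraction scripts):

$ ls data/features_demo | head
$ ls data/features_demo | wc -l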
To train a personal VAD model using the sample features in data/eval_dir/data/, please see the train_pvad.sh script. This script contains examples of running the Python model training scripts and will guide you if you wish to train your own model on your own data.
The script can be run for example as:
$ bash train_pvad.sh et
where et is the PVAD architecture name specification. The other alternatives are vad, st, and set.
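If you want to try all of the demonstration configurations in one go, a simple loop over the documented architecture names works as well:

$ for arch in vad st et set; do bash train_pvad.sh "$arch"; done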
There are two tools to evaluate the trained models in this repository.
The first is the src/evaluate_models.py script, which will take the sample validation data in the data/eval_dir/data/ directory and run evaluation on all models present in the data/eval_dir/models/ directory (the whole process is quite time-consuming and a GPU is absolutely crucial here). The evaluation results are written to standard output.
It can be run, for example, as:
$ python src/evaluate_models.py
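Since the results go to standard output, you may want to keep a copy on disk as well (the output file name below is arbitrary):

$ python src/evaluate_models.py | tee evaluation_results.txt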
To evaluate each model individually, the src/personal_vad_evaluate.ipynb notebook can be used, providing better control over the evaluation process. The notebook NEEDS access to the data/eval_dir/ directory to work correctly -- to ensure this, simply run the notebook inside the src/ directory.
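Assuming Jupyter is installed in your environment (it is not necessarily listed in requirements.txt), the notebook can be launched from the src/ directory like this:

$ cd src/
$ jupyter notebook personal_vad_evaluate.ipynb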