A two-stage audio processing tool that combines Voice Activity Detection (VAD) with speech enhancement to strip non-speech regions and denoise audio files.
-
Two-Stage Processing Pipeline:
- Stage 1: Uses Silero VAD to detect and extract speech segments
- Stage 2: Applies the MP-SENet deep learning model to remove noise from the extracted speech
-
Memory-Efficient Processing:
- Processes audio in chunks so that long files do not exhaust memory
- Automatically converts audio to the required format (16kHz mono WAV)
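The chunking idea above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper name `chunk_ranges` and the 30-second chunk length are assumptions chosen for the example. Only the 16 kHz sample rate comes from this README.

```python
from typing import Iterator, Tuple

SAMPLE_RATE = 16_000  # 16 kHz mono, the format named in this README


def chunk_ranges(
    total_samples: int,
    chunk_seconds: float = 30.0,  # assumed chunk length, not taken from the project
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices that cover the signal in fixed-size chunks."""
    chunk_size = int(chunk_seconds * sample_rate)
    if chunk_size <= 0:
        raise ValueError("chunk_seconds * sample_rate must be at least one sample")
    for start in range(0, total_samples, chunk_size):
        # The last chunk is shorter when the length is not a multiple of chunk_size.
        yield start, min(start + chunk_size, total_samples)


# 70 seconds of audio split into 30-second chunks: two full chunks and one of 10 seconds.
ranges = list(chunk_ranges(70 * SAMPLE_RATE))
print(ranges)
```

Each `(start, end)` pair can then be used to slice the waveform, so only one chunk needs to be held in memory by the model at a time.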
-
User-Friendly Interface:
- Beautiful Gradio web interface
- Real-time progress reporting
- Compare original, VAD-processed, and denoised versions
-
Create a new conda environment:
conda create -n speech_enhance_new python=3.9
conda activate speech_enhance_new
-
Install dependencies:
conda install numpy=1.22.4 scipy=1.7.3 librosa=0.9.2
pip install torch torchaudio gradio pydub rich
-
Download the MP-SENet model:
- Place the model file in MP-SENet/best_ckpt/g_best_dns
- Place the config file in MP-SENet/best_ckpt/config.json
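Before launching the app, it can help to confirm that both files are where the setup steps expect them. The check below is a small sketch, not part of the project: the function name `missing_checkpoint_files` is hypothetical, and the two paths are the ones listed in this README, resolved relative to the project root.

```python
from pathlib import Path
from typing import List

# Paths as given in the setup steps, relative to the project root.
REQUIRED_FILES = (
    "MP-SENet/best_ckpt/g_best_dns",
    "MP-SENet/best_ckpt/config.json",
)


def missing_checkpoint_files(project_root: str = ".") -> List[str]:
    """Return the required MP-SENet files that are not present under project_root."""
    root = Path(project_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files()
    if missing:
        print("Missing MP-SENet files:", ", ".join(missing))
    else:
        print("MP-SENet checkpoint and config found.")
```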
-
Run the app:
python run.py
-
Open your web browser and navigate to the URL that the app prints on startup
-
Upload an audio file and adjust the parameters:
- VAD Threshold: Controls voice detection sensitivity (0.1-0.9)
- Max Silence Gap: Controls merging of close speech segments (1-10s)
-
Compare the results:
- Original Audio
- VAD Processed (Speech Only)
- Final Denoised
-
VAD Threshold (0.1-0.9):
- Higher values = stricter voice detection
- Lower values = more lenient detection
- Default: 0.5
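To see why a higher threshold is stricter, consider a simplified model in which the VAD emits one speech probability per frame and a frame counts as speech when its probability reaches the threshold. This is an illustrative sketch only: Silero VAD's real post-processing is more involved, and `frames_to_segments` and the sample probabilities are made up for the example.

```python
from typing import List, Tuple


def frames_to_segments(probs: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive frames with probability >= threshold into (start, end) frame ranges.

    The end index is exclusive. This is a simplification, not Silero VAD's actual logic.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins
        elif p < threshold and start is not None:
            segments.append((start, i))  # the speech run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(probs)))  # the run extends to the end of the audio
    return segments


probs = [0.2, 0.6, 0.8, 0.4, 0.7, 0.95, 0.3]  # hypothetical per-frame speech probabilities
lenient = frames_to_segments(probs, threshold=0.1)
default = frames_to_segments(probs, threshold=0.5)
strict = frames_to_segments(probs, threshold=0.9)
print(lenient, default, strict)
```

With these made-up values, the lenient setting keeps every frame, the default keeps two short runs, and the strict setting keeps only the single most confident frame.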
-
Max Silence Gap (1-10s):
- Maximum silence duration between two speech segments for them to be merged into one continuous segment
- Higher values = fewer segments but may include more silence
- Default: 4.0s
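The merging behavior described above can be sketched as an interval merge over (start, end) times in seconds. This is an assumption about how the parameter is applied, not the project's actual implementation: the name `merge_segments` is hypothetical, and whether a gap exactly equal to the limit is merged is a choice made here for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_segments(segments: List[Segment], max_silence_gap: float = 4.0) -> List[Segment]:
    """Merge speech segments separated by at most max_silence_gap seconds of silence."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_silence_gap:
            # The gap is short enough: extend the previous segment across the silence.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


speech = [(0.0, 2.0), (3.5, 5.0), (12.0, 14.0), (15.0, 16.0)]  # hypothetical VAD output
print(merge_segments(speech, max_silence_gap=1.0))
print(merge_segments(speech, max_silence_gap=4.0))
print(merge_segments(speech, max_silence_gap=10.0))
```

In this example, raising the gap from 1 s to 10 s reduces the output from three segments to one, at the cost of including the 7 seconds of silence between 5.0 s and 12.0 s.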
This project combines two powerful models:
- Silero VAD for Voice Activity Detection
- MP-SENet for Speech Enhancement
This project is licensed under the terms specified in the MP-SENet repository.