This repository contains the code for the paper "voc2vec: A Foundation Model for Non-Verbal Vocalization", accepted at ICASSP 2025.
We propose voc2vec, a novel foundation model designed specifically for non-verbal human data and pre-trained exclusively on open-source non-verbal audio datasets. We employ a collection of 10 datasets covering around 125 hours of non-verbal audio.
Experimental results show that voc2vec is effective at non-verbal vocalization classification and outperforms conventional speech and audio foundation models. Moreover, voc2vec consistently outperforms strong baselines, namely openSMILE and emotion2vec, on six different benchmark datasets.
voc2vec is the first universal representation model for vocalization tasks.
The core contribution of voc2vec lies in the careful selection of diverse, open-source datasets for pre-training, chosen specifically for their focus on non-verbal vocalizations. Together, these datasets cover around 125 hours of audio, exposing the model to a wide variety of human vocalizations that are typically underrepresented in speech datasets. Each dataset represents a different form of non-verbal communication, such as emotional bursts, human reactions, and environmental sounds involving vocal interaction. The pre-training datasets are summarized in the table below.
Dataset | Dur. (h) | # Samples | Avg Dur. (s) |
---|---|---|---|
AudioSet (vocalization) | 36.94 | 13439 | 9.90 |
FreeSound (babies) | 23.42 | 1450 | 58.15 |
HumanVoiceDataset | 0.06 | 179 | 1.21 |
NNIME | 3.55 | 5596 | 2.28 |
NonSpeech7K | 6.72 | 6983 | 3.46 |
ReCANVo | 2.46 | 7077 | 1.25 |
SingingDatabase | 3.97 | 113 | 126.48 |
TUT (babies) | 13.17 | 1540 | 30.79 |
VocalSketch | 10.53 | 10705 | 3.54 |
VocalSound | 24.37 | 20985 | 4.18 |
Voc125 (Total) | 125.19 | 68067 | 6.67 |
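For reference, the per-dataset statistics reported above (total duration, number of samples, average clip length) can be reproduced for any local audio collection with a short script like the sketch below; the directory path and file extension are assumptions for the example, not part of the released pipeline.

```python
import librosa
from pathlib import Path

# Compute total duration (hours), sample count, and average clip length (seconds)
# for a local dataset; "some_dataset" and the .wav extension are placeholders.
durations = []
for wav_path in Path("some_dataset").glob("**/*.wav"):
    audio, sr = librosa.load(wav_path, sr=None)
    durations.append(len(audio) / sr)

total_hours = sum(durations) / 3600
avg_seconds = sum(durations) / len(durations)
print(f"{total_hours:.2f} h | {len(durations)} samples | {avg_seconds:.2f} s avg")
```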
We evaluate voc2vec on six classification tasks using diverse datasets, each covering different types of non-verbal vocalizations. The datasets and their characteristics are summarized in the table below.
Dataset | # Classes | Dur. (h) | # Samples | Avg Dur. (s) |
---|---|---|---|---|
ASVP-ESD | 13 | 15.07 | 12625 | 4.30 |
ASVP-ESD (babies) | 7 | 2.91 | 1339 | 8.22 |
CNVVE | 6 | 0.20 | 921 | 0.78 |
Donate A Cry | 5 | 0.88 | 457 | 6.93 |
NonVerbal Vocalization | 16 | 0.60 | 800 | 3.10 |
VIVAE | 6 | 0.27 | 1085 | 0.90 |
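As a rough illustration of how the released checkpoint could be adapted to one of these benchmarks, the sketch below attaches a classification head for a 6-class task and runs a single training step with the `transformers` API. The number of labels, the dummy waveform, and the placeholder label are assumptions for the example, not the evaluation protocol or hyperparameters used in the paper.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "alkiskoudounas/voc2vec"
num_labels = 6  # e.g., CNVVE or VIVAE; set this to the class count of the target benchmark

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
# If the checkpoint already ships a differently-sized head, add ignore_mismatched_sizes=True.
model = AutoModelForAudioClassification.from_pretrained(model_id, num_labels=num_labels)

# A dummy 2-second clip at 16 kHz stands in for a real labeled example.
waveform = np.zeros(32000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([3])  # placeholder class index

# Passing `labels` makes the model return a cross-entropy loss, ready for back-propagation.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```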
The model can be loaded with the Hugging Face `transformers` library. First, install the following dependencies:
```bash
pip install transformers
pip install librosa
```
Then, you can load and use the model as follows:
```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load an audio file, resampled to the 16 kHz rate the model expects
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec")

# Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

# Compute logits (no gradients needed at inference time)
with torch.no_grad():
    logits = model(**inputs).logits
```
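If you need a class prediction from these logits, a minimal follow-up is sketched below. Note that if the checkpoint ships only the pre-trained backbone, the classification head (and hence `model.config.id2label`) is randomly initialized, so the predicted label is only meaningful after fine-tuning on a labeled downstream dataset.

```python
# Map the logits to a predicted class index and, if defined, a label name.
# Only meaningful once the classification head has been fine-tuned.
predicted_id = torch.argmax(logits, dim=-1).item()
predicted_label = model.config.id2label.get(predicted_id, str(predicted_id))
print(predicted_id, predicted_label)
```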
We open-source three models:
Model | Description | Link |
---|---|---|
voc2vec | Pre-trained model on 125 hours of non-verbal audio. | 🔗 Model |
voc2vec-as-pt | Continues pre-training from a model that was initially trained on the AudioSet dataset. | 🔗 Model |
voc2vec-ls-pt | Continues pre-training from a model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
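All three checkpoints can be loaded with the same `transformers` calls shown earlier; only the repository id changes. The identifier below for the AudioSet-initialized variant is an assumption derived from the model name, so verify it against the linked model card.

```python
from transformers import AutoFeatureExtractor, AutoModel

# Hypothetical repository id based on the model name above; check the model card for the exact id.
model_id = "alkiskoudounas/voc2vec-as-pt"

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
backbone = AutoModel.from_pretrained(model_id)  # returns frame-level hidden states for feature extraction
```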
For more information about the model, please refer to the paper.
Important
If you use this model in your research, please cite the following paper:
```bibtex
@INPROCEEDINGS{koudounas25_icassp,
  author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
  year={2025}
}
```
This code and the models are released under the Apache 2.0 license. See the LICENSE file for more details.