Datasets

Audio Classification datasets that are useful for practical tasks that can be perform on microcontrollers and small embedded systems.

Relevant tasks:

Wakeword detection
Keyword spotting
Speech Command Recognition
Noise source identification
Smart home event detection. Firealarm,babycry etc

Not so relevant:

(general) Automatic Speech Recognition
Speaker recognition/identification

To be reviewed.

DCASE 2016.

Not relevant

NOIZEUS: A noisy speech corpus for evaluation of speech enhancement algorithms 30 sentences corrupted by 8 real-world noises.
VoxCeleb, 100k utterances for 1251 celebrities. Task: Speaker Reconition.
Speakers in the Wild Task: Speaker Reconition.
Google AudioSet. 2,084,320 human-labeled 10-second sounds, 632 audio event classes. Based on YouTube videos.
Whale Detection Challenge. https://www.kaggle.com/c/whale-detection-challenge
Mozilla Common Voice, crowd sourcing. Compiled dataset on Kaggle, 500 hours of transcribed sentences. Has speaker demographics. Task: Automatic Speech Recognition. Not something to do on microcontroller. Could maybe be used for Transfer Learning for more relevant speech tasks.
DCASE2018 Task 5. Domestic activities. 10 second segments. 9 classes. From 4 separate microphone arrays (in single room). Each array has 4 microphones

Relevant but lacking

Hey Snips. https://github.com/snipsco/keyword-spotting-research-datasets Task: Wakeword detetion, Vocal Activity Detection. Restricted licencese terms. Academic/research use only. Must contact via email for download. By Snips, developing Private-by-Design, decentralized, open source voice assistants.

Relevant

TUT Rare Sound Events 2017

Used for DCASE2017 Task 2. Baby crying, Glass Breaking, Gunshot. 3 classes, but separate binary classifiers encouraged. Part of TUT Acoustic Scenes 2016. Train 100 hours. Approx 100 sound examples per class isolated, 500 mixtures (weakly labeled). Event detection. Event-based error rate. Onset only. 500 ms collar. Also F1 score. Baseline system available. FC DNN. 40 bands melspec, 5 frames. F1 0.72 Around 11 other systems submitted. Ranging 0.65-0.93 F1 score. http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-rare-sound-event-detection

Relevant as examples of single-function systems, security

YorNoise

Dataset from York with "traffic" and "rail" classes. Same structure as Urbansoun8k dataset. 1527 samples a 4 seconds. Split into 10 folds. https://github.com/fadymedhat/YorNoise

DCASE2018 Task 2, General Purpose Audio Tagging

http://dcase.community/challenge2018/task-general-purpose-audio-tagging

Task: Acoustic event tagging. Based on FreeSound data. 41 classes. Using AudioNet ontology 9.5k samples train, ~3.7k manually-verified annotations and ~5.8k non-verified annotations. Test ~1.6k manually-verified annotations. Systems reach 0.90-0.95 mAP@3. Baseline CNN on log melspec. 0.70 mAP@3

Relevant for: context-aware-computing, smarthome?

DCASE2018 Task3, Bird Audio Detection.

Binary classification.

Relevant for on-edge pre-processing / efficient data collection.

DCASE2018 Task4

Event Detection with precise time information. Events from domestic tasks. 10 classes. Subset of Audioset.

Relevant for: smarthome and context-aware-computing

TUT Urban Acoustic Scenes 2018

Used in DCASE2018 Task 1.

Task: Acoustic Scene Classification. 10 classes. airport,shopping_mall,metro_station About 30GB of data, around 24 hours training. One variant dataset has parallel recording with multiple devices, for testing mismatched case.

Relevant for: context-aware-computing?

TUT Acoustic Scenes 2017. Used for DCASE2017 Task 1.

Scenes from urban environments. 15 classes. 10 second segments. Baseline system available. 18% F1 Relatively hard, systems achieved 35-55% F1.

Relevant for context-aware-computing?

DCASE2017 Task 4, Large Scale Sound Event detection

http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-large-scale-sound-event-detection 17 classes from 2 categories, Warning sounds and Vehicle sounds.

Relevant for autonomous vehicles?

TUT Sound Events 2017

Used for DCASE2017 Task 3, Sound event detection in real life audio

Events related to car/driving. 6 classes. Multiple overlapping events present. Both in training and testing. Hard, systems only achieved 40%-45% F1. Quite small. 2 GB dataset total. Relevant for autonomous vehicles?

Microcontroller implementation

Audio processing

MFCC Feature extration

KWS Runs on ARM Cortex M(4F). Uses CMSIS for FFT. Clear code struture. Some things, like filterbank, can be precomputed in Python? Apache 2.0
libmfcc. Takes FFT spectrum as input. MIT.
Fixed-point is challenging. A naive approach to fixed-point FFT causes noise to go up a lot, and classification ability is drastically reduced. Optimized implementation proposed in Accuracy of MFCC-Based Speaker Recognition in Series 60 Device

FFT on microcontroller

STM32F103 (Cortex M3 at 72MHz) can do 1024 point FFT in 3ms using CMSIS, Q15/Q31 fixed point. radix-4 FFT. STM32F091 (Cortex M0 at 48Mhz) takes 20 ms. STM32 DSP. Using software-emulated floating point for FFT on Cortex M4 is 10x slower than the FPU unit. M4F is 3-4 times as energy efficient as the M3 (when using floats?).
EMF32 DSP. CMSIS FFT is about 3-4x faster than a generic KissFFT-based version.
Teensy 3.2 was able to do approx 400 ops/sec (3ms) on 512 point FFT with generic version, using int32.2 OpenAudio Benchmarking FFT.
FFT on ARM-Based Low-Power Microcontrollers found that CMSIS FFT with Q31 had slightly less error than with F32.
esp32-fft. 1024 lenght float32 FFT in 1ms on ESP32.

Goertzel filter

embedded.com The Goertzel Algorithm, example code in C++.
embedded.com Single tone detection with Goertzel. Example code in C++
Efficiently detecting a frequency using a Goertzel filter, several implementation variants in C.
Matched Filter Design The Goertzel algorithm is advantageous compared to the FFT when M < 5/6 log_2(N), with DFT length N and number of desired pins M. N=1024, M=8.
Overlap Add STFT implementation of linear filters Faster than convolution in time domain for FIR filters with n>64 taps, which can happen in audio without noticable delay
https://stackoverflow.com/questions/11579367/implementation-of-goertzel-algorithm-in-c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

audio.md

audio.md

Datasets

To be reviewed.

Not relevant

Relevant but lacking

Relevant

TUT Rare Sound Events 2017

YorNoise

DCASE2018 Task 2, General Purpose Audio Tagging

DCASE2018 Task3, Bird Audio Detection.

DCASE2018 Task4

TUT Urban Acoustic Scenes 2018

TUT Acoustic Scenes 2017. Used for DCASE2017 Task 1.

DCASE2017 Task 4, Large Scale Sound Event detection

TUT Sound Events 2017

Microcontroller implementation

Audio processing

Files

audio.md

Latest commit

History

audio.md

File metadata and controls

Datasets

To be reviewed.

Not relevant

Relevant but lacking

Relevant

TUT Rare Sound Events 2017

YorNoise

DCASE2018 Task 2, General Purpose Audio Tagging

DCASE2018 Task3, Bird Audio Detection.

DCASE2018 Task4

TUT Urban Acoustic Scenes 2018

TUT Acoustic Scenes 2017. Used for DCASE2017 Task 1.

DCASE2017 Task 4, Large Scale Sound Event detection

TUT Sound Events 2017

Microcontroller implementation

Audio processing