Audio classification datasets that are useful for practical tasks that can be performed on microcontrollers and small embedded systems.
Relevant tasks:
- Wakeword detection
- Keyword spotting
- Speech Command Recognition
- Noise source identification
- Smart home event detection: fire alarm, baby cry, etc.
Not so relevant:
- (general) Automatic Speech Recognition
- Speaker recognition/identification
- DCASE 2016.
- NOIZEUS: A noisy speech corpus for evaluation of speech enhancement algorithms. 30 sentences corrupted by 8 real-world noises.
- VoxCeleb, 100k utterances for 1251 celebrities. Task: Speaker Recognition.
- Speakers in the Wild. Task: Speaker Recognition.
- Google AudioSet. 2,084,320 human-labeled 10-second sounds, 632 audio event classes. Based on YouTube videos.
- Whale Detection Challenge. https://www.kaggle.com/c/whale-detection-challenge
- Mozilla Common Voice, crowdsourced. Compiled dataset on Kaggle, 500 hours of transcribed sentences. Has speaker demographics. Task: Automatic Speech Recognition. Not something to do on a microcontroller. Could maybe be used for transfer learning for more relevant speech tasks.
- DCASE2018 Task 5. Domestic activities. 10 second segments. 9 classes. From 4 separate microphone arrays (in single room). Each array has 4 microphones
- Hey Snips. https://github.com/snipsco/keyword-spotting-research-datasets Task: Wakeword detection, Vocal Activity Detection. Restricted license terms. Academic/research use only. Must contact via email for download. By Snips, developing Private-by-Design, decentralized, open source voice assistants.
Used for DCASE2017 Task 2. Baby crying, glass breaking, gunshot. 3 classes, but separate binary classifiers encouraged. Part of TUT Acoustic Scenes 2016. Train: 100 hours. Approx. 100 isolated sound examples per class, 500 mixtures (weakly labeled). Task: event detection. Metrics: event-based error rate (onset only, 500 ms collar) and F1 score. Baseline system available: fully-connected DNN on 40-band melspec, 5 frames, F1 0.72. Around 11 other systems submitted, ranging 0.65-0.93 F1 score. http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-rare-sound-event-detection
Relevant as examples of single-function systems, security
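The onset-only, 500 ms collar scoring mentioned above can be illustrated with a simplified sketch. It assumes a single event class (matching the suggested one-binary-classifier-per-class setup) and uses greedy matching, where each reference onset can be matched at most once. The function name `onset_f1` and the example times are hypothetical, and official challenge tooling may match events differently.

```python
def onset_f1(reference_onsets, predicted_onsets, collar=0.5):
    """Event-based F1 for one class, judged on onsets only.

    A prediction counts as a true positive when its onset lies within
    `collar` seconds of a reference onset that is not matched yet.
    """
    ref = sorted(reference_onsets)
    matched = [False] * len(ref)
    tp = 0
    for p in sorted(predicted_onsets):
        for i, r in enumerate(ref):
            if not matched[i] and abs(p - r) <= collar:
                matched[i] = True
                tp += 1
                break
    fp = len(predicted_onsets) - tp  # predictions with no reference nearby
    fn = len(ref) - tp               # reference events that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical onset times in seconds
reference = [1.0, 5.0, 9.0]
predicted = [1.3, 5.8, 9.2, 12.0]
score = onset_f1(reference, predicted)
```

In this example two predictions fall inside the collar, one is too late (5.8 s versus 5.0 s) and one is spurious.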
Dataset from York with "traffic" and "rail" classes. Same structure as the Urbansound8k dataset. 1527 samples of 4 seconds each. Split into 10 folds. https://github.com/fadymedhat/YorNoise
http://dcase.community/challenge2018/task-general-purpose-audio-tagging
Task: Acoustic event tagging. Based on FreeSound data. 41 classes, using the AudioSet ontology. Train: 9.5k samples, of which ~3.7k have manually-verified annotations and ~5.8k have non-verified annotations. Test: ~1.6k manually-verified annotations. Systems reach 0.90-0.95 mAP@3. Baseline: CNN on log melspec, 0.70 mAP@3.
Relevant for: context-aware-computing, smarthome?
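As a reminder of what the mAP@3 figures above measure, here is a minimal sketch. It assumes every clip has exactly one ground-truth label and that the system outputs a ranked list of guesses per clip. The function name and the example labels are hypothetical.

```python
def map_at_3(true_labels, ranked_predictions):
    """Mean average precision at 3 for single-label clips.

    Each clip scores 1/rank if the true label is among the first
    three guesses (rank 1, 2 or 3), otherwise 0.
    """
    total = 0.0
    for truth, ranked in zip(true_labels, ranked_predictions):
        for rank, label in enumerate(ranked[:3], start=1):
            if label == truth:
                total += 1.0 / rank
                break
    return total / len(true_labels)


# Hypothetical labels: first clip correct at rank 1, second at rank 2, third missed
truths = ["bark", "flute", "cough"]
guesses = [
    ["bark", "meow", "cough"],
    ["cello", "flute", "oboe"],
    ["bark", "meow", "flute"],
]
result = map_at_3(truths, guesses)
```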
Binary classification.
Relevant for on-edge pre-processing / efficient data collection.
Event Detection with precise time information. Events from domestic tasks. 10 classes. Subset of Audioset.
Relevant for: smarthome and context-aware-computing
Used in DCASE2018 Task 1.
Task: Acoustic Scene Classification. 10 classes, e.g. airport, shopping_mall, metro_station. About 30 GB of data, around 24 hours of training data. One variant of the dataset has parallel recordings from multiple devices, for testing the mismatched-device case.
Relevant for: context-aware-computing?
Scenes from urban environments. 15 classes. 10 second segments. Baseline system available, 18% F1. Relatively hard: systems achieved 35-55% F1.
Relevant for context-aware-computing?
http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-large-scale-sound-event-detection 17 classes from 2 categories, Warning sounds and Vehicle sounds.
Relevant for autonomous vehicles?
Used for DCASE2017 Task 3, Sound event detection in real life audio
Events related to car/driving. 6 classes. Multiple overlapping events present, both in training and testing. Hard: systems only achieved 40%-45% F1. Quite small, 2 GB dataset in total. Relevant for autonomous vehicles?
MFCC Feature extraction
- KWS. Runs on ARM Cortex M(4F). Uses CMSIS for FFT. Clear code structure. Some things, like the filterbank, can be precomputed in Python? Apache 2.0
- libmfcc. Takes FFT spectrum as input. MIT.
- Fixed-point is challenging. A naive approach to fixed-point FFT causes noise to go up a lot, and classification ability is drastically reduced. Optimized implementation proposed in Accuracy of MFCC-Based Speaker Recognition in Series 60 Device
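To illustrate the idea of precomputing the filterbank in Python, here is a minimal sketch that builds triangular mel filters and prints them as a C array for inclusion in firmware. The parameters (40 bands, 512-point FFT, 16 kHz), the function names and the C array name are assumptions for illustration, and the mel formula used is one common variant. This is not the method of any of the projects listed above.

```python
import math


def hz_to_mel(hz):
    # One common mel-scale formula; other variants exist
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate, fmin=0.0, fmax=None):
    """Triangular mel filters as rows of weights over n_fft//2 + 1 FFT bins."""
    if fmax is None:
        fmax = sample_rate / 2.0
    n_bins = n_fft // 2 + 1
    mel_lo, mel_hi = hz_to_mel(fmin), hz_to_mel(fmax)
    # n_mels + 2 edge frequencies, evenly spaced on the mel scale
    edges = [mel_to_hz(mel_lo + (mel_hi - mel_lo) * i / (n_mels + 1))
             for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (center - lo)
            falling = (hi - f) / (hi - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def to_c_array(bank, name="mel_filterbank"):
    """Format the filterbank as a C source snippet (hypothetical array name)."""
    rows = ",\n".join(
        "  {" + ", ".join(f"{w:.6f}f" for w in row) + "}" for row in bank
    )
    return (f"static const float {name}[{len(bank)}][{len(bank[0])}] = {{\n"
            f"{rows}\n}};\n")


# Assumed settings: 40 bands, 512-point FFT, 16 kHz sample rate
bank = mel_filterbank(n_mels=40, n_fft=512, sample_rate=16000)
c_source = to_c_array(bank)
```

A dense table like this wastes flash, since most weights are zero. Storing only the non-zero span of each filter would be the natural next step on a small device.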
FFT on microcontroller
- STM32F103 (Cortex M3 at 72 MHz) can do a 1024 point FFT in 3 ms using CMSIS, Q15/Q31 fixed point, radix-4 FFT. STM32F091 (Cortex M0 at 48 MHz) takes 20 ms. STM32 DSP. Using software-emulated floating point for FFT on Cortex M4 is 10x slower than using the FPU. M4F is 3-4 times as energy efficient as the M3 (when using floats?).
- EFM32 DSP. CMSIS FFT is about 3-4x faster than a generic KissFFT-based version.
- Teensy 3.2 was able to do approx 400 ops/sec (3 ms) on a 512 point FFT with a generic version, using int32. OpenAudio Benchmarking FFT.
- FFT on ARM-Based Low-Power Microcontrollers found that CMSIS FFT with Q31 had slightly less error than with F32.
- esp32-fft. 1024-length float32 FFT in 1 ms on ESP32.
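For readers unfamiliar with the Q15 format mentioned above: it stores values in the range [-1, 1) as signed 16-bit integers scaled by 2^15. The sketch below models the conversion and a rounded, saturating multiply in plain Python. The function names are made up for illustration and are not the CMSIS API.

```python
Q15_ONE = 1 << 15            # scale factor: 1.0 maps to 32768
Q15_MIN, Q15_MAX = -32768, 32767


def saturate_q15(v):
    """Clamp an integer to the signed 16-bit range."""
    return max(Q15_MIN, min(Q15_MAX, v))


def float_to_q15(x):
    """Convert a float in [-1, 1) to Q15, saturating out-of-range input."""
    return saturate_q15(int(round(x * Q15_ONE)))


def q15_to_float(q):
    return q / Q15_ONE


def q15_mul(a, b):
    """Multiply two Q15 values: wide product, round, shift back by 15 bits."""
    product = a * b                      # fits in 32 bits on real hardware
    return saturate_q15((product + (1 << 14)) >> 15)


half = float_to_q15(0.5)
quarter = q15_mul(half, half)
roundtrip_error = abs(q15_to_float(float_to_q15(0.3)) - 0.3)
```

Each conversion or multiply can lose up to half a least significant bit. That is the kind of rounding error that accumulates across FFT stages when the scaling is not handled carefully.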
Goertzel filter
- embedded.com The Goertzel Algorithm, example code in C++.
- embedded.com Single tone detection with Goertzel. Example code in C++
- Efficiently detecting a frequency using a Goertzel filter, several implementation variants in C.
- Matched Filter Design
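For orientation, a minimal floating-point sketch of the Goertzel recurrence is shown here. It evaluates the power of a single DFT bin without computing the full spectrum. The function name, the test tone and the sample rate are assumptions for illustration. The linked articles cover C and C++ variants better suited to embedded targets.

```python
import math


def goertzel_power(samples, target_hz, sample_rate):
    """Squared magnitude of the DFT bin nearest to target_hz."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)   # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2      # second-order recurrence
        s_prev2 = s_prev
        s_prev = s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


# Assumed example: 1 kHz tone sampled at 8 kHz, 256 samples
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * i / sample_rate) for i in range(256)]
power_on_target = goertzel_power(tone, 1000.0, sample_rate)
power_off_target = goertzel_power(tone, 2000.0, sample_rate)
```

Only one multiply and two additions are needed per sample and per bin, and only two state variables need to be kept. This is why the filter is attractive when just a handful of frequencies matter.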
The Goertzel algorithm is advantageous compared to the FFT when M < 5/6 log_2(N), with DFT length N and number of desired bins M. For example, with N=1024 the threshold is 5/6 · 10 ≈ 8.3, so Goertzel is preferable for up to M=8 bins.
- Overlap Add STFT implementation of linear filters. Faster than convolution in the time domain for FIR filters with n>64 taps, which can happen in audio without noticeable delay.
- https://stackoverflow.com/questions/11579367/implementation-of-goertzel-algorithm-in-c
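The Goertzel-versus-FFT rule of thumb given above, M < 5/6 log_2(N), can be checked for other DFT lengths with a few lines of Python. The helper names are hypothetical, and the rule is only a rough operation-count guide, not a measured benchmark.

```python
import math


def goertzel_threshold(dft_length):
    """Right-hand side of the rule of thumb: 5/6 * log2(N)."""
    return (5.0 / 6.0) * math.log2(dft_length)


def goertzel_preferred(n_bins_wanted, dft_length):
    """True when M < 5/6 * log2(N), i.e. Goertzel is expected to be cheaper."""
    return n_bins_wanted < goertzel_threshold(dft_length)


# Largest bin count for which Goertzel is still preferred, per DFT length
for n in (256, 512, 1024, 2048):
    limit = goertzel_threshold(n)
    print(f"N={n}: Goertzel preferred for M < {limit:.2f}")
```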