Skip to content

Latest commit

 

History

History
186 lines (139 loc) · 7.55 KB

audio.md

File metadata and controls

186 lines (139 loc) · 7.55 KB

Datasets

Audio Classification datasets that are useful for practical tasks that can be perform on microcontrollers and small embedded systems.

Relevant tasks:

  • Wakeword detection
  • Keyword spotting
  • Speech Command Recognition
  • Noise source identification
  • Smart home event detection. Firealarm,babycry etc

Not so relevant:

  • (general) Automatic Speech Recognition
  • Speaker recognition/identification

To be reviewed.

  • DCASE 2016.

Not relevant

Relevant but lacking

Relevant

TUT Rare Sound Events 2017

Used for DCASE2017 Task 2. Baby crying, Glass Breaking, Gunshot. 3 classes, but separate binary classifiers encouraged. Part of TUT Acoustic Scenes 2016. Train 100 hours. Approx 100 sound examples per class isolated, 500 mixtures (weakly labeled). Event detection. Event-based error rate. Onset only. 500 ms collar. Also F1 score. Baseline system available. FC DNN. 40 bands melspec, 5 frames. F1 0.72 Around 11 other systems submitted. Ranging 0.65-0.93 F1 score. http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-rare-sound-event-detection

Relevant as examples of single-function systems, security

YorNoise

Dataset from York with "traffic" and "rail" classes. Same structure as Urbansoun8k dataset. 1527 samples a 4 seconds. Split into 10 folds. https://github.com/fadymedhat/YorNoise

DCASE2018 Task 2, General Purpose Audio Tagging

http://dcase.community/challenge2018/task-general-purpose-audio-tagging

Task: Acoustic event tagging. Based on FreeSound data. 41 classes. Using AudioNet ontology 9.5k samples train, ~3.7k manually-verified annotations and ~5.8k non-verified annotations. Test ~1.6k manually-verified annotations. Systems reach 0.90-0.95 mAP@3. Baseline CNN on log melspec. 0.70 mAP@3

Relevant for: context-aware-computing, smarthome?

DCASE2018 Task3, Bird Audio Detection.

Binary classification.

Relevant for on-edge pre-processing / efficient data collection.

DCASE2018 Task4

Event Detection with precise time information. Events from domestic tasks. 10 classes. Subset of Audioset.

Relevant for: smarthome and context-aware-computing

TUT Urban Acoustic Scenes 2018

Used in DCASE2018 Task 1.

Task: Acoustic Scene Classification. 10 classes. airport,shopping_mall,metro_station About 30GB of data, around 24 hours training. One variant dataset has parallel recording with multiple devices, for testing mismatched case.

Relevant for: context-aware-computing?

TUT Acoustic Scenes 2017. Used for DCASE2017 Task 1.

Scenes from urban environments. 15 classes. 10 second segments. Baseline system available. 18% F1 Relatively hard, systems achieved 35-55% F1.

Relevant for context-aware-computing?

DCASE2017 Task 4, Large Scale Sound Event detection

http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-large-scale-sound-event-detection 17 classes from 2 categories, Warning sounds and Vehicle sounds.

Relevant for autonomous vehicles?

TUT Sound Events 2017

Used for DCASE2017 Task 3, Sound event detection in real life audio

Events related to car/driving. 6 classes. Multiple overlapping events present. Both in training and testing. Hard, systems only achieved 40%-45% F1. Quite small. 2 GB dataset total. Relevant for autonomous vehicles?

Microcontroller implementation

Audio processing

MFCC Feature extration

  • KWS Runs on ARM Cortex M(4F). Uses CMSIS for FFT. Clear code struture. Some things, like filterbank, can be precomputed in Python? Apache 2.0
  • libmfcc. Takes FFT spectrum as input. MIT.
  • Fixed-point is challenging. A naive approach to fixed-point FFT causes noise to go up a lot, and classification ability is drastically reduced. Optimized implementation proposed in Accuracy of MFCC-Based Speaker Recognition in Series 60 Device

FFT on microcontroller

  • STM32F103 (Cortex M3 at 72MHz) can do 1024 point FFT in 3ms using CMSIS, Q15/Q31 fixed point. radix-4 FFT. STM32F091 (Cortex M0 at 48Mhz) takes 20 ms. STM32 DSP. Using software-emulated floating point for FFT on Cortex M4 is 10x slower than the FPU unit. M4F is 3-4 times as energy efficient as the M3 (when using floats?).
  • EMF32 DSP. CMSIS FFT is about 3-4x faster than a generic KissFFT-based version.
  • Teensy 3.2 was able to do approx 400 ops/sec (3ms) on 512 point FFT with generic version, using int32.2 OpenAudio Benchmarking FFT.
  • FFT on ARM-Based Low-Power Microcontrollers found that CMSIS FFT with Q31 had slightly less error than with F32.
  • esp32-fft. 1024 lenght float32 FFT in 1ms on ESP32.

Goertzel filter