Content
pip install -r requirements.txt
Download the speech_commands_v0.02 dataset from Warden P. (2018) and unpack it into the Dataset folder.
Dataset
└── data-speech_commands_v0.02
    ├── _background_noise_
    ├── backward
    ...
    └── zero
Warden, P.: Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition (April 2018). https://arxiv.org/abs/1804.03209
We split the dataset into a training set and a test set at a ratio of 80:20.
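As a rough illustration of an 80:20 split, the following sketch shuffles a list of file names with a fixed seed and cuts it at 80 %. The function name `split_dataset`, the seed, and the file names are hypothetical; the notebook may split differently (for example per class).

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle items reproducibly and split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for the real audio clips.
files = [f"clip_{i:03d}.wav" for i in range(100)]
train_files, test_files = split_dataset(files)
print(len(train_files), len(test_files))  # 80 20
```

Using a fixed seed keeps the split identical between runs, so results from different models stay comparable.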
To train the model, we use mel spectrograms instead of raw audio for better feature detection.
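To give an idea of what the mel scale does, here is a small standard-library sketch that places band centers evenly on the mel scale. It uses the common HTK-style formula; librosa defaults to a different (Slaney) variant, so the numbers are illustrative only. The band count and frequency range are assumptions, not the project's settings.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_mels, f_min, f_max):
    """Return n_mels center frequencies in Hz, evenly spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]

# Assumed values: 8 bands between 0 Hz and 8 kHz.
centers = mel_band_centers(n_mels=8, f_min=0.0, f_max=8000.0)
print([round(c) for c in centers])
```

The centers lie close together at low frequencies and spread out at high frequencies, which gives the network finer resolution where speech carries most of its information.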
Run the Jupyter notebook train.ipynb. The "Params" section at the top of the notebook lets you change settings such as the model type and the optimizer type. After the notebook has finished, the folder Weights contains the trained weights (.pth) and the model with the trained weights in ONNX format.
We compared the models VGG19BN and ResNet34 as well as the optimizers SGD and Adam. All plots are based on the test dataset.
VGG19 + Adam-Optimizer (best result)
Download pretrained models from here.
[Verified for 2019.4.1f - Windows only]
- See our video tutorial Unicornn - HowTo for a quick 10-minute introduction
- Contains the same information as the following Readme.md
- Start PlayMode and wait until the monitoring shows "Python running" and the circle is green
- A background python process is initiated (see requirements.txt)
- librosa
- numpy
- datetime
- pylab
- PIL
- numba 0.48
- You can activate the console window by selecting useShell [x] in GameManager -> PythonInterface.cs to see the process output
- Process is terminated automatically after leaving PlayMode
- Start voice recording by pressing Start Button
- Say two words with a pause of about 1 sec of silence in between
- Check your mic level with the VU meter on the right side after the first recording
- Check the threshold if there is background noise (GameManager -> MicrophoneInput.cs)
- Possible words, e.g.: Zero [...] Left
Objects | Actions
---|---
Zero | Forward
One | Backward
Two | Left
Three | Right
Four | Up
Five | Down
Six |
Seven |
Eight |
Nine |
- Stop recording with Stop Button and wait
- Word splitting and processing is done automatically
- You can see the detected words and their probability next to our Unicornn
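The translation from two detected words to a command can be sketched as follows. This is a Python illustration, not the logic of SpeechCommands.cs: the word order (object first, then action) is inferred from the example "Zero [...] Left", and the `min_probability` cutoff is a hypothetical parameter.

```python
OBJECTS = {"zero", "one", "two", "three", "four",
           "five", "six", "seven", "eight", "nine"}
ACTIONS = {"forward", "backward", "left", "right", "up", "down"}

def interpret(predictions, min_probability=0.5):
    """Turn two (word, probability) predictions into an (object, action) pair.

    Returns None if the words are not one object followed by one action,
    or if either prediction falls below min_probability (hypothetical cutoff).
    """
    if len(predictions) != 2:
        return None
    (first, p_first), (second, p_second) = predictions
    first, second = first.lower(), second.lower()
    if min(p_first, p_second) < min_probability:
        return None
    if first in OBJECTS and second in ACTIONS:
        return first, second
    return None

print(interpret([("Zero", 0.97), ("Left", 0.91)]))  # ('zero', 'left')
print(interpret([("Left", 0.97), ("Zero", 0.91)]))  # None
```

Rejecting low-probability or out-of-order pairs avoids triggering an action on a misheard word.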
* | Unicornn
  - | GameManager
    - | SpeechCommands.cs: translates the prediction into an action in the scene
    - | MicrophoneInput.cs: processes the microphone input, slices words, applies the threshold
    - | PythonInterface.cs: runs the background process (librosa) that creates the spectrograms
  - | Agent
    - | Agent.cs: takes the .onnx file as the model and the spectrograms of the sliced words as input, finds the prediction
* | SceneStuff
  - | UI elements, buttons and visual elements
- Barracuda can be installed via the Unity Package Manager. For further information, see: https://docs.unity3d.com/Packages/com.unity.barracuda@0.7/manual/index.html
- Project was tested with Barracuda 1.0.0
- You can switch between our trained models by using the button in the lower left corner
- If you want to use your own models, drag them to the corresponding field in Agent.cs
- By default we chose the model with the best results to start with (VGG + Adam)
- By default we chose a very low threshold to detect silence between words
- In a louder environment, a "silent" moment may still be above our threshold
- In that case, increase GameManager -> MicrophoneInput -> Threshold
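The effect of the threshold can be illustrated with a small Python sketch of threshold-based word slicing. It is not a port of MicrophoneInput.cs; the function `slice_words`, the `min_silence` parameter, and the sample values are assumptions made for the example.

```python
def slice_words(samples, threshold, min_silence):
    """Split samples into (start, end) index ranges separated by silence.

    A sample counts as silent when its absolute value is below threshold.
    A word ends once min_silence consecutive silent samples have followed it.
    """
    segments = []
    start = None      # index where the current word began
    last_loud = None  # index of the most recent non-silent sample
    for i, value in enumerate(samples):
        if abs(value) >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_silence:
            segments.append((start, last_loud + 1))
            start = None
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments

# Two "words" with low-level background noise (0.02) in between.
signal = [0.02] * 5 + [0.5, -0.6, 0.4] + [0.02] * 6 + [0.7, -0.3] + [0.02] * 4
print(slice_words(signal, threshold=0.1, min_silence=4))   # [(5, 8), (14, 16)]
print(slice_words(signal, threshold=0.01, min_silence=4))  # [(0, 20)]
```

With the threshold below the noise level, the pause is never recognized as silence and both words merge into one segment, which is the situation a larger threshold fixes.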
- We process the audio input in Unity3D, save the sliced float arrays as .wav files and use librosa to generate mel spectrograms
- The background Python process watches the project folder for the expected file names and processes the files if they exist
- After processing, the process deletes the .wav files for the next iteration
- The script writes its own process ID (PID) to a text file because it is started via cmd.exe
- To terminate all processes, we need the process IDs of all child processes
- By checking whether the process has exited, we can monitor the status of our background process
process.HasExited
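The background-process behaviour described above (write the PID to a file, pick up .wav files, delete them after processing) could look roughly like this. It is a minimal sketch using only the standard library; the file names, the `handle` callback, and the function names are hypothetical, and the spectrogram step is left as a placeholder.

```python
import os
import tempfile
from pathlib import Path

def write_pid_file(pid_path):
    """Store this process's ID so a parent can terminate it later."""
    Path(pid_path).write_text(str(os.getpid()))

def process_pending(folder, handle):
    """Process and then delete every .wav file currently in folder.

    handle is a caller-supplied function, e.g. one that builds a spectrogram.
    Returns the names of the processed files.
    """
    done = []
    for wav in sorted(Path(folder).glob("*.wav")):
        handle(wav)
        wav.unlink()  # remove the input so the next iteration starts clean
        done.append(wav.name)
    return done

# Demonstration in a temporary folder with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "word_0.wav").write_bytes(b"")
    (Path(tmp) / "word_1.wav").write_bytes(b"")
    write_pid_file(Path(tmp) / "pid.txt")
    stored_pid = (Path(tmp) / "pid.txt").read_text()
    processed = process_pending(tmp, handle=lambda path: None)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(processed, remaining)
```

A long-running worker would call `process_pending` in a loop with a short sleep between passes; the demonstration runs a single pass so that it terminates.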