The tutorial consists of three main steps:
graph LR;
A[Data<br>preparation] --> B[GMM<br>training];
B --> C[DNN<br>training];
All three are accomplished through stages in the script run.sh. Data
preparation takes place in stage 1, GMM training in stages 2 through 8, and
DNN training in stage 9.
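If you only need to re-run part of the pipeline, the usual Kaldi convention is to resume from a given stage. The calls below are a sketch that assumes run.sh exposes a --stage option, as most Kaldi recipes do; check the script's header if the flag differs:

$ ./run.sh               # run everything: data prep, GMM and DNN training
$ ./run.sh --stage 2     # skip data preparation, resume at GMM training
$ ./run.sh --stage 9     # jump straight to DNN training (GMM stages must be done)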
According to the Kaldi for Dummies tutorial, the directory tree for new projects must follow the structure below:
path/to/kaldi/egs/YOUR_PROJECT_NAME/s5
├── path.sh
├── cmd.sh
├── run.sh
├── mfcc/
├── data/
│   ├── train/
│   │   ├── spkTR_1/
│   │   ├── spkTR_2/
│   │   ├── spkTR_3/
│   │   ├── spkTR_n/
│   │   ├── spk2gender
│   │   ├── wav.scp
│   │   ├── text
│   │   ├── utt2spk
│   │   └── corpus.txt
│   ├── test/
│   │   ├── spkTE_1/
│   │   ├── spkTE_2/
│   │   ├── spkTE_3/
│   │   ├── spkTE_n/
│   │   ├── spk2gender
│   │   ├── wav.scp
│   │   ├── text
│   │   ├── utt2spk
│   │   └── corpus.txt
│   └── local/
│       └── dict/
│           ├── lexicon.txt
│           ├── nonsilence_phones.txt
│           ├── optional_silence.txt
│           ├── silence_phones.txt
│           └── extra_questions.txt
├── utils/
├── steps/
├── exp/
└── conf/
    ├── decode.config
    └── mfcc.conf
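For reference, the data files listed under train/ and test/ follow Kaldi's standard one-entry-per-line formats. The snippet below is only an illustration with made-up IDs, not actual recipe content:

# spk2gender : <speaker-id> <m|f>
spkTR_1 m
# wav.scp    : <utterance-id> <path-to-audio>
spkTR_1_001 /path/to/corpus/spkTR_1/001.wav
# text       : <utterance-id> <transcription>
spkTR_1_001 transcription of the first utterance
# utt2spk    : <utterance-id> <speaker-id>
spkTR_1_001 spkTR_1
# corpus.txt : plain transcriptions, one per line, used to build the language model
transcription of the first utterance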
The script prep_env.sh kicks things off by initializing the directory and
file tree, mostly by creating symbolic links to the mini_librispeech recipe.
The data resources themselves, however, are created by the first stage of run.sh.
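If you would rather set things up by hand, the same effect boils down to a few symbolic links and directories. This is just a sketch of what prep_env.sh roughly does, not its actual contents:

$ cd path/to/kaldi/egs/YOUR_PROJECT_NAME/s5
$ ln -sf ../../mini_librispeech/s5/steps steps   # Kaldi training/alignment scripts
$ ln -sf ../../mini_librispeech/s5/utils utils   # helper scripts (validation, lang prep, ...)
$ mkdir -p data/{train,test,local} conf exp mfcc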
The default data downloaded by the scripts and used during training is the
LapsBenchmark dataset (check the lapsbm16k repo).
When switching to your own dataset, please keep in mind the naming pattern LapsBM follows for files and directories, where 'M' and 'F' in the directory name indicate the speaker's gender.
$ for dir in $(ls | sort -R | head -n 8) ; do tree $dir -C | head -n 6 ; done
LapsBM-M031            LapsBM-F019            LapsBM-M010            LapsBM-F006
├── LapsBM_0601.txt    ├── LapsBM_0361.txt    ├── LapsBM_0181.txt    ├── LapsBM_0101.txt
├── LapsBM_0601.wav    ├── LapsBM_0361.wav    ├── LapsBM_0181.wav    ├── LapsBM_0101.wav
├── LapsBM_0602.txt    ├── LapsBM_0362.txt    ├── LapsBM_0182.txt    ├── LapsBM_0102.txt
├── LapsBM_0602.wav    ├── LapsBM_0362.wav    ├── LapsBM_0182.wav    ├── LapsBM_0102.wav
├── LapsBM_0603.txt    ├── LapsBM_0363.txt    ├── LapsBM_0183.txt    ├── LapsBM_0103.txt

LapsBM-M033            LapsBM-F014            LapsBM-F013            LapsBM-M027
├── LapsBM_0641.txt    ├── LapsBM_0261.txt    ├── LapsBM_0241.txt    ├── LapsBM_0521.txt
├── LapsBM_0641.wav    ├── LapsBM_0261.wav    ├── LapsBM_0241.wav    ├── LapsBM_0521.wav
├── LapsBM_0642.txt    ├── LapsBM_0262.txt    ├── LapsBM_0242.txt    ├── LapsBM_0522.txt
├── LapsBM_0642.wav    ├── LapsBM_0262.wav    ├── LapsBM_0242.wav    ├── LapsBM_0522.wav
├── LapsBM_0643.txt    ├── LapsBM_0263.txt    ├── LapsBM_0243.txt    ├── LapsBM_0523.txt
...                    ...                    ...                    ...
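Since the speaker's gender is encoded right in the directory name (e.g. LapsBM-M031 vs. LapsBM-F019), a file such as spk2gender can be derived from it directly. The loop below is only an illustrative sketch, assuming the corpus lives under a hypothetical $corpus_dir:

$ for spk in $(ls $corpus_dir) ; do
>   case $spk in
>     *-M*) echo "$spk m" ;;   # 'M' in the dirname: male speaker
>     *-F*) echo "$spk f" ;;   # 'F' in the dirname: female speaker
>   esac
> done | sort > data/train/spk2gender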
The recipe downloads a phonetic dictionary from nlp-resources, which was
generated over the 200k most frequent words of the Brazilian Portuguese language.
You should check whether your transcription files contain words that are not in
the dictionary yet. If so, you will need our nlp-generator software to perform
the G2P conversion for the missing words. Java must be installed to run the
generator.
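A quick way to spot such out-of-vocabulary words is to compare the word list of your transcriptions against the first column of the lexicon. The paths below are assumptions and may differ in your setup; take it as a sketch:

$ cut -d' ' -f2- data/train/text | tr ' ' '\n' | sort -u > words_in_corpus.txt
$ awk '{print $1}' data/local/dict/lexicon.txt | sort -u > words_in_dict.txt
$ comm -23 words_in_corpus.txt words_in_dict.txt > missing_words.txt   # feed these to nlp-generator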
A pre-trained 3-gram language model is available in our nlp-resources repo;
it, too, is downloaded automatically.
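Once the ARPA file is in place, Kaldi compiles it into a grammar FST (G.fst) during data preparation. A minimal sketch of that conversion, assuming a lang directory already built by utils/prepare_lang.sh and an LM file named lm.arpa (both names are assumptions here):

$ arpa2fst --disambig-symbol=#0 \
    --read-symbol-table=data/lang/words.txt \
    lm.arpa data/lang/G.fst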
The schematic below shows the pipeline for training an HMM-DNN acoustic model
using Kaldi (for more details, read our paper).
These steps are accomplished by running stages 2 to 8 of run.sh.
Stage 9 of run.sh then calls a script named run_tdnn.sh, which follows the
entire pipeline below.
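For the record, the dispatch at stage 9 typically looks like the snippet below; the variable name and the exact path of run_tdnn.sh inside this recipe are assumptions, so treat it as a sketch:

# excerpt-style sketch of run.sh's last stage (names and paths are assumptions)
if [ $stage -le 9 ]; then
  local/chain/run_tdnn.sh   # TDNN training on top of the GMM stages
fi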
- Kaldi Tutorial by Eleanor Chodroff
- Understanding Kaldi mini librispeech recipe - part I - GMM by Qianhui Wan
- Understanding Kaldi mini librispeech recipe - part II - DNN by Qianhui Wan
train.py is very memory-, I/O-, and CPU-hungry: it took
more than 3 days (~76 h) to train the DNN for 20 epochs on a single NVIDIA GPU
using an audio corpus of approximately 180 hours. Other scripts such as
train_ivector_extractor.sh, on the other hand, are CPU-intensive and take a few
hours to run on a 64-core cluster (~5 h).
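How many jobs and cores those scripts actually use is configured in cmd.sh. A typical local-machine setup looks like the sketch below; swap run.pl for queue.pl (or slurm.pl) if you do have a grid at your disposal:

# cmd.sh -- local execution sketch; resource options are hints that run.pl accepts but ignores
export train_cmd="run.pl --mem 4G"
export decode_cmd="run.pl --mem 4G"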
We recorded an entire training run over the LapsBM corpus via Linux's script
command. You can watch the model being trained in a sort of live manner by
running scriptreplay. Although the original session takes about 1.5 hours, you
can always speed things up by passing a very low value to the -m (maximum
delay) flag:
$ scriptreplay -s doc/kaldi.log -t doc/time.log -m 0.02
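For completeness, a typescript/timing pair like the one above is produced by the script command itself. The exact invocation we used is not documented here, so take the flags below as an assumption:

$ script --timing=doc/time.log doc/kaldi.log   # start recording the terminal session
$ ./run.sh                                     # run the training inside the recorded shell
$ exit                                         # stop recording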
Here's a screenshot of how things look when the script reaches the DNN training
part. Kaldi's nnet3-chain-train program runs as a single process on the GPU,
which speeds things up considerably.
Grupo FalaBrasil (2022) - https://ufpafalabrasil.gitlab.io/
Universidade Federal do Pará (UFPA) - https://portal.ufpa.br/
Cassio Batista - https://cassota.gitlab.io/
Larissa Dias - larissa.engcomp@gmail.com