
# Tutorial for training models with Kaldi

The tutorial consists mainly of three big steps:

```mermaid
graph LR;
    A[Data<br>preparation] --> B[GMM<br>training];
    B                      --> C[DNN<br>training];
```

All three are accomplished through stages of the script run.sh: data preparation happens in stage one, GMM training in stages two through eight, and DNN training in stage nine.
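Kaldi recipe scripts conventionally accept a `--stage` flag (handled by `utils/parse_options.sh`), so a run can be resumed at any step instead of starting over. The calls below are just a sketch assuming run.sh follows that convention; check the script itself for the flags it actually supports.

```bash
# Run everything from scratch
./run.sh

# Assuming run.sh honors the usual --stage option, resume from GMM training
# (stage 2) once data preparation has already succeeded:
./run.sh --stage 2

# Likewise, jump straight to DNN training (stage 9) if the GMM models are done:
./run.sh --stage 9
```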

## Data preparation

According to Kaldi's tutorial for dummies, the directory tree for new projects must follow the structure below:

```
           path/to/kaldi/egs/YOUR_PROJECT_NAME/s5
                                 ├─ path.sh
                                 ├─ cmd.sh
                                 ├─ run.sh
                                 │
  .--------------.-------.-------:------.-------.
  |              |       |       |      |       |
 mfcc/         data/   utils/  steps/  exp/   conf/
                 |                              ├─ decode.config
  .--------------:--------------.               └─ mfcc.conf
  │              │              │
train/          test/         local/
  ├─ spkTR_1/    ├─ spkTE_1/    └─ dict/
  ├─ spkTR_2/    ├─ spkTE_2/        ├─ lexicon.txt
  ├─ spkTR_3/    ├─ spkTE_3/        ├─ non_silence_phones.txt
  ├─ spkTR_n/    ├─ spkTE_n/        ├─ optional_silence.txt
  │              │                  ├─ silence_phones.txt
  ├─ spk2gender  ├─ spk2gender      └─ extra_questions.txt
  ├─ wav.scp     ├─ wav.scp
  ├─ text        ├─ text
  ├─ utt2spk     ├─ utt2spk
  └─ corpus.txt  └─ corpus.txt
```

The script prep_env.sh kicks things off by initializing the directory and file tree, mostly by creating symbolic links to the mini_librispeech recipe. The data resources themselves, however, are created by the first stage of run.sh.
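For reference, the files under data/train/ and data/test/ follow the standard Kaldi data-preparation formats (one entry per line, sorted by the first field). The identifiers and the sentence below are made up purely to illustrate the layout:

```
# spk2gender: <speaker-id> <gender m|f>
spkTR_1 m
spkTR_2 f

# wav.scp: <utterance-id> <full path to the .wav file>
spkTR_1_001 /path/to/corpus/LapsBM-M001/LapsBM_0001.wav

# text: <utterance-id> <word-level transcription>
spkTR_1_001 bom dia a todos

# utt2spk: <utterance-id> <speaker-id>
spkTR_1_001 spkTR_1

# corpus.txt: every transcription in the corpus, one sentence per line
bom dia a todos
```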

### Audio corpus

The default data downloaded by the scripts and used during training is the LapsBenchmark dataset (check the lapsbm16k repo).

When you switch to your own dataset, please keep in mind the naming pattern LapsBM follows for files and directories, where 'M' and 'F' in the directory name correspond to the speaker's gender.

```
$ for dir in $(ls | sort -R | head -n 8) ; do tree $dir -C | head -n 6 ; done

LapsBM-M031              LapsBM-F019               LapsBM-M010              LapsBM-F006
├── LapsBM_0601.txt      ├── LapsBM_0361.txt       ├── LapsBM_0181.txt      ├── LapsBM_0101.txt
├── LapsBM_0601.wav      ├── LapsBM_0361.wav       ├── LapsBM_0181.wav      ├── LapsBM_0101.wav
├── LapsBM_0602.txt      ├── LapsBM_0362.txt       ├── LapsBM_0182.txt      ├── LapsBM_0102.txt
├── LapsBM_0602.wav      ├── LapsBM_0362.wav       ├── LapsBM_0182.wav      ├── LapsBM_0102.wav
├── LapsBM_0603.txt      ├── LapsBM_0363.txt       ├── LapsBM_0183.txt      ├── LapsBM_0103.txt

LapsBM-M033              LapsBM-F014               LapsBM-F013              LapsBM-M027
├── LapsBM_0641.txt      ├── LapsBM_0261.txt       ├── LapsBM_0241.txt      ├── LapsBM_0521.txt
├── LapsBM_0641.wav      ├── LapsBM_0261.wav       ├── LapsBM_0241.wav      ├── LapsBM_0521.wav
├── LapsBM_0642.txt      ├── LapsBM_0262.txt       ├── LapsBM_0242.txt      ├── LapsBM_0522.txt
├── LapsBM_0642.wav      ├── LapsBM_0262.wav       ├── LapsBM_0242.wav      ├── LapsBM_0522.wav
├── LapsBM_0643.txt      ├── LapsBM_0263.txt       ├── LapsBM_0243.txt      ├── LapsBM_0523.txt
...                      ...                       ...                      ...
```

⚠️ This corpus contains less than an hour of recorded speech and is used just to demonstrate that the scripts run correctly (and because it is faster to train on). Therefore, it will NOT give you good results: you probably need hundreds or even thousands of hours of recorded data for a recognizer to work reliably.

### Dictionary (lexicon)

The recipe downloads a phonetic dictionary from nlp-resources, which was generated over the 200k most frequent words of the Brazilian Portuguese language. You should check whether your transcription files contain words that are not in the dictionary yet. If so, you will need our nlp-generator software to perform the G2P conversion for the missing words. Java must be installed to run the generator.
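A quick way to spot such out-of-vocabulary words is to diff the vocabulary of your transcripts against the lexicon. The sketch below assumes the transcripts and lexicon sit under the usual data/train, data/test and data/local/dict locations shown earlier; adjust the paths to your setup.

```bash
# List words that appear in the transcripts but not in the lexicon (OOV words).
# Column 1 of 'text' is the utterance id; column 1 of 'lexicon.txt' is the word.
cut -d' ' -f2- data/train/text data/test/text | tr ' ' '\n' | sort -u > corpus_words.txt
awk '{print $1}' data/local/dict/lexicon.txt | sort -u > lexicon_words.txt
comm -23 corpus_words.txt lexicon_words.txt > oov_words.txt   # feed these to the G2P
wc -l oov_words.txt
```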

### Language model

A pre-trained 3-gram language model is available in our nlp-resources repo. It is also downloaded automatically.
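During data preparation the ARPA model gets compiled into the grammar FST (G.fst) that the decoding graph is built from. If you ever want to swap in your own language model, the usual Kaldi tool for that step is arpa2fst; the file names below are placeholders, so check the recipe for the actual paths it uses.

```bash
# Compile an ARPA n-gram LM into G.fst using the word table of the lang dir
# (lm.arpa.gz and data/lang are assumed paths, not necessarily the recipe's).
gunzip -c lm.arpa.gz | \
  arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt - data/lang/G.fst
```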

## GMM model training

The schematic below shows the pipeline for training an HMM-DNN acoustic model using Kaldi (for more details, read our paper). These steps are accomplished by running stages 2 to 8 in run.sh.

*(schematic: Kaldi acoustic model training pipeline)*
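In practice, stages 2 to 8 follow the classic Kaldi progression of feature extraction followed by increasingly refined GMM systems, each one realigning the data for the next. The sketch below only names the standard steps/ scripts involved; the exact arguments, numbers of leaves/Gaussians and stage boundaries are the ones in run.sh.

```bash
# Rough shape of the GMM stages (directory names and model sizes are illustrative):
steps/make_mfcc.sh --nj 4 --cmd "$train_cmd" data/train               # MFCC features
steps/compute_cmvn_stats.sh data/train                                # CMVN stats

steps/train_mono.sh --nj 4 --cmd "$train_cmd" data/train data/lang exp/mono
steps/align_si.sh   --nj 4 --cmd "$train_cmd" data/train data/lang exp/mono  exp/mono_ali

steps/train_deltas.sh   --cmd "$train_cmd" 2000 10000 data/train data/lang exp/mono_ali  exp/tri1
steps/align_si.sh   --nj 4 --cmd "$train_cmd" data/train data/lang exp/tri1  exp/tri1_ali

steps/train_lda_mllt.sh --cmd "$train_cmd" 2500 15000 data/train data/lang exp/tri1_ali  exp/tri2b
steps/align_si.sh   --nj 4 --cmd "$train_cmd" data/train data/lang exp/tri2b exp/tri2b_ali

steps/train_sat.sh      --cmd "$train_cmd" 2500 15000 data/train data/lang exp/tri2b_ali exp/tri3b
steps/align_fmllr.sh --nj 4 --cmd "$train_cmd" data/train data/lang exp/tri3b exp/tri3b_ali
```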

## DNN model training

Stage 9 in run.sh calls a script named run_tdnn.sh, which follows the entire pipeline below.
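As a rough sketch, a chain/TDNN stage in a mini_librispeech-style recipe usually boils down to a single call like the one below, with the heavy lifting (i-vector extraction, lattice generation, tree building and nnet3 training) happening inside run_tdnn.sh. The path and flag are assumptions borrowed from that recipe; see run.sh for the real invocation.

```bash
# Hypothetical shape of stage 9 (check run.sh for the actual call and options):
if [ $stage -le 9 ]; then
  local/chain/run_tdnn.sh --stage 0
fi
```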

## References

⚠️ Beware that train.py is very memory-, IO-, and CPU-hungry: it took more than 3 days (~76 h) to train the DNN for 20 epochs on a single NVIDIA GPU using an audio corpus of approximately 180 hours. Other scripts such as train_ivector_extractor.sh, on the other hand, are CPU-intensive and take a few hours to run on a 64-core cluster (~5 h).

## Log

We recorded an entire training run over the LapsBM corpus via Linux's script command. You can watch us training the model in a sort of live manner by running scriptreplay. Although the original run takes about 1.5 hours, you can always speed things up by passing a very low value to the -m (maximum delay) flag:

```bash
$ scriptreplay -s doc/kaldi.log -t doc/time.log -m 0.02
```

Here's a screenshot of how things look when the script reaches the DNN training part. Kaldi's nnet3-chain-train program runs as a single job on the GPU, which speeds things up considerably.
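If you want to confirm the GPU is actually being used while that part runs, watching nvidia-smi in a second terminal is usually enough. The log path below is only an assumption about where the recipe writes its nnet3 training logs; adjust it to your exp directory.

```bash
# Refresh GPU utilization and memory every second while the DNN trains
watch -n 1 nvidia-smi

# Follow the most recent nnet3 training logs (path assumed, not taken from run.sh)
tail -f exp/chain/tdnn*/log/train.*.log
```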

## FalaBrasil UFPA

- Grupo FalaBrasil (2022) - https://ufpafalabrasil.gitlab.io/
- Universidade Federal do Pará (UFPA) - https://portal.ufpa.br/
- Cassio Batista - https://cassota.gitlab.io/
- Larissa Dias - larissa.engcomp@gmail.com