Automated speech recognition in Rhasspy voice assistant with Kaldi.
- Python 3.7
- Kaldi
  - Expects $KALDI_DIR in environment
- Opengrm
  - Expects ngram* in $PATH
- Phonetisaurus
  - Expects phonetisaurus-apply in $PATH
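The requirements above can be checked before installing. The following is a minimal sketch, not part of rhasspy-asr-kaldi: the function name is hypothetical, and checking ngramcount and ngrammake as representatives of the ngram* tools is an assumption.

```python
import os
import shutil

# Assumption: these two Opengrm tools stand in for the whole "ngram*" family.
REQUIRED_PROGRAMS = ["ngramcount", "ngrammake", "phonetisaurus-apply"]


def missing_prerequisites(environ=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    environ = os.environ if environ is None else environ
    problems = []
    # Kaldi is located through the KALDI_DIR environment variable.
    if not environ.get("KALDI_DIR"):
        problems.append("KALDI_DIR is not set in the environment")
    # Opengrm and Phonetisaurus programs are looked up in PATH.
    for program in REQUIRED_PROGRAMS:
        if which(program) is None:
            problems.append(f"{program} was not found in PATH")
    return problems
```

Calling missing_prerequisites() with no arguments inspects the current environment. The environ and which parameters exist only so the check can be exercised without the real tools installed.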
See pre-built apps for pre-compiled binaries.
$ git clone https://github.com/rhasspy/rhasspy-asr-kaldi
$ cd rhasspy-asr-kaldi
$ ./configure
$ make
$ make install
Use python3 -m rhasspyasr_kaldi transcribe <ARGS>
usage: rhasspy-asr-kaldi transcribe [-h] --model-dir MODEL_DIR
[--graph-dir GRAPH_DIR]
[--model-type MODEL_TYPE]
[--frames-in-chunk FRAMES_IN_CHUNK]
[wav_file [wav_file ...]]
positional arguments:
wav_file WAV file(s) to transcribe
optional arguments:
-h, --help show this help message and exit
--model-dir MODEL_DIR
Path to Kaldi model directory (with conf, data)
--graph-dir GRAPH_DIR
Path to Kaldi graph directory (with HCLG.fst)
--model-type MODEL_TYPE
Either nnet3 or gmm (default: nnet3)
--frames-in-chunk FRAMES_IN_CHUNK
Number of frames to process at a time
For nnet3 models, the online2-tcp-nnet3-decode-faster program is used to handle streaming audio. For gmm models, audio is buffered and packaged as a WAV file before being transcribed.
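When calling the transcribe command from another Python program, the argument list can be assembled from the options in the usage text above. This is a sketch under that assumption: transcribe_command is a hypothetical helper, not part of the package, and it only builds the command without running it.

```python
import sys


def transcribe_command(model_dir, wav_files, graph_dir=None, model_type=None,
                       frames_in_chunk=None):
    """Build the argv list for `python3 -m rhasspyasr_kaldi transcribe`."""
    command = [sys.executable, "-m", "rhasspyasr_kaldi", "transcribe",
               "--model-dir", str(model_dir)]
    if graph_dir is not None:
        command += ["--graph-dir", str(graph_dir)]
    if model_type is not None:
        # The usage text only lists these two model types.
        if model_type not in ("nnet3", "gmm"):
            raise ValueError("model_type must be 'nnet3' or 'gmm'")
        command += ["--model-type", model_type]
    if frames_in_chunk is not None:
        command += ["--frames-in-chunk", str(int(frames_in_chunk))]
    # WAV files are positional arguments and go last.
    command += [str(path) for path in wav_files]
    return command
```

The resulting list can be passed to subprocess.run(command, check=True). The output format of the command is not described here, so the sketch does not try to parse it.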
Use python3 -m rhasspyasr_kaldi train <ARGS>
usage: rhasspy-asr-kaldi train [-h] --model-dir MODEL_DIR
[--graph-dir GRAPH_DIR]
[--intent-graph INTENT_GRAPH]
[--dictionary DICTIONARY]
[--dictionary-casing {upper,lower,ignore}]
[--language-model LANGUAGE_MODEL]
--base-dictionary BASE_DICTIONARY
[--g2p-model G2P_MODEL]
[--g2p-casing {upper,lower,ignore}]
optional arguments:
-h, --help show this help message and exit
--model-dir MODEL_DIR
Path to Kaldi model directory (with conf, data)
--graph-dir GRAPH_DIR
Path to Kaldi graph directory (with HCLG.fst)
--intent-graph INTENT_GRAPH
Path to intent graph JSON file (default: stdin)
--dictionary DICTIONARY
Path to write custom pronunciation dictionary
--dictionary-casing {upper,lower,ignore}
Case transformation for dictionary words (training,
default: ignore)
--language-model LANGUAGE_MODEL
Path to write custom language model
--base-dictionary BASE_DICTIONARY
Paths to pronunciation dictionaries
--g2p-model G2P_MODEL
Path to Phonetisaurus grapheme-to-phoneme FST model
--g2p-casing {upper,lower,ignore}
Case transformation for g2p words (training, default:
ignore)
This will generate a custom HCLG.fst from an intent graph created using rhasspy-nlu. Your Kaldi model directory should be laid out like this:
- my_model/ (--model-dir)
  - conf/
    - mfcc_hires.conf
  - data/
    - local/
      - dict/
        - lexicon.txt (copied from --dictionary)
      - lang/
        - lm.arpa.gz (copied from --language-model)
  - graph/ (--graph-dir)
    - HCLG.fst (generated)
  - model/
    - final.mdl
  - phones/
    - extra_questions.txt
    - nonsilence_phones.txt
    - optional_silence.txt
    - silence_phones.txt
  - online/ (nnet3 only)
  - extractor/ (nnet3 only)
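A quick check of this layout can catch a misplaced file before training. The sketch below is hypothetical and makes one assumption explicit: lexicon.txt, lm.arpa.gz and HCLG.fst are treated as outputs (copied or generated), so only the remaining entries are checked.

```python
from pathlib import Path

# Assumption: these files must already exist in the model directory;
# lexicon.txt, lm.arpa.gz and HCLG.fst are copied or generated later.
COMMON_FILES = [
    "conf/mfcc_hires.conf",
    "model/final.mdl",
    "phones/extra_questions.txt",
    "phones/nonsilence_phones.txt",
    "phones/optional_silence.txt",
    "phones/silence_phones.txt",
]
# Directories marked "nnet3 only" in the layout.
NNET3_DIRS = ["online", "extractor"]


def missing_model_paths(model_dir, model_type="nnet3"):
    """Return the relative paths from the layout that are not present."""
    model_dir = Path(model_dir)
    missing = [name for name in COMMON_FILES
               if not (model_dir / name).is_file()]
    if model_type == "nnet3":
        missing += [name + "/" for name in NNET3_DIRS
                    if not (model_dir / name).is_dir()]
    return missing
```

An empty list from missing_model_paths("my_model") means every checked entry is in place.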
When using the train command, you will need to specify the following arguments:
- --intent-graph - path to graph JSON file generated using rhasspy-nlu
- --model-type - either nnet3 or gmm
- --model-dir - path to top-level model directory (my_model in example above)
- --graph-dir - path to directory where HCLG.fst should be written (my_model/graph in example above)
- --base-dictionary - pronunciation dictionary with all words from intent graph (can be used multiple times)
- --dictionary - path to write custom pronunciation dictionary (optional)
- --language-model - path to write custom ARPA language model (optional)
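The train arguments can be assembled the same way when training is driven from a script. This is a sketch only: train_command is a hypothetical helper, and it uses just the flags that appear in the train usage text, repeating --base-dictionary once per dictionary.

```python
import sys


def train_command(model_dir, graph_dir, intent_graph, base_dictionaries,
                  dictionary=None, language_model=None):
    """Build the argv list for `python3 -m rhasspyasr_kaldi train`."""
    if not base_dictionaries:
        raise ValueError("at least one base dictionary is required")
    command = [sys.executable, "-m", "rhasspyasr_kaldi", "train",
               "--model-dir", str(model_dir),
               "--graph-dir", str(graph_dir),
               "--intent-graph", str(intent_graph)]
    # --base-dictionary can be used multiple times.
    for path in base_dictionaries:
        command += ["--base-dictionary", str(path)]
    # Optional output paths for the custom dictionary and language model.
    if dictionary is not None:
        command += ["--dictionary", str(dictionary)]
    if language_model is not None:
        command += ["--language-model", str(language_model)]
    return command
```

For example, train_command("my_model", "my_model/graph", "intent_graph.json", ["base_dictionary.txt"]) yields a list suitable for subprocess.run. The file names in this example are placeholders.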
rhasspy-asr-kaldi depends on the following programs that must be compiled:
- Kaldi
  - Speech to text engine
- Opengrm
  - Create ARPA language models
- Phonetisaurus
  - Guesses pronunciations for unknown words
Make sure you have the necessary dependencies installed:
sudo apt-get install \
build-essential \
libatlas-base-dev libatlas3-base gfortran \
automake autoconf unzip sox libtool subversion \
python3 python \
git zlib1g-dev
Download Kaldi and extract it:
wget -O kaldi-master.tar.gz \
'https://github.com/kaldi-asr/kaldi/archive/master.tar.gz'
tar -xvf kaldi-master.tar.gz
First, build Kaldi's tools:
cd kaldi-master/tools
make
Use make -j 4 if you have multiple CPU cores. This will take a long time.
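If the build is scripted, the job count can be derived from the machine instead of hard-coding 4. A small sketch, with make_command as a hypothetical helper name:

```python
import os


def make_command(target=None, jobs=None):
    """Build a `make` argv list, adding -j when more than one job is used."""
    if jobs is None:
        # os.cpu_count() may return None, so fall back to a single job.
        jobs = os.cpu_count() or 1
    command = ["make"]
    if target:
        command.append(target)
    if jobs > 1:
        command += ["-j", str(jobs)]
    return command
```

With jobs=4 this produces the same make -j 4 invocation suggested above. Passing a target such as "depend" covers the later build steps as well.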
Next, build Kaldi itself:
cd kaldi-master/src
./configure --shared --mathlib=ATLAS
make depend
make
Use make depend -j 4 and make -j 4 if you have multiple CPU cores. This will take a long time.
There is no installation step. The kaldi-master directory contains all the libraries and programs that Rhasspy will need to access.
See docker-kaldi for a Docker build script.
Make sure you have the necessary dependencies installed:
sudo apt-get install build-essential
First, download and build OpenFST 1.6.2:
wget http://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.6.2.tar.gz
tar -xvf openfst-1.6.2.tar.gz
cd openfst-1.6.2
./configure \
"--prefix=$(pwd)/build" \
--enable-static --enable-shared \
--enable-far --enable-ngram-fsts
make
make install
Use make -j 4 if you have multiple CPU cores. This will take a long time.
Next, download and extract Phonetisaurus:
wget -O phonetisaurus-master.tar.gz \
'https://github.com/AdolfVonKleist/Phonetisaurus/archive/master.tar.gz'
tar -xvf phonetisaurus-master.tar.gz
Finally, build Phonetisaurus (where /path/to/openfst is the openfst-1.6.2 directory from above):
cd Phonetisaurus-master
./configure \
--with-openfst-includes=/path/to/openfst/build/include \
--with-openfst-libs=/path/to/openfst/build/lib
make
make install
Use make -j 4 if you have multiple CPU cores. This will take a long time.
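In a scripted build, the two OpenFST paths passed to ./configure can be derived from the single openfst-1.6.2 directory. The helper below is a hypothetical sketch that assumes the build prefix from the OpenFST step above (a build subdirectory inside the source tree).

```python
from pathlib import PurePosixPath


def phonetisaurus_configure_command(openfst_dir):
    """Build the ./configure argv list for Phonetisaurus."""
    # OpenFST was installed with --prefix=$(pwd)/build in the earlier step.
    build_dir = PurePosixPath(openfst_dir) / "build"
    return [
        "./configure",
        f"--with-openfst-includes={build_dir / 'include'}",
        f"--with-openfst-libs={build_dir / 'lib'}",
    ]
```

The list is meant to be run from the Phonetisaurus-master directory, for example with subprocess.run(command, check=True, cwd="Phonetisaurus-master").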
You should now be able to run the phonetisaurus-align program.
See docker-phonetisaurus for a Docker build script.