Wake-word detection #3467
Conversation
Cool!
…On Mon, Jul 15, 2019 at 5:39 PM Yiming Wang ***@***.***> wrote:
Results of the regular LF-MMI based recipes:
Mobvoi: EER=~0.2%, FRR=1.02% at FAH=1.5 vs. FRR=3.8% at FAH=1.5 (Mobvoi paper: http://lxie.nwpu-aslp.org/papers/2019ICASSP-XiongWang.pdf)
SNIPS: EER=~0.1%, FRR=0.08% at FAH=0.5 vs. FRR=0.12% at FAH=0.5 (SNIPS paper: https://arxiv.org/pdf/1811.07684.pdf)
E2E LF-MMI recipes are still being run to confirm the reproducibility of the previous results.
I don't see the advantage of making it a separate file, unless you have something more complicated in mind than just one word.
…On Mon, Jul 15, 2019 at 8:08 PM Xingyu Na ***@***.***> wrote:
In egs/mobvoi/v1/local/chain/run_e2e_tdnn.sh:
> +remove_egs=false
+
+# training options
+srand=0
+num_epochs=2
+num_jobs_initial=2
+num_jobs_final=5
+minibatch_size=150=128,64/300=100,64,32/600=50,32,16/1200=16,8
+common_egs_dir=
+dim=80
+bn_dim=20
+frames_per_iter=3000000
+bs_scale=0.0
+train_set=train_shorter_combined_spe2e
+test_sets="dev eval"
+wake_word="嗨小问"
What about maintaining a wake-word file in conf instead?
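For illustration, a minimal sketch of what that could look like (the conf file name and the helper below are hypothetical, not part of this PR):

```python
# hypothetical helper: read the wake word from conf/wake_word.conf
# instead of hard-coding it in each run script.
from pathlib import Path

def read_wake_word(conf_path="conf/wake_word.conf"):
    # expected file content, e.g.:  wake_word=嗨小问
    for line in Path(conf_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            if key.strip() == "wake_word":
                return value.strip()
    raise ValueError(f"no wake_word entry in {conf_path}")
```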
@danpovey the current recipes in this PR are ready for review
@danpovey
I think it can be done with a click when merging.
On Mon, Aug 19, 2019 at 1:44 AM csukuangfj ***@***.***> wrote:
@danpovey should the pull request be squashed into a single commit?
A few small comments
OK. 0.4 seconds of right context is a little on the long side in terms of latency.
…On Sat, Aug 24, 2019 at 1:01 PM Yiming Wang ***@***.***> wrote:
In egs/snips/v1/local/chain/tuning/run_e2e_tdnn_1a.sh:
> + input dim=40 name=input
+
+ # please note that it is important to have input layer with the name=input
+ # as the layer immediately preceding the fixed-affine-layer to enable
+ # the use of short notation for the descriptor
+
+ relu-batchnorm-dropout-layer name=tdnn1 input=Append(-2,-1,0,1,2) $affine_opts dim=$dim
+ tdnnf-layer name=tdnnf2 $tdnnf_opts dim=$dim bottleneck-dim=$bn_dim time-stride=1
+ tdnnf-layer name=tdnnf3 $tdnnf_opts dim=$dim bottleneck-dim=$bn_dim time-stride=1
+ tdnnf-layer name=tdnnf4 $tdnnf_opts dim=$dim bottleneck-dim=$bn_dim time-stride=1
+ tdnnf-layer name=tdnnf5 $tdnnf_opts dim=$dim bottleneck-dim=$bn_dim time-stride=1
+ tdnnf-layer name=tdnnf6 $tdnnf_opts dim=$dim bottleneck-dim=$bn_dim time-stride=1
+ tdnnf-layer name=tdnnf7 $tdnnf_opts dim=$dim bottleneck-dim=$bn_dim time-stride=1
+ tdnnf-layer name=tdnnf8 $tdnnf_opts dim=$dim bottleneck-dim=$bn_dim time-stride=1
+ tdnnf-layer name=tdnnf9 $tdnnf_opts dim=$dim bottleneck-dim=$bn_dim time-stride=0
+ tdnnf-layer name=tdnnf10 $tdnnf_opts dim=$dim bottleneck-dim=$bn_dim time-stride=3
I remember I added layers one by one until it reached the optimum. The left and right contexts are now both 41 frames, which covers 0.8 s. This optimum was obtained under a previous setup where negative examples were not segmented into pieces; I can check whether this many layers is still optimal for the current setup.
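As a sanity check on numbers like these, here is a back-of-the-envelope sketch of how per-layer time-strides add up to the network's total context and latency (the stride list is a placeholder, not the exact configuration in this PR):

```python
# each layer with symmetric time-stride s widens the receptive field
# by s frames on each side; at a 10 ms frame shift, the right context
# is the extra decoding latency the network adds.
FRAME_SHIFT_MS = 10

def context_frames(strides):
    # total one-sided context is just the sum of the per-layer strides
    return sum(strides)

strides = [2, 1, 1, 1, 1, 1, 1, 1, 0, 3]  # placeholder per-layer strides
c = context_frames(strides)
print(f"left/right context: {c} frames each, "
      f"right-context latency ~{c * FRAME_SHIFT_MS / 1000:.2f} s")
```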
@freewym Thanks for the reply. I found that the LM does not make a significant difference to the FPR and FNR. Does that mean we don't need the decoder (AM+LM), and that the results are still good using only nnet3-compute (AM only)?
Have you tried increasing the left/right context?
How do you get word-level hypotheses without an LM or decoding graph? The output of nnet3-compute is a sequence of probabilities over pdf-ids, which I assume needs decoding or post-processing to obtain word-level hypotheses.
No. I think the current receptive field (80) is already large enough; if anything we should try to reduce it, not increase it further.
That is to say, we have 18-dimensional outputs with pdf-ids 0 to 17: 0-1 are for SIL, 2-9 for the wake word, and 10-17 for non-wake-word. In the wake-word case (not ASR), will the sequence of per-frame argmaxes over the 18-dimensional probabilities just be a run of 2-9s or 10-17s, with 0-1s at the beginning and end of the sequence? I haven't checked that; it's just a guess. BTW, after decoding, in ali.txt the alignments of the wake word are long sequences of 3s (11s for non-wake-word); 2 and 4-9 hardly appear. Is that normal?
In HMMs you need to use Viterbi decoding rather than simply applying argmax on individual frames, because there are transition constraints between HMM states; i.e., we are looking for the most probable sequence, not the sequence of most probable individual frames. Yes, I have the same observation: it only happens for the E2E LF-MMI system. If you use GMM alignments for regular LF-MMI, there is no such thing. The reason may be that E2E LF-MMI maximizes the sequence loss and so has much more freedom to learn the alignment.
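To make the difference concrete, here is a minimal numpy sketch contrasting the two (not the recipe's actual decoder, which uses Kaldi's graph and lattice tools):

```python
import numpy as np

def framewise_argmax(log_probs):
    # picks the most probable state per frame, ignoring transitions
    return np.argmax(log_probs, axis=1)

def viterbi(log_probs, log_trans, log_init):
    # log_probs: (T, S) per-frame log-likelihoods over states
    # log_trans: (S, S) log transition probs, -inf where forbidden
    # log_init:  (S,)  log initial-state probs
    T, S = log_probs.shape
    delta = log_init + log_probs[0]           # best score ending in each state
    back = np.zeros((T, S), dtype=int)        # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (prev, cur) candidate scores
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_probs[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):             # trace back the best path
        path[t - 1] = back[t, path[t]]
    return path
```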
It seems that this kind of alignment does not cause any problem in the following script, run_tdnn_e2eali.sh. And interestingly, if I set num-nonsil-states=1, the results are still promising. I don't know whether it is the same for regular LF-MMI yet.
Yes, running run_tdnn_e2eali.sh may even further improve the metrics. I tried num-nonsil-states=1 for regular LF-MMI, and the GMM model did not seem good enough to generate alignments for LF-MMI. I had planned to try it with E2E LF-MMI but haven't done so yet. Did you find it still achieves comparable performance?
@freewym Yes, as far as I can tell. I changed num-nonsil-states to 1 in run_e2e.sh; however, the num-nonsil-states option does not propagate, so I also changed num-nonsil-states and the gen_topo.pl 4 1 invocation in local/chain/run_e2e_tdnn.sh. I notice that you have two LMs in local/chain/run_e2e_tdnn.sh: fst.txt and phone_lm.txt. Are they equivalent? The paths of fst.txt are clear, but I am a little confused about phone_lm.txt.
@freewym Is phone_lm.txt created so that the denominator graph is acyclic?
Yes
Thanks
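To make the acyclicity concrete, here is a hypothetical word-level acceptor in OpenFst text format with the shape such an LM has (the actual phone_lm.txt in the recipe is at the phone level, and the symbols here are illustrative). Every path goes start -> SIL -> (wake word | FREETEXT) -> SIL -> final with no loops, so the denominator graph compiled from it contains no cycles:

```
0 1 SIL
1 2 嗨小问
1 2 FREETEXT
2 3 SIL
3
```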
Hi @freewym, thanks for your work! I'm running this script with a keyword-spotting dataset, and I ran into some problems. The original dataset is a variety of near-field speech; I cut out the positive samples and concatenated the negative samples as training data and test data. Is my training data too small? Or is the wake-word task not suitable for keyword spotting in long speech?
Perhaps the issue is that your training and test data are mismatched, e.g. all your positive examples were isolated?
…On Sun, Nov 29, 2020 at 10:57 AM ybNo1 ***@***.***> wrote:
Hi @freewym, thanks for your work! I'm running this script with a dataset for keyword spotting, and I ran into some problems.
I use my own dataset for training and testing.
I found that the accuracy and recall of my trained model when testing a single 1 s wake word / FREETEXT were similar to those of the original script on the Mobvoi HotWords dataset (98%). But when I test audio containing wake words within longer speech, e.g. 3 s or 5 s, the recognition result is usually FREETEXT, especially when the wake word appears in the middle or toward the end of the speech, no matter how I adjust the weight of the LM.
The original dataset is a variety of near-field speech. I cut out the positive samples and concatenated the negative samples as training data and test data.
There are about 10 wake words; each wake word has about 1000 training audios of roughly 0.8-1.2 s, and the negative samples are about 30 hours of speech without wake words. The dataset is trained with mobvoihotwords/v1/run_e2e.sh with almost no changes.
Is my training data too small? Or is the wake-word task not suitable for keyword spotting in long speech?
After all, I think the gap between positive/negative audio in the Mobvoi dataset is bigger than within continuous speech, and for a pure speech dataset the data distribution is basically the same.
When the wake word appears in the middle or at the end of your test examples, is there silence surrounding the wake word, or does the wake word immediately follow other words? Does your training data also contain examples with such conditions? If not, it's probably because during training the model only sees positive examples where the wake word is present on its own, i.e. surrounded by silence only, so the model is not trained to differentiate the case where the wake word occurs within a sentence.
Thanks for your reply! Sorry, I don't understand "isolated". My test script is like: #cat online_decoding.sh
I constructed 3 test audios, using a 1 s wake_word.wav and a 1 s sil.wav that contains only silence:
The result and alignment are:
And the trans-ids: #show-transitions data/lang/phones.txt exp/chain/e2e_tdnn_1a/final.mdl
I wonder if the results would be better if you did the same thing to the training data, i.e. added silence before or after the wake word randomly.
Thanks!! As you said, the positive audio in the training set is isolated. I'll try adding random speech/noise to the beginning/end of the positive audio, and I'll post the results of the experiment later. Thanks again for replying.
I suggest only adding random silence for now, as the current recipe assumes the positive training examples contain only the wake word and silence; a sketch of this augmentation follows below.
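For reference, a minimal sketch of that augmentation (it assumes the soundfile package; paths and parameters are placeholders). Note that zero samples are pure digital silence, which may differ from the recorded room tone in real data:

```python
import numpy as np
import soundfile as sf

def pad_with_silence(in_wav, out_wav, max_sil_sec=1.0, seed=None):
    # pad a positive example with a random amount of silence on each side
    rng = np.random.default_rng(seed)
    audio, sr = sf.read(in_wav)
    pre = np.zeros(int(rng.uniform(0, max_sil_sec) * sr), dtype=audio.dtype)
    post = np.zeros(int(rng.uniform(0, max_sil_sec) * sr), dtype=audio.dtype)
    sf.write(out_wav, np.concatenate([pre, audio, post]), sr)
```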
Actually I just took silence audio as a test case; other test results using non-wake-word speech are similar.
You can first look at whether adding silence to training helps on test examples with added silence. Then we can at least confirm our conjecture.
OK, I'll try only adding silence.
Adding silence to the training data works for me; the trained model can now recognize correctly with silence appended before/after the wake word.
If you were to add FREETEXT to positive examples, you would also need to modify their corresponding text file to reflect the change in the audio. I may also need to modify the phone LM topology to include paths like freetext -> wake-word -> freetext.
If I add FREETEXT to positive data, should the text format be like "utt_id FREETEXT WORD FREETEXT"?
If FREETEXT is added only to the beginning, for example, it's better to change the text to "FREETEXT WORD"; see the example entries below. Regarding adding silence, it is not necessary to change the text, as silence is already treated as "optional silence" before and after each word when specifying the lexicon.
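So the Kaldi text entries for positive examples might look like the following (utterance IDs are made up, and WORD stands for the wake word; no change is needed when only silence is added):

```
utt001 FREETEXT WORD
utt002 WORD
```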
Hi @freewym, thanks for your work. I ran into the same problem as ybNo1, so I tried adding a random 0-1.5 s of silence at the start and end of the keyword audio, then used this data to train a model. But this time I ran into another problem: I created an eval set in which 1 s of silence is added at the start and end of every audio. On this eval set the model has 100% recall but 622 false alarms per hour.
I am not sure if the added silence has some artifacts that made the network learn the wrong signals. Maybe you can try also adding silence to the negative training examples.
Thanks for your reply. I will try the method you suggested.
Results of the regular LF-MMI based recipes:
Mobvoi: EER=~0.2%, FRR=1.02% at FAH=1.5 vs. FRR=3.8% at FAH=1.5 (Mobvoi paper)
SNIPS: EER=~0.1%, FRR=0.08% at FAH=0.5 vs. FRR=0.12% at FAH=0.5 (SNIPS paper)
E2E LF-MMI recipes are still being run to confirm the reproducibility of the previous results.
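For readers reproducing these numbers, here is a rough sketch (not the PR's compute_metrics.py) of how the FRR at a given false-alarms-per-hour (FAH) operating point can be read off from detection scores; the score arrays and the negative-audio duration are placeholders:

```python
import numpy as np

def frr_at_fah(pos_scores, neg_scores, neg_hours, target_fah):
    # choose the threshold so the false-alarm rate on negative data
    # is at most target_fah alarms per hour, then measure FRR there
    neg_sorted = np.sort(neg_scores)[::-1]        # descending
    max_false_alarms = int(target_fah * neg_hours)
    if max_false_alarms >= len(neg_sorted):
        thresh = -np.inf                          # everything accepted
    else:
        thresh = neg_sorted[max_false_alarms]     # first score to reject
    false_rejects = np.sum(pos_scores <= thresh)
    return false_rejects / len(pos_scores)
```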