
Wake-word detection #3467

Merged (20 commits) on Apr 19, 2020
Conversation

@freewym (Contributor, Author) commented Jul 15, 2019

Results of the regular LF-MMI based recipes:

Mobvoi: EER=~0.2%, FRR=1.02% at FAH=1.5 vs. FRR=3.8% at FAH=1.5 (Mobvoi paper)

SNIPS: EER=~0.1%, FRR=0.08% at FAH=0.5 vs. FRR=0.12% at FAH=0.5 (SNIPS paper)

E2E LF-MMI recipes are still being run to confirm the reproducibility of the previous results.

@freewym freewym changed the title Wake-word detection WIP: Wake-word detection Jul 15, 2019
@freewym freewym changed the title WIP: Wake-word detection [WIP] Wake-word detection Jul 15, 2019
@jtrmal (Contributor) commented Jul 15, 2019 via email

@danpovey (Contributor) commented Jul 16, 2019 via email

@freewym freewym force-pushed the wake-word branch 2 times, most recently from 53d3ee6 to 92e774b Compare July 18, 2019 21:33
@freewym freewym force-pushed the wake-word branch 2 times, most recently from fc65580 to f12cf50 Compare August 6, 2019 06:28
@freewym freewym changed the title [WIP] Wake-word detection Wake-word detection Aug 7, 2019
@freewym (Contributor, Author) commented Aug 15, 2019

@danpovey the current recipes in this PR are ready to review

Review comment on egs/mobvoi/v1/run.sh (outdated, resolved)
@csukuangfj (Contributor) commented:

@danpovey Should the pull request be squashed into a single commit?

@freewym (Contributor, Author) commented Aug 19, 2019 via email

@danpovey (Contributor) left a review:

A few small comments

Review comments on:
egs/mobvoi/v1/local/process_lattice.sh (outdated, resolved)
egs/mobvoi/v1/local/score.sh (outdated, resolved)
egs/mobvoi/v1/run_e2e.sh (resolved)
egs/snips/v1/local/gen_topo.pl (outdated, resolved)
@danpovey (Contributor) commented Aug 24, 2019 via email

@freewym freewym force-pushed the wake-word branch 3 times, most recently from 138d5c7 to 27f32e5 Compare August 27, 2019 18:05
@freewym freewym force-pushed the wake-word branch 2 times, most recently from 868fc0e to 9e2471d Compare September 13, 2019 20:18
@framsc commented May 13, 2020

@freewym Thanks for the reply. I found that the LM does not make a significant difference to the FPR and FNR. Does that mean we don't need the decoder (AM+LM), and that the results are still good using only nnet3-compute (AM only)?
Please tell me if I'm missing something, and thanks again for sharing this great work.

@csukuangfj (Contributor) commented:

> I tried to reduce context by removing layers from the current network config, and the performance slightly degraded

Have you tried to increase the left/right context?

@freewym (Contributor, Author) commented May 13, 2020

> @freewym Thanks for the reply. I found that the LM does not make a significant difference to the FPR and FNR. Does that mean we don't need the decoder (AM+LM), and that the results are still good using only nnet3-compute (AM only)?
> Please tell me if I'm missing something, and thanks again for sharing this great work.

How do you get word-level hypotheses without an LM or decoding graph? The output of nnet3-compute is a sequence of probs of pdf-ids, which I assume needs decoding or post-processing to obtain word-level hypotheses.
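For reference, a rough sketch of the two pipelines being contrasted, using standard Kaldi binaries (paths and options here are illustrative, not the exact ones used in this recipe):

# AM only: nnet3-compute writes a matrix of per-frame network outputs (pdf-id scores)
# per utterance; there are no word-level hypotheses at this point.
nnet3-compute --use-gpu=no final.mdl scp:data/test/feats.scp ark,t:nnet_output.ark

# AM + decoding graph (HCLG.fst encodes the lexicon and LM): Viterbi/lattice decoding
# produces word-level hypotheses.
nnet3-latgen-faster --acoustic-scale=1.0 --frame-subsampling-factor=3 \
  final.mdl graph/HCLG.fst scp:data/test/feats.scp ark:- \
  | lattice-best-path --acoustic-scale=1.0 ark:- ark,t:words_int.txt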

@freewym (Contributor, Author) commented May 13, 2020

> I tried to reduce context by removing layers from the current network config, and the performance slightly degraded

> Have you tried to increase the left/right context?

No. I think the current receptive field (80) is already large enough; we should only reduce it if we can, not increase it further.

@framsc commented May 13, 2020

> @freewym Thanks for the reply. I found that the LM does not make a significant difference to the FPR and FNR. Does that mean we don't need the decoder (AM+LM), and that the results are still good using only nnet3-compute (AM only)?
> Please tell me if I'm missing something, and thanks again for sharing this great work.

> How do you get word-level hypotheses without an LM or decoding graph? The output of nnet3-compute is a sequence of probs of pdf-ids, which I assume needs decoding or post-processing to obtain word-level hypotheses.

That is to say, we have 18-dimensional outputs with pdf-ids 0 to 17: 0-1 are for SIL, 2-9 for the wake word, and 10-17 for non-wake-word. In the wake-word case (not ASR), wouldn't the sequence of per-frame argmax values over the 18-d probs simply be a run of 2-9s or 10-17s, with 0-1s at the beginning and end of the sequence? I haven't checked that; it's just a guess.

BTW, after decoding, in ali.txt the alignments of the wake word are long sequences of 3s (and 11s for non-wake-word); 2 and 4-9 hardly appear. Is that normal?

@freewym (Contributor, Author) commented May 13, 2020

> @freewym Thanks for the reply. I found that the LM does not make a significant difference to the FPR and FNR. Does that mean we don't need the decoder (AM+LM), and that the results are still good using only nnet3-compute (AM only)?
> Please tell me if I'm missing something, and thanks again for sharing this great work.

> How do you get word-level hypotheses without an LM or decoding graph? The output of nnet3-compute is a sequence of probs of pdf-ids, which I assume needs decoding or post-processing to obtain word-level hypotheses.

> That is to say, we have 18-dimensional outputs with pdf-ids 0 to 17: 0-1 are for SIL, 2-9 for the wake word, and 10-17 for non-wake-word. In the wake-word case (not ASR), wouldn't the sequence of per-frame argmax values over the 18-d probs simply be a run of 2-9s or 10-17s, with 0-1s at the beginning and end of the sequence? I haven't checked that; it's just a guess.

> BTW, after decoding, in ali.txt the alignments of the wake word are long sequences of 3s (and 11s for non-wake-word); 2 and 4-9 hardly appear. Is that normal?

In HMMs you need to use Viterbi decoding rather than simply applying argmax to individual frames, because there are transition constraints between HMM states; i.e. we are looking for the most probable state sequence, not the sequence of most probable individual frames.
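In symbols (my notation, just to make the distinction concrete): Viterbi finds the best whole path, s*_{1:T} = argmax_{s_1..s_T} prod_t p(s_t | s_{t-1}) p(o_t | s_t), whereas per-frame argmax picks s*_t = argmax_s p(s | o_t) independently at each frame, which can produce a state sequence that no path through the HMM topology actually allows.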

Yes, I have the same observation: it only happens with the E2E LF-MMI system. If you use GMM alignments for regular LF-MMI, there is no such thing. The reason may be that E2E LF-MMI maximizes a sequence-level loss and so has much more freedom in learning the alignment.

@framsc commented May 13, 2020

It seems that this kind of alignment does not cause any problems in the following script, run_tdnn_e2eali.sh. And it's interesting that if I set num-nonsil-states=1, the results are still promising. I don't know whether it's the same for regular LF-MMI yet.

@freewym (Contributor, Author) commented May 13, 2020

> It seems that this kind of alignment does not cause any problems in the following script, run_tdnn_e2eali.sh. And it's interesting that if I set num-nonsil-states=1, the results are still promising. I don't know whether it's the same for regular LF-MMI yet.

Yes, running run_tdnn_e2eali.sh may even further improve the metrics. I tried num-nonsil-states=1 for regular LF-MMI, and the GMM model did not seem good enough to generate alignments for LF-MMI. I had planned to try it with E2E LF-MMI but haven't done so yet. Did you find it still achieves comparable performance?

@framsc commented May 14, 2020

@freewym Yes, if I haven't done anything wrong. I changed num-nonsil-states to 1 in run_e2e.sh. However, the num-nonsil-states option does not propagate, so I also changed num-nonsil-states (and the gen_topo.pl 4 1 call) in local/chain/run_e2e_tdnn.sh.

I noticed that you have two LMs in local/chain/run_e2e_tdnn.sh -- fst.txt and phone_lm.txt. Are they equivalent? The paths in fst.txt are clear, but I am a little confused about phone_lm.txt.

@framsc commented May 21, 2020

@freewym Is phone_lm.txt created so that the denominator graph is acyclic?

@freewym (Contributor, Author) commented May 22, 2020

> @freewym Is phone_lm.txt created so that the denominator graph is acyclic?

Yes

@framsc commented May 23, 2020

> @freewym Is phone_lm.txt created so that the denominator graph is acyclic?
>
> Yes

Thanks

pc-seawind pushed a commit to pc-seawind/kaldi that referenced this pull request Jun 4, 2020
@ybNo1 commented Nov 29, 2020

Hi @freewym, thanks for your work! I'm running this script with a keyword-spotting dataset of my own, and I ran into some problems.
I use my own dataset for training and testing.
I found that the accuracy and recall of my trained model when testing single 1 s wake-word/FREETEXT clips were similar to those of the original script on the Mobvoi HotWords dataset (98%). But when I test audio containing a wake word inside longer speech, for example 3 s or 5 s, the recognition result is usually FREETEXT, especially when the wake word appears in the middle or at the end of the speech, no matter how I adjust the weight of the LM.

The original dataset is a variety of near-field speech. I cut out the positive samples and concatenate the negative samples as training data and test data.
There are about 10 wake words, each with about 1000 training clips of roughly 0.8-1.2 s, and the negative samples are about 30 hours of speech without wake words. The dataset is trained with mobvoihotwords/v1/run_e2e.sh with almost no changes.

Is my training data too small? Or is this wake-word recipe just not suitable for keyword spotting in long speech?
After all, I think the gap between positive and negative audio in the Mobvoi dataset is bigger than within continuous speech, while for a pure speech dataset the data distribution is basically the same.

@danpovey (Contributor) commented Nov 29, 2020 via email

@freewym (Contributor, Author) commented Nov 29, 2020

When the wake word appears in the middle or at the end of your test examples, is there silence surrounding the wake word, or does the wake word immediately follow the other words?

Does your training data also contain examples of that condition? If not, I think it's probably because during training the model only sees positive examples in which the wake word appears alone, i.e. surrounded only by silence, so the model is not trained to differentiate the case where the wake word occurs within a sentence.

@ybNo1 commented Nov 29, 2020

> Perhaps the issue is your training and test data mismatch, e.g. all your examples were isolated?


Thanks for your reply! Sorry, I don't understand "isolated".
My training set and 1 s test set are split from the same dataset, and my 3 s or 5 s test set is just a concatenation of negative and positive data from the training set.
What puzzles me is that the model can recognize a short, 1 s wake-word clip, but when I concatenate that exact positive audio with some FREETEXT from the training set, or with just silence, it recognizes it as FREETEXT.
Below is an example:

My test script is like:

#cat online_decoding.sh

wake_word_id=$1
audio=$2

~/software/kaldi/src/online2bin/online2-wav-nnet3-wake-word-decoder-faster \
  --frames-per-chunk=150 --extra-left-context-initial=0 \
  --frame-subsampling-factor=3 \
  --min-active=200 --max-active=7000 \
  --beam=10 \
  --acoustic-scale=1.0 \
  --config=model/conf/online.conf \
  --wake-word-id=$wake_word_id \
  model/final.mdl \
  model/HCLG.fst \
  "ark,t:echo utt1 utt1|" \
  "scp:echo utt1 $audio |" \
  model/words.txt \
  "ark,t:| >/dev/null" \
  "ark,t:-" #2>/dev/null

I constructed 3 test audios using a 1 s wake_word.wav and a 1 s sil.wav (silence only):

sox wake_word.wav test1.wav
sox wake_word.wav sil.wav test2.wav
sox sil.wav wake_word.wav sil.wav test3.wav 

The results and alignments are:

./online_decoding.sh 4 test1.wav
"""
<sil> WAKE_WORD5 <sil> 

utt1 2 44 43 43 43 43 43 43 43 43 43 43 43 46 45 45 45 48 50 49 49 49 2
"""
./online_decoding.sh 4 test2.wav
"""
<sil> WAKE_WORD5 <sil> <sil> ELSE <sil> 

utt1 2 44 43 43 43 43 43 43 46 48 50 2 2 12 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 14 16 18 2
"""
./online_decoding.sh 4 test3.wav
"""
<sil> FREETEXT <sil> 

utt1 2 12 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 14 16 18 2
"""

And the transition-ids:

#show-transitions data/lang/phones.txt exp/chain/ete_tdnn_1a/final.mdl

Transition-state 1: phone = SIL hmm-state = 0 forward-pdf = 0 self-loop-pdf = 1
 Transition-id = 1 p = 0.5 [self-loop]
 Transition-id = 2 p = 0.5 [0 -> 1]
Transition-state 2: phone = WAKE_WORD1 hmm-state = 0 forward-pdf = 2 self-loop-pdf = 3
 Transition-id = 3 p = 0.5 [self-loop]
 Transition-id = 4 p = 0.5 [0 -> 1]
Transition-state 3: phone = WAKE_WORD1 hmm-state = 1 forward-pdf = 4 self-loop-pdf = 5
 Transition-id = 5 p = 0.5 [self-loop]
 Transition-id = 6 p = 0.5 [1 -> 2]
Transition-state 4: phone = WAKE_WORD1 hmm-state = 2 forward-pdf = 6 self-loop-pdf = 7
 Transition-id = 7 p = 0.5 [self-loop]
 Transition-id = 8 p = 0.5 [2 -> 3]
Transition-state 5: phone = WAKE_WORD1 hmm-state = 3 forward-pdf = 8 self-loop-pdf = 9
 Transition-id = 9 p = 0.5 [self-loop]
 Transition-id = 10 p = 0.5 [3 -> 4]
Transition-state 6: phone = FREETEXT hmm-state = 0 forward-pdf = 10 self-loop-pdf = 11
 Transition-id = 11 p = 0.5 [self-loop]
 Transition-id = 12 p = 0.5 [0 -> 1]
Transition-state 7: phone = FREETEXT hmm-state = 1 forward-pdf = 12 self-loop-pdf = 13
 Transition-id = 13 p = 0.5 [self-loop]
 Transition-id = 14 p = 0.5 [1 -> 2]
Transition-state 8: phone = FREETEXT hmm-state = 2 forward-pdf = 14 self-loop-pdf = 15
 Transition-id = 15 p = 0.5 [self-loop]
 Transition-id = 16 p = 0.5 [2 -> 3]
Transition-state 9: phone = FREETEXT hmm-state = 3 forward-pdf = 16 self-loop-pdf = 17
 Transition-id = 17 p = 0.5 [self-loop]
 Transition-id = 18 p = 0.5 [3 -> 4]
Transition-state 10: phone = WAKE_WORD2 hmm-state = 0 forward-pdf = 18 self-loop-pdf = 19
 Transition-id = 19 p = 0.5 [self-loop]
 Transition-id = 20 p = 0.5 [0 -> 1]
Transition-state 11: phone = WAKE_WORD2 hmm-state = 1 forward-pdf = 20 self-loop-pdf = 21
 Transition-id = 21 p = 0.5 [self-loop]
 Transition-id = 22 p = 0.5 [1 -> 2]
Transition-state 12: phone = WAKE_WORD2 hmm-state = 2 forward-pdf = 22 self-loop-pdf = 23
 Transition-id = 23 p = 0.5 [self-loop]
 Transition-id = 24 p = 0.5 [2 -> 3]
Transition-state 13: phone = WAKE_WORD2 hmm-state = 3 forward-pdf = 24 self-loop-pdf = 25
Transition-id = 25 p = 0.5 [self-loop]
 Transition-id = 26 p = 0.5 [3 -> 4]
Transition-state 14: phone = WAKE_WORD3 hmm-state = 0 forward-pdf = 26 self-loop-pdf = 27
 Transition-id = 27 p = 0.5 [self-loop]
 Transition-id = 28 p = 0.5 [0 -> 1]
Transition-state 15: phone = WAKE_WORD3 hmm-state = 1 forward-pdf = 28 self-loop-pdf = 29
 Transition-id = 29 p = 0.5 [self-loop]
 Transition-id = 30 p = 0.5 [1 -> 2]
Transition-state 16: phone = WAKE_WORD3 hmm-state = 2 forward-pdf = 30 self-loop-pdf = 31
 Transition-id = 31 p = 0.5 [self-loop]
 Transition-id = 32 p = 0.5 [2 -> 3]
Transition-state 17: phone = WAKE_WORD3 hmm-state = 3 forward-pdf = 32 self-loop-pdf = 33
 Transition-id = 33 p = 0.5 [self-loop]
 Transition-id = 34 p = 0.5 [3 -> 4]
Transition-state 18: phone = WAKE_WORD4 hmm-state = 0 forward-pdf = 34 self-loop-pdf = 35
 Transition-id = 35 p = 0.5 [self-loop]
 Transition-id = 36 p = 0.5 [0 -> 1]
Transition-state 19: phone = WAKE_WORD4 hmm-state = 1 forward-pdf = 36 self-loop-pdf = 37
 Transition-id = 37 p = 0.5 [self-loop]
 Transition-id = 38 p = 0.5 [1 -> 2]
Transition-state 20: phone = WAKE_WORD4 hmm-state = 2 forward-pdf = 38 self-loop-pdf = 39
 Transition-id = 39 p = 0.5 [self-loop]
 Transition-id = 40 p = 0.5 [2 -> 3]
Transition-state 21: phone = WAKE_WORD4 hmm-state = 3 forward-pdf = 40 self-loop-pdf = 41
 Transition-id = 41 p = 0.5 [self-loop]
 Transition-id = 42 p = 0.5 [3 -> 4]
Transition-state 22: phone = WAKE_WORD5 hmm-state = 0 forward-pdf = 42 self-loop-pdf = 43
 Transition-id = 43 p = 0.5 [self-loop]
 Transition-id = 44 p = 0.5 [0 -> 1]
Transition-state 23: phone = WAKE_WORD5 hmm-state = 1 forward-pdf = 44 self-loop-pdf = 45
 Transition-id = 45 p = 0.5 [self-loop]
 Transition-id = 46 p = 0.5 [1 -> 2]
Transition-state 24: phone = WAKE_WORD5 hmm-state = 2 forward-pdf = 46 self-loop-pdf = 47
 Transition-id = 47 p = 0.5 [self-loop]
 Transition-id = 48 p = 0.5 [2 -> 3]
Transition-state 25: phone = WAKE_WORD5 hmm-state = 3 forward-pdf = 48 self-loop-pdf = 49
 Transition-id = 49 p = 0.5 [self-loop]
 Transition-id = 50 p = 0.5 [3 -> 4]
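(For what it's worth, alignments like the ones above are transition-ids; instead of reading them off show-transitions, they can be mapped to phones or pdf-ids with standard Kaldi tools. A rough sketch, assuming the alignment has been saved to a text archive such as ali.txt:)

# per-frame phone ids; pipe through utils/int2sym.pl -f 2- data/lang/phones.txt to get phone names
ali-to-phones --per-frame=true model/final.mdl ark,t:ali.txt ark,t:-
# per-frame pdf-ids, i.e. the indices of the network outputs
ali-to-pdf model/final.mdl ark,t:ali.txt ark,t:-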

@freewym (Contributor, Author) commented Nov 29, 2020

I wonder if the results would be better if you do the same thing to the training data, i.e. add silence before or after the wake word randomly.

@ybNo1 commented Nov 29, 2020

> I wonder if the results would be better if you do the same thing to the training data, i.e. add silence before or after the wake word randomly.

Thanks!! As you said, the positive audio in the training set is isolated. I'll try adding random speech/noise to the beginning/end of the positive audio, and I'll post the results of the experiment later. Thanks again for replying.

@freewym (Contributor, Author) commented Nov 29, 2020

I suggest only adding random silence for now, as the current recipe assumes the positive training examples contain only the wake word and silence.
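For example, each positive training wav could be padded with a random amount of silence using sox's pad effect; a rough sketch (directory names and durations are made up, and this is not part of the recipe's data preparation):

mkdir -p data_pos_padded
for wav in data_pos/*.wav; do
  # draw 0-1 s of leading and trailing silence; seed awk from $RANDOM so each file differs
  pre=$(awk -v s=$RANDOM 'BEGIN{srand(s); printf "%.2f", rand()}')
  post=$(awk -v s=$RANDOM 'BEGIN{srand(s); printf "%.2f", rand()}')
  # pad inserts the given lengths of silence at the start and end of the file
  sox "$wav" "data_pos_padded/$(basename "$wav")" pad "$pre" "$post"
done

The wav.scp (and any utt2dur/segments files) would then need to point at the padded wavs.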

@ybNo1 commented Nov 29, 2020

> I suggest only adding random silence for now, as the current recipe assumes the positive training examples contain only the wake word and silence.

Actually, I just used silence audio as a test case; other test results using non-wake-word speech are similar.

@freewym (Contributor, Author) commented Nov 29, 2020

You can first check whether adding silence to the training data helps on the test examples with added silence. Then we can at least confirm our conjecture.

@ybNo1 commented Nov 29, 2020

> You can first check whether adding silence to the training data helps on the test examples with added silence. Then we can at least confirm our conjecture.

OK, I'll try only adding silence.

@ybNo1 commented Nov 30, 2020

> You can first check whether adding silence to the training data helps on the test examples with added silence. Then we can at least confirm our conjecture.

Adding silence to the training data works for me; now the trained model recognizes correctly when silence is appended before/after the wake word.
Besides, I then added random FREETEXT to the positive data, and the model can achieve about 50% at the wake-word EER, better than using the original data, which gives only about 10% recall. I guess I should try different ways of adding random FREETEXT.
Thanks for your advice! And sorry about asking questions in a PR, I didn't mean to. Thanks again!

@freewym (Contributor, Author) commented Nov 30, 2020

If you add FREETEXT to the positive examples, you also need to modify their corresponding text file to reflect the change in the audio. The phone LM topology may also need to be modified to include paths like freetext -> wake-word -> freetext.
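Purely as an illustration of what such a path means, here is a made-up fragment in OpenFst text format (src dst ilabel olabel, one arc per line, final state on its own line); the actual phone_lm.txt in the recipe may use different symbols, weights, and structure:

0 1 SIL SIL
1 2 WORD WORD
1 2 FREETEXT FREETEXT
1 3 FREETEXT FREETEXT
3 4 WORD WORD
4 2 FREETEXT FREETEXT
2 5 SIL SIL
5

Here the path 0-1-2-5 covers the existing <sil> WORD <sil> and <sil> FREETEXT <sil> cases, while 0-1-3-4-2-5 adds a freetext -> wake-word -> freetext path, and the graph remains acyclic.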

@ybNo1 commented Dec 1, 2020

> If you add FREETEXT to the positive examples, you also need to modify their corresponding text file to reflect the change in the audio. The phone LM topology may also need to be modified to include paths like freetext -> wake-word -> freetext.

If I add FREETEXT to the positive data, the text format should be like "utt_id FREETEXT WORD FREETEXT", is that right?
So when adding 12% sil to the data, why don't we change the text to "utt_id <sil> WORD <sil>"?

@freewym (Contributor, Author) commented Dec 1, 2020

If FREETEXT is added only to the beginning, for example, it is better to change the text to "FREETEXT WORD".

Re adding silence: it is not necessary to change the text, as silence is already treated as "optional silence" before and after each word when specifying the lexicon.
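Concretely, assuming the standard Kaldi data/<set>/text format (utterance id followed by the transcript; the symbols here are placeholders for whatever the recipe actually uses):

original positive example:         utt001 WORD
FREETEXT prepended to the audio:   utt001 FREETEXT WORD
silence prepended to the audio:    utt001 WORD        (text unchanged; covered by optional silence)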

@amurawite commented:
Hi @freewym, thanks for your work. I ran into the same problem as ybNo1, so I tried adding a random 0-1.5 s of silence at the start and end of the keyword audio and used this data to train a model. But this time I ran into another problem: I created an eval set in which 1 s of silence is added at the start and end of every audio clip, and on this added-silence eval set the model gets 100% recall but 622 false alarms per hour.

@freewym (Contributor, Author) commented Dec 11, 2020

I am not sure whether the added silence has artifacts that made the network learn the wrong signals. Maybe you can also try adding silence to the negative training examples.

@amurawite commented:
> I am not sure whether the added silence has artifacts that made the network learn the wrong signals. Maybe you can also try adding silence to the negative training examples.

Thanks for your reply. I will try the method you suggested.
