
Wake-word detection #3467

Merged (20 commits) on Apr 19, 2020
Conversation

@freewym (Contributor, Author) commented Jul 15, 2019

Results of the regular LF-MMI based recipes:

Mobvoi: EER=~0.2%, FRR=1.02% at FAH=1.5 vs. FRR=3.8% at FAH=1.5 (Mobvoi paper)

SNIPS: EER=~0.1%, FRR=0.08% at FAH=0.5 vs. FRR=0.12% at FAH=0.5 (SNIPS paper)

E2E LF-MMI recipes are still being run to confirm the reproducibility of the previous results.

@freewym freewym changed the title Wake-word detection WIP: Wake-word detection Jul 15, 2019
@freewym freewym changed the title WIP: Wake-word detection [WIP] Wake-word detection Jul 15, 2019
@jtrmal (Contributor) commented Jul 15, 2019 via email

@danpovey (Contributor) commented Jul 16, 2019 via email

@freewym freewym force-pushed the wake-word branch 2 times, most recently from 53d3ee6 to 92e774b Compare July 18, 2019 21:33
@freewym freewym force-pushed the wake-word branch 2 times, most recently from fc65580 to f12cf50 Compare August 6, 2019 06:28
@freewym freewym changed the title [WIP] Wake-word detection Wake-word detection Aug 7, 2019
@freewym (Contributor, Author) commented Aug 15, 2019

@danpovey the current recipes in this PR are ready to review

Review comment on egs/mobvoi/v1/run.sh (outdated, resolved)
@csukuangfj (Contributor) commented:

@danpovey Should the pull request be squashed into a single commit?

@freewym (Contributor, Author) commented Aug 19, 2019 via email

@danpovey (Contributor) left a review:

A few small comments

Review comments on:
egs/mobvoi/v1/local/process_lattice.sh (outdated, resolved)
egs/mobvoi/v1/local/score.sh (outdated, resolved)
egs/mobvoi/v1/run_e2e.sh (resolved)
egs/snips/v1/local/gen_topo.pl (outdated, resolved)
@danpovey (Contributor) commented Aug 24, 2019 via email

@freewym freewym force-pushed the wake-word branch 3 times, most recently from 138d5c7 to 27f32e5 Compare August 27, 2019 18:05
@freewym freewym force-pushed the wake-word branch 2 times, most recently from 868fc0e to 9e2471d Compare September 13, 2019 20:18
@framsc commented May 13, 2020

@freewym Thanks for the reply. I found that the LM does not make a significant difference to the FPR and FNR. Does that mean we don't need the decoder (AM+LM), and that the results are still good using only nnet3-compute (AM only)?
Please tell me if I'm missing something, and thanks again for sharing this great work.

@csukuangfj (Contributor) commented:

> I tried to reduce context by removing layers from the current network config, and the performance slightly degraded

Have you tried to increase the left/right context?

@freewym (Contributor, Author) commented May 13, 2020

> @freewym Thanks for the reply. I found that the LM does not make a significant difference to the FPR and FNR. Does that mean we don't need the decoder (AM+LM), and that the results are still good using only nnet3-compute (AM only)?
> Please tell me if I'm missing something, and thanks again for sharing this great work.

How do you get word-level hypotheses without an LM or decoding graph? The output of nnet3-compute is a sequence of probs of pdf-ids, which I assume needs decoding or post-processing to obtain word-level hypotheses.
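For reference, a rough sketch of the two pipelines being contrasted, using standard Kaldi binaries (paths and options here are illustrative, not the exact ones used in this recipe):

# AM only: nnet3-compute writes a matrix of per-frame network outputs (pdf-id scores)
# per utterance; there are no word-level hypotheses at this point.
nnet3-compute --use-gpu=no final.mdl scp:data/test/feats.scp ark,t:nnet_output.ark

# AM + decoding graph (HCLG.fst encodes the lexicon and LM): Viterbi/lattice decoding
# produces word-level hypotheses.
nnet3-latgen-faster --acoustic-scale=1.0 --frame-subsampling-factor=3 \
  final.mdl graph/HCLG.fst scp:data/test/feats.scp ark:- \
  | lattice-best-path --acoustic-scale=1.0 ark:- ark,t:words_int.txt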

@freewym (Contributor, Author) commented May 13, 2020

> I tried to reduce context by removing layers from the current network config, and the performance slightly degraded

> Have you tried to increase the left/right context?

No. I think the current receptive field (80) is already large enough; we should only reduce it if we can, not increase it further.

@framsc commented May 13, 2020

> @freewym Thanks for the reply. I found that the LM does not make a significant difference to the FPR and FNR. Does that mean we don't need the decoder (AM+LM), and that the results are still good using only nnet3-compute (AM only)?
> Please tell me if I'm missing something, and thanks again for sharing this great work.

> How do you get word-level hypotheses without an LM or decoding graph? The output of nnet3-compute is a sequence of probs of pdf-ids, which I assume needs decoding or post-processing to obtain word-level hypotheses.

That is to say, we have 18-dimensional outputs with pdf-ids 0 to 17: 0-1 are for SIL, 2-9 for the wake word, and 10-17 for non-wake-word. In the wake-word case (not ASR), wouldn't the sequence of per-frame argmax values over the 18-d probs simply be a run of 2-9s or 10-17s, with 0-1s at the beginning and end of the sequence? I haven't checked that; it's just a guess.

BTW, after decoding, in ali.txt the alignments of the wake word are long sequences of 3s (and 11s for non-wake-word); 2 and 4-9 hardly appear. Is that normal?

@freewym (Contributor, Author) commented May 13, 2020

> @freewym Thanks for the reply. I found that the LM does not make a significant difference to the FPR and FNR. Does that mean we don't need the decoder (AM+LM), and that the results are still good using only nnet3-compute (AM only)?
> Please tell me if I'm missing something, and thanks again for sharing this great work.

> How do you get word-level hypotheses without an LM or decoding graph? The output of nnet3-compute is a sequence of probs of pdf-ids, which I assume needs decoding or post-processing to obtain word-level hypotheses.

> That is to say, we have 18-dimensional outputs with pdf-ids 0 to 17: 0-1 are for SIL, 2-9 for the wake word, and 10-17 for non-wake-word. In the wake-word case (not ASR), wouldn't the sequence of per-frame argmax values over the 18-d probs simply be a run of 2-9s or 10-17s, with 0-1s at the beginning and end of the sequence? I haven't checked that; it's just a guess.

> BTW, after decoding, in ali.txt the alignments of the wake word are long sequences of 3s (and 11s for non-wake-word); 2 and 4-9 hardly appear. Is that normal?

In HMMs you need to use Viterbi decoding rather than simply applying argmax to individual frames, because there are transition constraints between HMM states; i.e. we are looking for the most probable state sequence, not the sequence of most probable individual frames.
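In symbols (my notation, just to make the distinction concrete): Viterbi finds the best whole path, s*_{1:T} = argmax_{s_1..s_T} prod_t p(s_t | s_{t-1}) p(o_t | s_t), whereas per-frame argmax picks s*_t = argmax_s p(s | o_t) independently at each frame, which can produce a state sequence that no path through the HMM topology actually allows.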

Yes, I have the same observation: it only happens with the E2E LF-MMI system. If you use GMM alignments for regular LF-MMI, there is no such thing. The reason may be that E2E LF-MMI maximizes a sequence-level loss and so has much more freedom in learning the alignment.

@framsc commented May 13, 2020

It seems that this kind of alignment does not cause any problems in the following script, run_tdnn_e2eali.sh. And it's interesting that if I set num-nonsil-states=1, the results are still promising. I don't know whether it's the same for regular LF-MMI yet.

@freewym (Contributor, Author) commented May 13, 2020

> It seems that this kind of alignment does not cause any problems in the following script, run_tdnn_e2eali.sh. And it's interesting that if I set num-nonsil-states=1, the results are still promising. I don't know whether it's the same for regular LF-MMI yet.

Yes, running run_tdnn_e2eali.sh may even further improve the metrics. I tried num-nonsil-states=1 for regular LF-MMI, and the GMM model did not seem good enough to generate alignments for LF-MMI. I had planned to try it with E2E LF-MMI but haven't done so yet. Did you find it still achieves comparable performance?

@framsc commented May 14, 2020

@freewym Yes, if I haven't done anything wrong. I changed num-nonsil-states to 1 in run_e2e.sh. However, the num-nonsil-states option does not propagate, so I also changed num-nonsil-states (and the gen_topo.pl 4 1 call) in local/chain/run_e2e_tdnn.sh.

I noticed that you have two LMs in local/chain/run_e2e_tdnn.sh -- fst.txt and phone_lm.txt. Are they equivalent? The paths in fst.txt are clear, but I am a little confused about phone_lm.txt.

@framsc commented May 21, 2020

@freewym Is phone_lm.txt created so that the denominator graph is acyclic?

@freewym (Contributor, Author) commented May 22, 2020

> @freewym Is phone_lm.txt created so that the denominator graph is acyclic?

Yes

@framsc commented May 23, 2020

> @freewym Is phone_lm.txt created so that the denominator graph is acyclic?
>
> Yes

Thanks

pc-seawind pushed a commit to pc-seawind/kaldi that referenced this pull request Jun 4, 2020
@ybNo1 commented Nov 29, 2020

Hi @freewym, thanks for your work! I'm running this script with a keyword-spotting dataset of my own, and I ran into some problems.
I use my own dataset for training and testing.
I found that the accuracy and recall of my trained model when testing single 1 s wake-word/FREETEXT clips were similar to those of the original script on the Mobvoi HotWords dataset (98%). But when I test audio containing a wake word inside longer speech, for example 3 s or 5 s, the recognition result is usually FREETEXT, especially when the wake word appears in the middle or at the end of the speech, no matter how I adjust the weight of the LM.

The original dataset is a variety of near-field speech. I cut out the positive samples and concatenate the negative samples as training data and test data.
There are about 10 wake words, each with about 1000 training clips of roughly 0.8-1.2 s, and the negative samples are about 30 hours of speech without wake words. The dataset is trained with mobvoihotwords/v1/run_e2e.sh with almost no changes.

Is my training data too small? Or is this wake-word recipe just not suitable for keyword spotting in long speech?
After all, I think the gap between positive and negative audio in the Mobvoi dataset is bigger than within continuous speech, while for a pure speech dataset the data distribution is basically the same.

@danpovey (Contributor) commented Nov 29, 2020 via email

@freewym (Contributor, Author) commented Nov 29, 2020

When the wake word appears in the middle or at the end of your test examples, is there silence surrounding the wake word, or does the wake word immediately follow the other words?

Does your training data also contain examples of that condition? If not, I think it's probably because during training the model only sees positive examples in which the wake word appears alone, i.e. surrounded only by silence, so the model is not trained to differentiate the case where the wake word occurs within a sentence.

@ybNo1 commented Nov 29, 2020

> Perhaps the issue is your training and test data mismatch, e.g. all your examples were isolated?


Thanks for your reply! Sorry, I don't understand "isolated".
My training set and 1 s test set are split from the same dataset, and my 3 s or 5 s test set is just a concatenation of negative and positive data from the training set.
What puzzles me is that the model can recognize a short, 1 s wake-word clip, but when I concatenate that exact positive audio with some FREETEXT from the training set, or with just silence, it recognizes it as FREETEXT.
Below is an example:

My test script is like:

#cat online_decoding.sh

wake_word_id=$1
audio=$2

~/software/kaldi/src/online2bin/online2-wav-nnet3-wake-word-decoder-faster \
  --frames-per-chunk=150 --extra-left-context-initial=0 \
  --frame-subsampling-factor=3 \
  --min-active=200 --max-active=7000 \
  --beam=10 \
  --acoustic-scale=1.0 \
  --config=model/conf/online.conf \
  --wake-word-id=$wake_word_id \
  model/final.mdl \
  model/HCLG.fst \
  "ark,t:echo utt1 utt1|" \
  "scp:echo utt1 $audio |" \
  model/words.txt \
  "ark,t:| >/dev/null" \
  "ark,t:-" #2>/dev/null

I constructed 3 test audios using a 1 s wake_word.wav and a 1 s sil.wav (silence only):

sox wake_word.wav test1.wav
sox wake_word.wav sil.wav test2.wav
sox sil.wav wake_word.wav sil.wav test3.wav 

The results and alignments are:

./online_decoding.sh 4 test1.wav
"""
<sil> WAKE_WORD5 <sil> 

utt1 2 44 43 43 43 43 43 43 43 43 43 43 43 46 45 45 45 48 50 49 49 49 2
"""
./online_decoding.sh 4 test2.wav
"""
<sil> WAKE_WORD5 <sil> <sil> ELSE <sil> 

utt1 2 44 43 43 43 43 43 43 46 48 50 2 2 12 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 14 16 18 2
"""
./online_decoding.sh 4 test3.wav
"""
<sil> FREETEXT <sil> 

utt1 2 12 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 14 16 18 2
"""

And the transition-ids:

#show-transitions data/lang/phones.txt exp/chain/ete_tdnn_1a/final.mdl

Transition-state 1: phone = SIL hmm-state = 0 forward-pdf = 0 self-loop-pdf = 1
 Transition-id = 1 p = 0.5 [self-loop]
 Transition-id = 2 p = 0.5 [0 -> 1]
Transition-state 2: phone = WAKE_WORD1 hmm-state = 0 forward-pdf = 2 self-loop-pdf = 3
 Transition-id = 3 p = 0.5 [self-loop]
 Transition-id = 4 p = 0.5 [0 -> 1]
Transition-state 3: phone = WAKE_WORD1 hmm-state = 1 forward-pdf = 4 self-loop-pdf = 5
 Transition-id = 5 p = 0.5 [self-loop]
 Transition-id = 6 p = 0.5 [1 -> 2]
Transition-state 4: phone = WAKE_WORD1 hmm-state = 2 forward-pdf = 6 self-loop-pdf = 7
 Transition-id = 7 p = 0.5 [self-loop]
 Transition-id = 8 p = 0.5 [2 -> 3]
Transition-state 5: phone = WAKE_WORD1 hmm-state = 3 forward-pdf = 8 self-loop-pdf = 9
 Transition-id = 9 p = 0.5 [self-loop]
 Transition-id = 10 p = 0.5 [3 -> 4]
Transition-state 6: phone = FREETEXT hmm-state = 0 forward-pdf = 10 self-loop-pdf = 11
 Transition-id = 11 p = 0.5 [self-loop]
 Transition-id = 12 p = 0.5 [0 -> 1]
Transition-state 7: phone = FREETEXT hmm-state = 1 forward-pdf = 12 self-loop-pdf = 13
 Transition-id = 13 p = 0.5 [self-loop]
 Transition-id = 14 p = 0.5 [1 -> 2]
Transition-state 8: phone = FREETEXT hmm-state = 2 forward-pdf = 14 self-loop-pdf = 15
 Transition-id = 15 p = 0.5 [self-loop]
 Transition-id = 16 p = 0.5 [2 -> 3]
Transition-state 9: phone = FREETEXT hmm-state = 3 forward-pdf = 16 self-loop-pdf = 17
 Transition-id = 17 p = 0.5 [self-loop]
 Transition-id = 18 p = 0.5 [3 -> 4]
Transition-state 10: phone = WAKE_WORD2 hmm-state = 0 forward-pdf = 18 self-loop-pdf = 19
 Transition-id = 19 p = 0.5 [self-loop]
 Transition-id = 20 p = 0.5 [0 -> 1]
Transition-state 11: phone = WAKE_WORD2 hmm-state = 1 forward-pdf = 20 self-loop-pdf = 21
 Transition-id = 21 p = 0.5 [self-loop]
 Transition-id = 22 p = 0.5 [1 -> 2]
Transition-state 12: phone = WAKE_WORD2 hmm-state = 2 forward-pdf = 22 self-loop-pdf = 23
 Transition-id = 23 p = 0.5 [self-loop]
 Transition-id = 24 p = 0.5 [2 -> 3]
Transition-state 13: phone = WAKE_WORD2 hmm-state = 3 forward-pdf = 24 self-loop-pdf = 25
Transition-id = 25 p = 0.5 [self-loop]
 Transition-id = 26 p = 0.5 [3 -> 4]
Transition-state 14: phone = WAKE_WORD3 hmm-state = 0 forward-pdf = 26 self-loop-pdf = 27
 Transition-id = 27 p = 0.5 [self-loop]
 Transition-id = 28 p = 0.5 [0 -> 1]
Transition-state 15: phone = WAKE_WORD3 hmm-state = 1 forward-pdf = 28 self-loop-pdf = 29
 Transition-id = 29 p = 0.5 [self-loop]
 Transition-id = 30 p = 0.5 [1 -> 2]
Transition-state 16: phone = WAKE_WORD3 hmm-state = 2 forward-pdf = 30 self-loop-pdf = 31
 Transition-id = 31 p = 0.5 [self-loop]
 Transition-id = 32 p = 0.5 [2 -> 3]
Transition-state 17: phone = WAKE_WORD3 hmm-state = 3 forward-pdf = 32 self-loop-pdf = 33
 Transition-id = 33 p = 0.5 [self-loop]
 Transition-id = 34 p = 0.5 [3 -> 4]
Transition-state 18: phone = WAKE_WORD4 hmm-state = 0 forward-pdf = 34 self-loop-pdf = 35
 Transition-id = 35 p = 0.5 [self-loop]
 Transition-id = 36 p = 0.5 [0 -> 1]
Transition-state 19: phone = WAKE_WORD4 hmm-state = 1 forward-pdf = 36 self-loop-pdf = 37
 Transition-id = 37 p = 0.5 [self-loop]
 Transition-id = 38 p = 0.5 [1 -> 2]
Transition-state 20: phone = WAKE_WORD4 hmm-state = 2 forward-pdf = 38 self-loop-pdf = 39
 Transition-id = 39 p = 0.5 [self-loop]
 Transition-id = 40 p = 0.5 [2 -> 3]
Transition-state 21: phone = WAKE_WORD4 hmm-state = 3 forward-pdf = 40 self-loop-pdf = 41
 Transition-id = 41 p = 0.5 [self-loop]
 Transition-id = 42 p = 0.5 [3 -> 4]
Transition-state 22: phone = WAKE_WORD5 hmm-state = 0 forward-pdf = 42 self-loop-pdf = 43
 Transition-id = 43 p = 0.5 [self-loop]
 Transition-id = 44 p = 0.5 [0 -> 1]
Transition-state 23: phone = WAKE_WORD5 hmm-state = 1 forward-pdf = 44 self-loop-pdf = 45
 Transition-id = 45 p = 0.5 [self-loop]
 Transition-id = 46 p = 0.5 [1 -> 2]
Transition-state 24: phone = WAKE_WORD5 hmm-state = 2 forward-pdf = 46 self-loop-pdf = 47
 Transition-id = 47 p = 0.5 [self-loop]
 Transition-id = 48 p = 0.5 [2 -> 3]
Transition-state 25: phone = WAKE_WORD5 hmm-state = 3 forward-pdf = 48 self-loop-pdf = 49
 Transition-id = 49 p = 0.5 [self-loop]
 Transition-id = 50 p = 0.5 [3 -> 4]
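(For what it's worth, alignments like the ones above are transition-ids; instead of reading them off show-transitions, they can be mapped to phones or pdf-ids with standard Kaldi tools. A rough sketch, assuming the alignment has been saved to a text archive such as ali.txt:)

# per-frame phone ids; pipe through utils/int2sym.pl -f 2- data/lang/phones.txt to get phone names
ali-to-phones --per-frame=true model/final.mdl ark,t:ali.txt ark,t:-
# per-frame pdf-ids, i.e. the indices of the network outputs
ali-to-pdf model/final.mdl ark,t:ali.txt ark,t:-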

@freewym (Contributor, Author) commented Nov 29, 2020

I wonder if the results would be better if you do the same thing to the training data, i.e. add silence before or after the wake word randomly.

@ybNo1 commented Nov 29, 2020

> I wonder if the results would be better if you do the same thing to the training data, i.e. add silence before or after the wake word randomly.

Thanks!! As you said, the positive audio in the training set is isolated. I'll try adding random speech/noise to the beginning/end of the positive audio, and I'll post the results of the experiment later. Thanks again for replying.

@freewym (Contributor, Author) commented Nov 29, 2020

I suggest only adding random silence for now, as the current recipe assumes the positive training examples contain only the wake word and silence.
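For example, each positive training wav could be padded with a random amount of silence using sox's pad effect; a rough sketch (directory names and durations are made up, and this is not part of the recipe's data preparation):

mkdir -p data_pos_padded
for wav in data_pos/*.wav; do
  # draw 0-1 s of leading and trailing silence; seed awk from $RANDOM so each file differs
  pre=$(awk -v s=$RANDOM 'BEGIN{srand(s); printf "%.2f", rand()}')
  post=$(awk -v s=$RANDOM 'BEGIN{srand(s); printf "%.2f", rand()}')
  # pad inserts the given lengths of silence at the start and end of the file
  sox "$wav" "data_pos_padded/$(basename "$wav")" pad "$pre" "$post"
done

The wav.scp (and any utt2dur/segments files) would then need to point at the padded wavs.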

@ybNo1 commented Nov 29, 2020

> I suggest only adding random silence for now, as the current recipe assumes the positive training examples contain only the wake word and silence.

Actually, I just used silence audio as a test case; other test results using non-wake-word speech are similar.

@freewym (Contributor, Author) commented Nov 29, 2020

You can first check whether adding silence to the training data helps on the test examples with added silence. Then we can at least confirm our conjecture.

@ybNo1 commented Nov 29, 2020

> You can first check whether adding silence to the training data helps on the test examples with added silence. Then we can at least confirm our conjecture.

OK, I'll try only adding silence.

@ybNo1 commented Nov 30, 2020

> You can first check whether adding silence to the training data helps on the test examples with added silence. Then we can at least confirm our conjecture.

Adding silence to the training data works for me; now the trained model recognizes correctly when silence is appended before/after the wake word.
Besides, I then added random FREETEXT to the positive data, and the model can achieve about 50% at the wake-word EER, better than using the original data, which gives only about 10% recall. I guess I should try different ways of adding random FREETEXT.
Thanks for your advice! And sorry about asking questions in a PR, I didn't mean to. Thanks again!

@freewym (Contributor, Author) commented Nov 30, 2020

If you add FREETEXT to the positive examples, you also need to modify their corresponding text file to reflect the change in the audio. The phone LM topology may also need to be modified to include paths like freetext -> wake-word -> freetext.
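Purely as an illustration of what such a path means, here is a made-up fragment in OpenFst text format (src dst ilabel olabel, one arc per line, final state on its own line); the actual phone_lm.txt in the recipe may use different symbols, weights, and structure:

0 1 SIL SIL
1 2 WORD WORD
1 2 FREETEXT FREETEXT
1 3 FREETEXT FREETEXT
3 4 WORD WORD
4 2 FREETEXT FREETEXT
2 5 SIL SIL
5

Here the path 0-1-2-5 covers the existing <sil> WORD <sil> and <sil> FREETEXT <sil> cases, while 0-1-3-4-2-5 adds a freetext -> wake-word -> freetext path, and the graph remains acyclic.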

@ybNo1 commented Dec 1, 2020

> If you add FREETEXT to the positive examples, you also need to modify their corresponding text file to reflect the change in the audio. The phone LM topology may also need to be modified to include paths like freetext -> wake-word -> freetext.

If I add FREETEXT to the positive data, the text format should be like "utt_id FREETEXT WORD FREETEXT", is that right?
So when adding 12% sil to the data, why don't we change the text to "utt_id <sil> WORD <sil>"?

@freewym (Contributor, Author) commented Dec 1, 2020

If FREETEXT is added only to the beginning, for example, it is better to change the text to "FREETEXT WORD".

Re adding silence: it is not necessary to change the text, as silence is already treated as "optional silence" before and after each word when specifying the lexicon.
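Concretely, assuming the standard Kaldi data/<set>/text format (utterance id followed by the transcript; the symbols here are placeholders for whatever the recipe actually uses):

original positive example:         utt001 WORD
FREETEXT prepended to the audio:   utt001 FREETEXT WORD
silence prepended to the audio:    utt001 WORD        (text unchanged; covered by optional silence)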

@amurawite commented:
Hi @freewym, thanks for your work. I ran into the same problem as ybNo1, so I tried adding a random 0-1.5 s of silence at the start and end of the keyword audio and used this data to train a model. But this time I ran into another problem: I created an eval set in which 1 s of silence is added at the start and end of every audio clip, and on this added-silence eval set the model gets 100% recall but 622 false alarms per hour.

@freewym (Contributor, Author) commented Dec 11, 2020

I am not sure whether the added silence has artifacts that made the network learn the wrong signals. Maybe you can also try adding silence to the negative training examples.

@amurawite commented:
> I am not sure whether the added silence has artifacts that made the network learn the wrong signals. Maybe you can also try adding silence to the negative training examples.

Thanks for your reply. I will try the method you suggested.
