Accuracy difference #3

kbramhendra · 2022-10-11T06:38:16Z

Hi,
I am trying to use the Riva decoder as a substitute for the k2 decoder, but I am not getting the same accuracy. There are many more deletions, at least 12%. I experimented with different hyperparameters such as acoustic_scale and max_active_states, but the results do not change much. I also tried different topologies (eesen, compact), and the outcome is the same for all of them. Could you please help with this?

messiaen · 2022-10-14T17:12:28Z

@kbramhendra Thanks for the report. I've passed this on to our team internally.

galv · 2022-10-19T16:09:48Z

@kbramhendra what value have you set for "max_expand"?

We have it set to 10 here:

riva-asrlib-decoder/src/riva/asrlib/decoder/test_graph_construction.py

Line 170 in 77f0876

config.online_opts.lattice_postprocessor_opts.max_expand = 10

I have noticed that disabling this (setting it to 0) does improve WER. However, I am not certain whether this is your issue, based on what you have told me.

This option exists to guard against an explosion in the state space of the depth-first search in the word alignment algorithm, which can happen in rare circumstances.

Now, word alignment isn't necessary, strictly speaking, so I could consider disabling it, but I am still looking into what the "right" solution is here.

For CTC models using word pieces or graphemes, there is not enough positional information to use word alignment. I tried marking every unit as "singleton" in word_boundary.txt, but this explodes the state space very, very often. See: nvidia-riva/riva-asrlib-decoder#3

With the "_" character in CTC models predicting word pieces, we at the very least know which word pieces begin a word and which ones are in the middle or at the end of a word, but the algorithm would still need to be rewritten, especially since "blank" is not a silence phoneme (it can appear between any two units).

I did look into using the lexicon-based word alignment. I don't have a specific complaint about it, but I did get a weird error where it couldn't create a final state at all in the output lattice, which caused Connect() to output an empty lattice. This may be because I wasn't quite sure how to handle the blank token. I treat it as its own phoneme because of limitations in TransitionInformation, but this doesn't really make any sense.

Needless to say, while the CTM outputs of the cuda decoder will be correct from a WER point of view, their time stamps won't be correct, but they probably never were in the first place for CTC models.

See kaldi-asr/kaldi#4802
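To make the point about "_" and blank concrete, here is a minimal sketch of how word pieces map back to words after CTC collapsing. It is only an illustration, not the decoder's code: the blank symbol name `<blk>`, the example pieces, and both function names are assumptions made up for this example.

```python
from itertools import groupby

BLANK = "<blk>"  # hypothetical name for the CTC blank symbol


def ctc_collapse(frames):
    """Merge consecutive repeated symbols, then drop blanks."""
    return [tok for tok, _ in groupby(frames) if tok != BLANK]


def pieces_to_words(pieces, marker="_"):
    """Group word pieces into words; a piece starting with `marker` begins a new word."""
    words = []
    for piece in pieces:
        if piece.startswith(marker):
            words.append(piece[len(marker):])
        elif words:
            words[-1] += piece
        else:
            # First piece has no marker: treat it as the start of a word anyway.
            words.append(piece)
    return words


# Note the blank between "_he" and "llo": it sits inside a word,
# so it cannot be treated like a silence between words.
frames = ["_he", "_he", BLANK, "llo", BLANK, BLANK, "_wor", "ld", "ld"]
pieces = ctc_collapse(frames)
words = pieces_to_words(pieces)
print(pieces)
print(words)
```

The marker tells us where a word starts, but nothing in the frame sequence says where a word ends in time, which is why the time stamps are not reliable.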

galv · 2022-10-19T23:20:38Z

@kbramhendra can you check if the branch in #4 fixes your issue? I believe it should based on my own internal testing.

To disable word alignment, leave config.online_opts.lattice_postprocessor_opts.word_boundary_rxfilename set to the empty string; do not set it to anything else. See here:

3aaa0e1#diff-c80f4904c78bc561ce1235944f91d4847817445e810c4d7a0064453503e0c7f3L160
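A minimal sketch of the two settings mentioned in this thread. The attribute paths are the ones quoted above; the helper name `disable_word_alignment` and the `SimpleNamespace` stand-in are hypothetical and exist only so the snippet runs without the decoder installed. With the real library you would pass the decoder's own config object instead.

```python
from types import SimpleNamespace


def disable_word_alignment(config):
    """Apply the settings discussed in this thread to a decoder config object."""
    opts = config.online_opts.lattice_postprocessor_opts
    # Empty string means no word boundary file, so word alignment is skipped.
    opts.word_boundary_rxfilename = ""
    # 0 disables the expansion limit, as described earlier in the thread.
    opts.max_expand = 0
    return config


# Stand-in object with the same attribute layout (an assumption for illustration).
config = SimpleNamespace(
    online_opts=SimpleNamespace(
        lattice_postprocessor_opts=SimpleNamespace(
            max_expand=10,
            word_boundary_rxfilename="word_boundary.txt",
        )
    )
)

disable_word_alignment(config)
print(config.online_opts.lattice_postprocessor_opts)
```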

Basically, word alignment would sometimes fail to complete when the max_expand option was set. The returned lattice would then be missing paths that were in the input lattice. Sometimes these missing paths would be the best-cost paths, and sometimes not even a single path would be complete by the time the max_expand limit took effect. This explains why the errors (at least for me) were only deletions and substitutions, not insertions.
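The effect can be reproduced on a toy example. The sketch below is not Kaldi's word alignment algorithm; it is a plain depth-first path enumeration with an expansion budget, and the lattice, costs, and function name are made up for illustration. It shows how cutting the search short can drop the best path (leaving a shorter hypothesis, i.e. a deletion) or leave no complete path at all.

```python
def complete_paths(arcs, start, final, max_expand=0):
    """Enumerate start->final paths depth-first.

    Stops after `max_expand` arc expansions; 0 means no limit.
    Returns a list of (total_cost, words) for the paths completed so far.
    """
    paths = []
    expanded = 0
    stack = [(start, [], 0.0)]
    while stack:
        state, words, cost = stack.pop()
        if state == final:
            paths.append((cost, words))
            continue
        for next_state, word, arc_cost in arcs.get(state, []):
            if max_expand and expanded >= max_expand:
                return paths  # budget exhausted: unfinished paths are lost
            expanded += 1
            stack.append((next_state, words + [word], cost + arc_cost))
    return paths


# Toy lattice: state -> [(next_state, word, cost)].
# The cheap path is "the cat sat"; an expensive arc skips "the".
ARCS = {
    0: [(1, "the", 1.0), (2, "cat", 4.0)],
    1: [(2, "cat", 1.0)],
    2: [(3, "sat", 1.0)],
}

print(min(complete_paths(ARCS, 0, 3)))       # full search keeps the best path
print(complete_paths(ARCS, 0, 3, max_expand=3))  # only the worse, shorter path survives
print(complete_paths(ARCS, 0, 3, max_expand=2))  # no complete path at all
```

With the budget in place, the surviving hypothesis never gains words that were not in the lattice; it can only lose or swap them, which matches seeing deletions and substitutions but no insertions.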

kbramhendra · 2022-10-20T05:12:54Z

@galv Thank you very much for taking the time to help with this. Yes, it did help, and the deletions are greatly reduced now. Only a ~1.5% difference remains between k2 and Riva (in deletions only). I will check whether I missed anything on my side. I highly appreciate your help; I was stuck on this for some time.

jinggaizi · 2022-12-26T09:17:31Z

@kbramhendra Hi, I am also trying to use this decoder tool. How can I use it for CTC + TLG?

galv mentioned this issue Oct 19, 2022

Make word alignment optional kaldi-asr/kaldi#4802

Merged

galv added a commit that referenced this issue Oct 19, 2022

Fix #3

3aaa0e1

See kaldi-asr/kaldi#4802

galv added a commit that referenced this issue Oct 20, 2022

Fix #3

304b54a

See kaldi-asr/kaldi#4802

galv closed this as completed in 06dec3f Oct 21, 2022
