Accuracy difference #3
Comments
@kbramhendra Thanks for the report. I've passed this on to our team internally.
@kbramhendra what value have you set for "max_expand"? We have it set to 10 here:
I have noticed that disabling this (setting it to 0) does improve WER. However, I am not certain whether it is your issue, based on what you have told me. The option exists to guard against an explosion in the state space of a depth-first search in the "word alignment" algorithm, which can happen in rare circumstances. Word alignment isn't strictly necessary, so I could consider disabling it, but I am still looking into what the "right" solution is here.
@kbramhendra can you check if the branch in #4 fixes your issue? I believe it should, based on my own internal testing. Make sure not to set 3aaa0e1#diff-c80f4904c78bc561ce1235944f91d4847817445e810c4d7a0064453503e0c7f3L160

Basically, word alignment would sometimes fail to complete when the max_expand option was set. The returned lattice would then be missing paths that were present in the input. Sometimes these missing paths would be the best-cost paths, and sometimes not even a single path would be complete by the time the max_expand limit had taken effect. This explains why the errors (at least for me) were only deletions and substitutions, not insertions.
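To make the failure mode above concrete, here is a minimal toy sketch in Python. It is not the Kaldi or Riva word alignment code: the lattice, the `enumerate_paths` function, and the way the budget is counted are all hypothetical, and only the name `max_expand` and the "0 disables it" convention come from this thread. It shows how a cap on state expansions in a depth-first search can drop the best-cost path, or every path.

```python
from typing import Dict, List, Tuple

# Toy lattice (hypothetical): state -> list of (next_state, label, cost).
Arcs = Dict[int, List[Tuple[int, str, float]]]

ARCS: Arcs = {
    0: [(1, "a", 1.0), (2, "b", 5.0)],
    1: [(3, "c", 1.0)],
    2: [(3, "d", 1.0)],
}


def enumerate_paths(
    arcs: Arcs, start: int, final: int, max_expand: int = 0
) -> List[Tuple[float, List[str]]]:
    """Depth-first enumeration of complete paths from start to final.

    max_expand <= 0 means "no limit" (assumed convention). Otherwise the
    search stops once that many states have been expanded, and any
    partial paths still on the stack are silently lost.
    """
    paths: List[Tuple[float, List[str]]] = []
    expanded = 0
    stack: List[Tuple[int, List[str], float]] = [(start, [], 0.0)]
    while stack:
        state, labels, cost = stack.pop()
        if state == final:
            paths.append((cost, labels))
            continue
        if max_expand > 0 and expanded >= max_expand:
            break  # budget exhausted: unfinished paths are dropped
        expanded += 1
        for nxt, label, arc_cost in arcs.get(state, []):
            stack.append((nxt, labels + [label], cost + arc_cost))
    return sorted(paths)


full = enumerate_paths(ARCS, start=0, final=3)                   # both paths
capped = enumerate_paths(ARCS, start=0, final=3, max_expand=2)   # best path lost
empty = enumerate_paths(ARCS, start=0, final=3, max_expand=1)    # no path at all
```

In this toy, the capped search keeps only the more expensive path, and with a tighter cap it returns nothing, which mirrors the two symptoms described above.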
@galv Thank you very much for taking the time to help with this. Yes, it did help, and the deletions are greatly reduced now. Only a ~1.5% difference remains between k2 and Riva (in deletions only). I will check whether I have missed anything on my side. I highly appreciate your help; I was stuck on this for some time. It's a great help.
* Remove unused variable.
* cudadecoder: Make word alignment optional.

For CTC models using word pieces or graphemes, there is not enough positional information to use the word alignment. I tried marking every unit as "singleton" in word_boundary.txt, but this explodes the state space very, very often. See: nvidia-riva/riva-asrlib-decoder#3

With the "_" character in CTC models predicting word pieces, we at the very least know which word pieces begin a word and which ones are either in the middle of the word or the end of a word, but the algorithm would still need to be rewritten, especially since "blank" is not a silence phoneme (it can appear between).

I did look into using the lexicon-based word alignment. I don't have a specific complaint about it, but I did get a weird error where it couldn't create a final state at all in the output lattice, which caused Connect() to output an empty lattice. This may be because I wasn't quite sure how to handle the blank token. I treat it as its own phoneme, because of limitations in TransitionInformation, but this doesn't really make any sense.

Needless to say, while the CTM outputs of the cuda decoder will be correct from a WER point of view, their time stamps won't be correct, but they probably never were in the first place for CTC models.
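As a rough illustration of the point about the "_" marker and the blank token, the Python sketch below collapses a frame-level CTC label sequence and regroups word pieces into words. The blank symbol `<blk>`, the function names, and the example frames are assumptions made for illustration; they are not taken from the decoder's code.

```python
from typing import List


def ctc_collapse(frames: List[str], blank: str = "<blk>") -> List[str]:
    """Merge repeated labels, then drop blanks (standard CTC collapsing)."""
    out: List[str] = []
    prev = None
    for label in frames:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def group_into_words(pieces: List[str], marker: str = "_") -> List[str]:
    """Start a new word at each piece carrying the word-start marker."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith(marker) or not words:
            # Strip the marker if present; a leading unmarked piece
            # still has to start the first word.
            words.append(piece[len(marker):] if piece.startswith(marker) else piece)
        else:
            words[-1] += piece
    return words


# Hypothetical frame labels: note the blank *inside* the word "hello",
# which is why blank cannot be treated like a silence phoneme.
frames = ["_hel", "_hel", "<blk>", "lo", "<blk>", "<blk>", "_wor", "ld", "ld"]
pieces = ctc_collapse(frames)
words = group_into_words(pieces)
```

The marker tells us where a word begins, but nothing marks where it ends in time, so word-level time stamps still cannot be recovered from this information alone.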
@kbramhendra Hi, I am also trying to use this decoder. How can I use it for CTC + TLG?
Hi,
I am trying to use the Riva decoder as a substitute for the k2 decoder. With the Riva decoder I am not getting the same accuracy as with the k2 decoder: there are many more deletions, at least 12%. I have experimented with different hyperparameters such as acoustic_scale and max_active_states, but the results do not change much. I have also tried different topologies (eesen, compact), and it is the same with all of them. Can you please help with this?
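For anyone comparing two decoders this way, a small Python sketch of how WER can be broken down into substitutions, deletions, and insertions may help. It uses a plain Levenshtein alignment; the function name `error_counts` and the example sentences are hypothetical, and standard scoring tools may break ties between equally good alignments differently.

```python
from typing import List, Tuple


def error_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one minimum-cost
    edit alignment of hyp against ref."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(
                dist[i - 1][j - 1] + mismatch,  # match or substitution
                dist[i - 1][j] + 1,             # deletion (ref word missing)
                dist[i][j - 1] + 1,             # insertion (extra hyp word)
            )
    # Backtrace from the end to count each error type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(ref[i - 1] != hyp[j - 1])
            if dist[i][j] == dist[i - 1][j - 1] + mismatch:
                subs += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


ref = "the cat sat on the mat".split()
hyp = "the cat on a mat".split()
subs, dels, ins = error_counts(ref, hyp)
wer = (subs + dels + ins) / len(ref)
```

Running the same counts on the k2 and Riva hypotheses for the same references shows whether the gap is concentrated in deletions, as described here.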