Setting length_penalty to a negative value is helpful for CTC models,
since they are often biased toward shorter paths through the WFST
graph (shorter paths generally have smaller total costs).
However, a side effect of using length_penalty this way is that a
phrase like "no one cares" can be decoded as "no one caress", because
"caress" has a longer WFST path than "cares".
Applying the penalty only when olabel != 0 (i.e., only on non-epsilon
output labels) works around this issue while still preserving some of
the benefit of length_penalty.
Note that word_length_penalty is applied in both the emitting and
non-emitting ExpandArcs, while length_penalty is applied only in the
emitting ExpandArcs. I believe this is the correct behavior.
Here are some experiments from running the test test_sub_ins_del:
model is stt_en_conformer_ctc_small
dataset is test-clean
For vanilla "ctc" topology, the best length_penalty was -5.0. The WER was:
wer=0.04530584297017651, ins=369, sub=1650, del=363
For vanilla "ctc" topology, the best word_length_penalty was -10.0. The WER was:
wer=0.045058581862446746, ins=375, sub=1608, del=386
For compact "ctc" topology, the best length_penalty was -9.5. The WER was:
wer=0.045058581862446746, ins=375, sub=1608, del=386
For compact "ctc" topology, the best word_length_penalty was -10.0. The WER was:
wer=0.04309951308581862, ins=302, sub=1572, del=392
The best result comes from using the compact CTC topology with word_length_penalty=-10.0.
It makes sense that a more negative length penalty is required to
minimize WER for the compact CTC topology: it has fewer self-loops, so
its paths are shorter to begin with.
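For reference, the gap between the vanilla-topology length_penalty
baseline and the best configuration above works out to roughly a 4.9%
relative WER reduction:

```python
# Sanity arithmetic on the numbers reported above.
baseline = 0.04530584297017651  # vanilla topology, length_penalty=-5.0
best = 0.04309951308581862      # compact topology, word_length_penalty=-10.0
relative_reduction = 100.0 * (baseline - best) / baseline
print(f"relative WER reduction: {relative_reduction:.2f}%")  # ~4.87%
```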
Insertion, Substitution, and Deletion statistics were obtained by
applying this diff:
modified src/riva/asrlib/decoder/test_graph_construction.py
@@ -963,6 +963,8 @@ class TestGraphConstruction:
references = [s.lower() for s in references]
# Might want to try a different WER implementation, for sanity.
my_wer = wer(references, predictions)
+ wer_ratio, ins, sub, deletions = my_wer
+ print(f"GALVEZ:wer={wer_ratio}, ins={ins}, sub={sub}, del={deletions}")
other_wer = word_error_rate(references, predictions)
print("beam search WER:", my_wer)
print("other beam search WER:", other_wer)