Add word_length_penalty option. #40

Open
wants to merge 1 commit into main
Conversation

@galv (Collaborator) commented Jun 12, 2024

Setting length_penalty to a negative value is helpful for CTC models, since they are often biased towards shorter paths through the WFST graph (shorter paths generally accumulate less cost).
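
For intuition, here is a toy numerical example (the costs are made up, purely illustrative, not taken from the decoder) of how adding length_penalty to every emitting arc shifts the winning hypothesis from a shorter path to a longer one:

```python
# Toy example: hypothetical per-arc costs, purely illustrative.
length_penalty = -5.0  # negative value rewards longer paths

short_path_costs = [2.0, 2.0, 2.0]        # 3 emitting arcs
long_path_costs = [1.8, 1.8, 1.8, 1.8]    # 4 emitting arcs

def total_cost(arc_costs, penalty):
    # The penalty is added once per emitting arc, so a longer path
    # receives a larger total adjustment.
    return sum(cost + penalty for cost in arc_costs)

print(total_cost(short_path_costs, 0.0))             # 6.0  (wins without a penalty)
print(total_cost(long_path_costs, 0.0))              # 7.2
print(total_cost(short_path_costs, length_penalty))  # -9.0
print(total_cost(long_path_costs, length_penalty))   # -12.8 (wins with the penalty)
```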

However, a side effect of using length_penalty this way is that a phrase like "no one cares" can come out as "no one caress", because "caress" corresponds to a longer WFST path than "cares".

Applying the penalty only when olabel != 0 (epsilon) can help work around this issue, while still preserving some of the benefits from length_penalty.

Note that this word_length_penalty is applied in both the emitting and non-emitting ExpandArcs, while length_penalty is applied only in the emitting ExpandArcs. I believe this is the correct behavior.
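
Roughly, the two penalties enter the per-arc cost like this (a Python sketch under assumed names, not the actual CUDA decoder code):

```python
EPSILON = 0  # epsilon output label

def expanded_arc_cost(arc, acoustic_cost, emitting,
                      length_penalty=0.0, word_length_penalty=0.0):
    # Sketch only: `arc.weight` and `arc.olabel` are assumed field names.
    cost = arc.weight
    if emitting:
        # length_penalty is charged once per emitting arc (i.e. per frame).
        cost += acoustic_cost + length_penalty
    if arc.olabel != EPSILON:
        # word_length_penalty is charged whenever a word label is emitted,
        # in both the emitting and non-emitting ExpandArcs passes.
        cost += word_length_penalty
    return cost
```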

Here are some experiments from running the test test_sub_ins_del:

Model: stt_en_conformer_ctc_small
Dataset: test-clean

| Topology | Penalty tuned | Best value | WER | ins | sub | del |
|---|---|---|---|---|---|---|
| vanilla "ctc" | length_penalty | -5.0 | 0.04530584297017651 | 369 | 1650 | 363 |
| vanilla "ctc" | word_length_penalty | -10.0 | 0.045058581862446746 | 375 | 1608 | 386 |
| compact "ctc" | length_penalty | -9.5 | 0.045058581862446746 | 375 | 1608 | 386 |
| compact "ctc" | word_length_penalty | -10.0 | 0.04309951308581862 | 302 | 1572 | 392 |

The best result comes from using the compact CTC topology with word_length_penalty=-10.0.

It makes sense that a more negative length penalty is required to minimize WER for the compact CTC topology, since it has fewer self-loops.
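
These best values were found by sweeping the penalty. A generic sweep helper might look like the following (a sketch that takes the decode and WER functions as arguments rather than assuming the repository's API):

```python
def sweep_word_length_penalty(decode_fn, wer_fn, references, candidates):
    """Return (wer, penalty, ins, sub, del) for the best penalty in `candidates`.

    `decode_fn(penalty)` and `wer_fn(references, predictions)` are supplied by
    the caller; this sketch does not assume anything about the repository's API.
    """
    best = None
    for penalty in candidates:
        predictions = decode_fn(penalty)
        wer_ratio, ins, sub, deletions = wer_fn(references, predictions)
        if best is None or wer_ratio < best[0]:
            best = (wer_ratio, penalty, ins, sub, deletions)
    return best
```

Called with candidates such as `[-0.5 * i for i in range(41)]`, it would cover 0.0 down to -20.0 in 0.5 steps, which is consistent with the granularity of the best values reported above.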

Insertion, Substitution, and Deletion statistics were obtained by applying this diff:

```diff
modified   src/riva/asrlib/decoder/test_graph_construction.py
@@ -963,6 +963,8 @@ class TestGraphConstruction:
         references = [s.lower() for s in references]
         # Might want to try a different WER implementation, for sanity.
         my_wer = wer(references, predictions)
+        wer_ratio, ins, sub, deletions = my_wer
+        print(f"GALVEZ:wer={wer_ratio}, ins={ins}, sub={sub}, del={deletions}")
         other_wer = word_error_rate(references, predictions)
         print("beam search WER:", my_wer)
         print("other beam search WER:", other_wer)
```
