Timestamp precision issue in large model #139
-
*update* It seems the problem is due to selecting only the largest logit to determine the timestamp. Although this is part of both training and inference, it sacrifices precision in practical applications because it relies on the model's top prediction always being correct. So in theory, to increase timestamp precision, you could apply heuristics to filter and select what is logically the best of the top predictions, instead of choosing only the top one (which is what it does right now).
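As an illustration, here is a minimal sketch of what such a heuristic could look like. The function, the monotonicity rule, and the top-k cutoff are assumptions made for the example, not something present in the Whisper codebase:

```python
import torch

def pick_timestamp_token(logits: torch.Tensor, timestamp_begin: int,
                         prev_timestamp_token: int, top_k: int = 5) -> int:
    """Hypothetical heuristic: instead of always taking the single largest timestamp
    logit, inspect the top-k timestamp candidates and keep the most confident one
    that does not move backwards relative to the previously emitted timestamp."""
    timestamp_logits = logits[timestamp_begin:]       # scores over timestamp tokens only
    _, offsets = timestamp_logits.topk(top_k)         # top-k timestamp candidates
    for offset in offsets.tolist():
        token = timestamp_begin + offset
        if token >= prev_timestamp_token:             # enforce non-decreasing timestamps
            return token
    return timestamp_begin + offsets[0].item()        # fall back to the plain argmax
```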
-
Another caveat is that the prediction is often biased toward integer timestamps, as you can see in the first example, where the timestamps all fall on whole seconds.
-
It looks like there might be a bug in the timestamp filtering rule. For each token, we decide whether it is a text token or a timestamp token, and this is implemented by the following rule (lines 433 to 437 in 62fe7f1): if the sum of the scores over all possible timestamp positions is greater than the largest text-token score, the token is assumed to be a timestamp. In the case of a timestamp, all text-token scores are zeroed out by setting their logits to `-inf`.

In beam search, the 5 most confident tokens are explored, and if the largest score corresponds to a text token, the timestamp scores are left unchanged (lines 308 to 312 in 62fe7f1). Decoding can then jump between text and timestamp tokens more or less at random, because the second-largest score among the text tokens can be lower than the largest score among the timestamps. This leads to timestamps being inserted in random places, or to inaccurate timestamps.

However, it can be easily fixed by also setting all timestamp logits to `-inf` in the opposite case:

```python
if timestamp_logprob > max_text_token_logprob:
    logits[k, : self.tokenizer.timestamp_begin] = -np.inf
else:
    logits[k, self.tokenizer.timestamp_begin :] = -np.inf
```

This small change results in much more predictable timestamp placement, mostly at the beginning or the end of a phrase rather than in random places, and the timestamps also seem to become quite accurate. The possible drawback is that the model produces much larger segments. Here are some examples.

**Base model**
Before / After: (example transcripts omitted)

**Tiny model**
Before / After: (example transcripts omitted)

The tiny model tends to produce very long text segments, almost paragraph-like.
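For context, here is a self-contained sketch of the filtering rule with the proposed `else` branch in place. It paraphrases the logic of `ApplyTimestampRules` in `whisper/decoding.py` around the lines referenced above, with variable names and surrounding details simplified:

```python
import numpy as np
import torch
import torch.nn.functional as F

def apply_timestamp_rule(logits: torch.Tensor, timestamp_begin: int) -> torch.Tensor:
    """For each beam, force the next token to be drawn either from the text tokens
    or from the timestamp tokens, never mixed across the two ranges."""
    logprobs = F.log_softmax(logits.float(), dim=-1)
    for k in range(logits.shape[0]):
        # probability mass of all timestamp tokens vs. the single best text token
        timestamp_logprob = logprobs[k, timestamp_begin:].logsumexp(dim=-1)
        max_text_token_logprob = logprobs[k, :timestamp_begin].max()
        if timestamp_logprob > max_text_token_logprob:
            logits[k, :timestamp_begin] = -np.inf      # sample a timestamp token
        else:
            logits[k, timestamp_begin:] = -np.inf      # proposed fix: sample a text token
    return logits
```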
-
@ksofiyuk could you somehow improve the script so that it is forced to generate output with punctuation?
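One workaround that is sometimes suggested (not a confirmed fix) is to seed decoding with a punctuated prompt, assuming a version of openai-whisper whose `transcribe()` accepts `initial_prompt`; the file path and prompt text below are placeholders:

```python
import whisper

model = whisper.load_model("large")

# Hypothetical workaround: a punctuated initial prompt can bias the decoder toward
# producing punctuation. How well this works depends on the language and model.
result = model.transcribe(
    "audio.mp3",
    initial_prompt="Hello, welcome. This prompt is fully punctuated, with commas and periods.",
)
print(result["text"])
```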
-
Would you mind showing how to generate phrase-level timestamps? I couldn't find any example. Thank you.
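For reference, a minimal sketch of reading segment-level (roughly phrase-level) timestamps from the Python API; the model size and audio path are placeholders:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Each segment carries start/end times in seconds plus the decoded text.
for segment in result["segments"]:
    print(f"{segment['start']:7.2f} --> {segment['end']:7.2f}  {segment['text'].strip()}")
```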
-
I transcribed the same audio file with the medium and large models and found that the timestamps from the large model are always rounded to whole seconds, while the medium model's timestamps are correct. The end times from the large model are also incorrect.
Could you fix it? Thanks.
The total audio file lasts 20:32.28.
[large model, incorrect timestamp precision]
20:26.000 --> 20:28.000
クソもう30分も遅れてる
20:29.000 --> 20:30.000
ブラインドデートに
20:30.000 --> 20:35.000
あんな子が来てくれたらいいのにな
[medium model, correct timestamp precision]
20:26.280 --> 20:29.480
ああくそもう30分も遅れてる
20:29.480 --> 20:32.280
ブラインドデートにあんな子が来てくれたらいいのにな