Timestamp precision issue in large model #139
-
*update* It seems the problem is due to selecting only the largest logit to determine the timestamp. Although this is part of both training and inference, it sacrifices precision in practical applications because it relies on the model's top prediction always being correct. So in theory, to increase timestamp precision, you could apply heuristics to filter and select what is logically the best of the top predictions, instead of choosing only the top one (which is what it does right now).
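As an illustration, here is a minimal sketch of what such a heuristic could look like. The function, the monotonicity rule, and the top-k cutoff are assumptions made for the example, not something present in the Whisper codebase:

```python
import torch

def pick_timestamp_token(logits: torch.Tensor, timestamp_begin: int,
                         prev_timestamp_token: int, top_k: int = 5) -> int:
    """Hypothetical heuristic: instead of always taking the single largest timestamp
    logit, inspect the top-k timestamp candidates and keep the most confident one
    that does not move backwards relative to the previously emitted timestamp."""
    timestamp_logits = logits[timestamp_begin:]       # scores over timestamp tokens only
    _, offsets = timestamp_logits.topk(top_k)         # top-k timestamp candidates
    for offset in offsets.tolist():
        token = timestamp_begin + offset
        if token >= prev_timestamp_token:             # enforce non-decreasing timestamps
            return token
    return timestamp_begin + offsets[0].item()        # fall back to the plain argmax
```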
-
Another caveat is that the prediction is often biased toward integer timestamps, as you can see in the first example, where the timestamps all fall on whole seconds.
-
It looks like there might be a bug in the timestamp filtering rule. For each token, we decide whether it is a text token or a timestamp token, and this is implemented by the following rule (lines 433 to 437 in 62fe7f1): if the sum of the scores over all possible timestamp positions is greater than the largest text-token score, the token is assumed to be a timestamp. In the case of a timestamp, all text-token scores are zeroed out by setting their logits to `-inf`.

In beam search, the 5 most confident tokens are explored, and if the largest score corresponds to a text token, the timestamp scores are left unchanged (lines 308 to 312 in 62fe7f1). Decoding can then jump between text and timestamp tokens more or less at random, because the second-largest score among the text tokens can be lower than the largest score among the timestamps. This leads to timestamps being inserted in random places, or to inaccurate timestamps.

However, it can be easily fixed by also setting all timestamp logits to `-inf` in the opposite case:

```python
if timestamp_logprob > max_text_token_logprob:
    logits[k, : self.tokenizer.timestamp_begin] = -np.inf
else:
    logits[k, self.tokenizer.timestamp_begin :] = -np.inf
```

This small change results in much more predictable timestamp placement, mostly at the beginning or the end of a phrase rather than in random places, and the timestamps also seem to become quite accurate. The possible drawback is that the model produces much larger segments. Here are some examples.

**Base model**
Before / After: (example transcripts omitted)

**Tiny model**
Before / After: (example transcripts omitted)

The tiny model tends to produce very long text segments, almost paragraph-like.
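For context, here is a self-contained sketch of the filtering rule with the proposed `else` branch in place. It paraphrases the logic of `ApplyTimestampRules` in `whisper/decoding.py` around the lines referenced above, with variable names and surrounding details simplified:

```python
import numpy as np
import torch
import torch.nn.functional as F

def apply_timestamp_rule(logits: torch.Tensor, timestamp_begin: int) -> torch.Tensor:
    """For each beam, force the next token to be drawn either from the text tokens
    or from the timestamp tokens, never mixed across the two ranges."""
    logprobs = F.log_softmax(logits.float(), dim=-1)
    for k in range(logits.shape[0]):
        # probability mass of all timestamp tokens vs. the single best text token
        timestamp_logprob = logprobs[k, timestamp_begin:].logsumexp(dim=-1)
        max_text_token_logprob = logprobs[k, :timestamp_begin].max()
        if timestamp_logprob > max_text_token_logprob:
            logits[k, :timestamp_begin] = -np.inf      # sample a timestamp token
        else:
            logits[k, timestamp_begin:] = -np.inf      # proposed fix: sample a text token
    return logits
```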
-
@ksofiyuk could you somehow improve the script so that it is forced to generate output with punctuation?
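One workaround that is sometimes suggested (not a confirmed fix) is to seed decoding with a punctuated prompt, assuming a version of openai-whisper whose `transcribe()` accepts `initial_prompt`; the file path and prompt text below are placeholders:

```python
import whisper

model = whisper.load_model("large")

# Hypothetical workaround: a punctuated initial prompt can bias the decoder toward
# producing punctuation. How well this works depends on the language and model.
result = model.transcribe(
    "audio.mp3",
    initial_prompt="Hello, welcome. This prompt is fully punctuated, with commas and periods.",
)
print(result["text"])
```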
-
Would you mind showing how to generate phrase-level timestamps? I couldn't find any example. Thank you.
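For reference, a minimal sketch of reading segment-level (roughly phrase-level) timestamps from the Python API; the model size and audio path are placeholders:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")

# Each segment carries start/end times in seconds plus the decoded text.
for segment in result["segments"]:
    print(f"{segment['start']:7.2f} --> {segment['end']:7.2f}  {segment['text'].strip()}")
```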
-
I transcribed the same audio file with the medium and large models and found that the timestamps from the large model are always rounded to whole seconds, while the medium model's timestamps are correct. The end times from the large model are also incorrect.
Could you fix it? Thanks.
The total audio file lasts 20:32.28.
[large model, incorrect timestamp precision]
20:26.000 --> 20:28.000
クソもう30分も遅れてる
20:29.000 --> 20:30.000
ブラインドデートに
20:30.000 --> 20:35.000
あんな子が来てくれたらいいのにな
[medium model, correct timestamp precision]
20:26.280 --> 20:29.480
ああくそもう30分も遅れてる
20:29.480 --> 20:32.280
ブラインドデートにあんな子が来てくれたらいいのにな