
whisper: how to get streaming word level timestamps? (automatic-speech-recognition) #1198

Open
getflourish opened this issue Feb 18, 2025 · 3 comments
Labels
question Further information is requested

Comments

@getflourish

getflourish commented Feb 18, 2025

Question

Goal

  • streaming
  • word level timestamps

Issue

on_chunk_start / on_chunk_end are not called when using return_timestamps: "word".
These callbacks only provide timestamps with return_timestamps: true.

I also tried decoding the tokens myself, as I've seen in the demo, but that relies on callbacks that no longer exist (e.g. chunk_callback(chunk) and callback_function(item)).

Setup

const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-tiny",
  { device: "webgpu" }
);

// Passed as `token_callback_function` in the streamer options
const token_callback_function = (tokens) => {
  const { feature_extractor } = transcriber.processor;
  const { config: modelConfig } = transcriber.model;

  const time_precision =
    feature_extractor.config.chunk_length / modelConfig.max_source_positions;

  if (tokens) {
    const data = transcriber.tokenizer._decode_asr(
      [{ tokens, finalised: false }],
      {
        time_precision,
        return_timestamps: true,
        force_full_sequences: false,
      }
    );

    console.log("data", data);
  }
};

Decoding works, but timestamps are null.
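For reference, with Whisper's default config values (chunk_length = 30 seconds, max_source_positions = 1500 encoder positions — assumed here; the callback above reads them from the loaded model), the time_precision works out to 20 ms per timestamp token:

```javascript
// Quick sanity check of the time_precision computed in the callback above,
// using Whisper's default config values (assumed; at runtime they are read
// from the loaded feature extractor and model config).
const chunk_length = 30;            // seconds of audio per processed chunk
const max_source_positions = 1500;  // encoder output positions per chunk
const time_precision = chunk_length / max_source_positions;
console.log(time_precision); // 0.02 → each timestamp token advances 20 ms
```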

[Screenshot: decoded output with null timestamps]
@getflourish getflourish added the question Further information is requested label Feb 18, 2025
@xenova
Collaborator

xenova commented Feb 18, 2025

The algorithm we use to compute word-level timestamps (dynamic time warping) requires the entire chunk to be processed, so streamed word-level timestamps aren't currently possible.
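For intuition: DTW finds a minimal-cost monotonic alignment path through a full cost matrix, and the backtracking starts from the final cell — so the alignment only exists once the whole chunk has been processed. A toy sketch of the idea (not the actual Whisper implementation, which aligns cross-attention weights to audio frames):

```javascript
// Toy dynamic time warping: align two sequences via a full cost matrix.
function dtwPath(cost) {
  const n = cost.length;
  const m = cost[0].length;
  // Forward pass: accumulate the minimal cost of reaching each cell.
  const acc = Array.from({ length: n }, () => new Array(m).fill(Infinity));
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < m; j++) {
      const prev = (i === 0 && j === 0) ? 0 : Math.min(
        i > 0 ? acc[i - 1][j] : Infinity,
        j > 0 ? acc[i][j - 1] : Infinity,
        (i > 0 && j > 0) ? acc[i - 1][j - 1] : Infinity,
      );
      acc[i][j] = cost[i][j] + prev;
    }
  }
  // Backward pass: trace the optimal path from the *last* cell — this is
  // the step that needs the whole matrix (i.e. the full audio chunk).
  const path = [[n - 1, m - 1]];
  let i = n - 1;
  let j = m - 1;
  while (i > 0 || j > 0) {
    const candidates = [[i - 1, j - 1], [i - 1, j], [i, j - 1]]
      .filter(([a, b]) => a >= 0 && b >= 0);
    [i, j] = candidates.reduce((best, c) =>
      acc[c[0]][c[1]] < acc[best[0]][best[1]] ? c : best);
    path.push([i, j]);
  }
  return path.reverse();
}

console.log(dtwPath([
  [0, 2, 3],
  [2, 0, 2],
  [3, 2, 0],
])); // → [[0, 0], [1, 1], [2, 2]] (the diagonal alignment)
```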

@getflourish
Author

I understand!

Ideally, what I want in the end is word-level timestamps.

So I have to use return_timestamps: "word".

Unfortunately, that just gives me all the individual words as one large array of words and their timestamps (as expected).

But at the same time, I would also like the grouping that return_timestamps: true provides.

How can I get the best of both worlds?

I can imagine post-processing the words into meaningful sentences, but I wonder if that's the way to go…
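One possible post-processing sketch (not part of transformers.js): group the word entries returned by return_timestamps: "word" into sentence-like segments by splitting on sentence-final punctuation. The { text, timestamp: [start, end] } shape below matches the pipeline's chunks output; the sample words are made up for illustration:

```javascript
// Group word-level timestamp entries into sentence-like segments.
// A segment ends whenever a word ends with sentence-final punctuation.
function groupWords(words) {
  const segments = [];
  let current = { text: "", timestamp: [null, null] };
  for (const word of words) {
    if (current.timestamp[0] === null) current.timestamp[0] = word.timestamp[0];
    current.text += word.text;
    current.timestamp[1] = word.timestamp[1];
    if (/[.!?]$/.test(word.text.trim())) {
      segments.push(current);
      current = { text: "", timestamp: [null, null] };
    }
  }
  if (current.text) segments.push(current); // trailing words without punctuation
  return segments;
}

// Hypothetical sample data in the shape of the "word" chunks output:
const words = [
  { text: " Hello", timestamp: [0.0, 0.4] },
  { text: " world.", timestamp: [0.4, 0.9] },
  { text: " How", timestamp: [1.1, 1.3] },
  { text: " are", timestamp: [1.3, 1.5] },
  { text: " you?", timestamp: [1.5, 1.8] },
];
console.log(groupWords(words));
// → [{ text: " Hello world.", timestamp: [0, 0.9] },
//    { text: " How are you?", timestamp: [1.1, 1.8] }]
```

This keeps the per-word timestamps available while recovering segment boundaries; a smarter variant could also split on long inter-word pauses.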

@getflourish
Author

Since the model is by OpenAI, I wonder if the API could be aligned with their approach of accepting "segment", "word", or both in a timestamp_granularities array.

https://platform.openai.com/docs/api-reference/chat

The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word or segment. Note: there is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
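A hypothetical aligned option for the pipeline call — not a real transformers.js parameter, shown purely to illustrate the proposal:

```javascript
// Hypothetical API, NOT implemented — mirrors OpenAI's timestamp_granularities:
const output = await transcriber(audio, {
  timestamp_granularities: ["word", "segment"],
});
```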
