
whisper: how to get streaming word level timestamps? (automatic-speech-recognition) #1198

Open
getflourish opened this issue Feb 18, 2025 · 3 comments
Labels
question Further information is requested

Comments

@getflourish

getflourish commented Feb 18, 2025

Question

Goal

  • streaming
  • word level timestamps

Issue

on_chunk_start / on_chunk_end are not called when using return_timestamps: "word".
These callbacks only provide timestamps with return_timestamps: true.

I also tried decoding the tokens myself, as I've seen in the demo, but that relies on callbacks that no longer exist (e.g. chunk_callback(chunk) and callback_function(item)).

Setup

const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-tiny",
  { device: "webgpu" }
);

// Passed as `token_callback_function` in the streamer options
const token_callback_function = (tokens) => {
  const { feature_extractor } = transcriber.processor;
  const { config: modelConfig } = transcriber.model;

  const time_precision =
    feature_extractor.config.chunk_length / modelConfig.max_source_positions;

  if (tokens) {
    const data = transcriber.tokenizer._decode_asr(
      [{ tokens, finalised: false }],
      {
        time_precision,
        return_timestamps: true,
        force_full_sequences: false,
      }
    );

    console.log("data", data);
  }
};

Decoding works, but timestamps are null.
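For reference, with Whisper's default config values (chunk_length = 30 seconds, max_source_positions = 1500 encoder positions — assumed here; the callback above reads them from the loaded model), the time_precision works out to 20 ms per timestamp token:

```javascript
// Quick sanity check of the time_precision computed in the callback above,
// using Whisper's default config values (assumed; at runtime they are read
// from the loaded feature extractor and model config).
const chunk_length = 30;            // seconds of audio per processed chunk
const max_source_positions = 1500;  // encoder output positions per chunk
const time_precision = chunk_length / max_source_positions;
console.log(time_precision); // 0.02 → each timestamp token advances 20 ms
```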

[Screenshot: decoded output with null timestamps]
@getflourish getflourish added the question Further information is requested label Feb 18, 2025
@xenova
Collaborator

xenova commented Feb 18, 2025

The algorithm we use to compute word-level timestamps (dynamic time warping) requires the entire chunk to be processed, so streamed word-level timestamps aren't currently possible.
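For intuition: DTW finds a minimal-cost monotonic alignment path through a full cost matrix, and the backtracking starts from the final cell — so the alignment only exists once the whole chunk has been processed. A toy sketch of the idea (not the actual Whisper implementation, which aligns cross-attention weights to audio frames):

```javascript
// Toy dynamic time warping: align two sequences via a full cost matrix.
function dtwPath(cost) {
  const n = cost.length;
  const m = cost[0].length;
  // Forward pass: accumulate the minimal cost of reaching each cell.
  const acc = Array.from({ length: n }, () => new Array(m).fill(Infinity));
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < m; j++) {
      const prev = (i === 0 && j === 0) ? 0 : Math.min(
        i > 0 ? acc[i - 1][j] : Infinity,
        j > 0 ? acc[i][j - 1] : Infinity,
        (i > 0 && j > 0) ? acc[i - 1][j - 1] : Infinity,
      );
      acc[i][j] = cost[i][j] + prev;
    }
  }
  // Backward pass: trace the optimal path from the *last* cell — this is
  // the step that needs the whole matrix (i.e. the full audio chunk).
  const path = [[n - 1, m - 1]];
  let i = n - 1;
  let j = m - 1;
  while (i > 0 || j > 0) {
    const candidates = [[i - 1, j - 1], [i - 1, j], [i, j - 1]]
      .filter(([a, b]) => a >= 0 && b >= 0);
    [i, j] = candidates.reduce((best, c) =>
      acc[c[0]][c[1]] < acc[best[0]][best[1]] ? c : best);
    path.push([i, j]);
  }
  return path.reverse();
}

console.log(dtwPath([
  [0, 2, 3],
  [2, 0, 2],
  [3, 2, 0],
])); // → [[0, 0], [1, 1], [2, 2]] (the diagonal alignment)
```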

@getflourish
Author

I understand!

Ideally, what I want in the end is word-level timestamps.

So I have to use return_timestamps: "word".

Unfortunately, that just gives me all the individual words as one large array of words and their timestamps (as expected).

But at the same time, I would also like the grouping that return_timestamps: true provides.

How can I get the best of both worlds?

I can imagine post-processing the words into meaningful sentences, but I wonder if that's the way to go…
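One possible post-processing sketch (not part of transformers.js): group the word entries returned by return_timestamps: "word" into sentence-like segments by splitting on sentence-final punctuation. The { text, timestamp: [start, end] } shape below matches the pipeline's chunks output; the sample words are made up for illustration:

```javascript
// Group word-level timestamp entries into sentence-like segments.
// A segment ends whenever a word ends with sentence-final punctuation.
function groupWords(words) {
  const segments = [];
  let current = { text: "", timestamp: [null, null] };
  for (const word of words) {
    if (current.timestamp[0] === null) current.timestamp[0] = word.timestamp[0];
    current.text += word.text;
    current.timestamp[1] = word.timestamp[1];
    if (/[.!?]$/.test(word.text.trim())) {
      segments.push(current);
      current = { text: "", timestamp: [null, null] };
    }
  }
  if (current.text) segments.push(current); // trailing words without punctuation
  return segments;
}

// Hypothetical sample data in the shape of the "word" chunks output:
const words = [
  { text: " Hello", timestamp: [0.0, 0.4] },
  { text: " world.", timestamp: [0.4, 0.9] },
  { text: " How", timestamp: [1.1, 1.3] },
  { text: " are", timestamp: [1.3, 1.5] },
  { text: " you?", timestamp: [1.5, 1.8] },
];
console.log(groupWords(words));
// → [{ text: " Hello world.", timestamp: [0, 0.9] },
//    { text: " How are you?", timestamp: [1.1, 1.8] }]
```

This keeps the per-word timestamps available while recovering segment boundaries; a smarter variant could also split on long inter-word pauses.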

@getflourish
Author

Since the model is by OpenAI, I wonder if the API could be aligned with their approach of accepting "segment", "word", or both in a timestamp_granularities array.

https://platform.openai.com/docs/api-reference/chat

The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word or segment. Note: there is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
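A hypothetical aligned option for the pipeline call — not a real transformers.js parameter, shown purely to illustrate the proposal:

```javascript
// Hypothetical API, NOT implemented — mirrors OpenAI's timestamp_granularities:
const output = await transcriber(audio, {
  timestamp_granularities: ["word", "segment"],
});
```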
