[Feature request] Real time whisper transcription #405
Comments
Real-time transcription will hopefully be possible once WebGPU support is added, and we'll definitely revisit this (and update the demo) once it is. If someone in the community would like to try modifying the whisper-web source code (or provide a basic streaming implementation) that could be adapted once WebGPU is supported, that would be great! 😇
Curious why this is waiting for WebGPU. At least on my pre-M1 MacBook Pro, decoding is faster than the duration of the recording. What would be needed is a way to feed audio frames asynchronously instead of all at once.
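To make the "feed audio frames asynchronously" idea concrete, here is a minimal sketch of a rolling buffer that collects incoming `Float32Array` frames and keeps only the most recent samples. `RollingAudioBuffer` and its methods are hypothetical names invented for illustration; they are not part of Transformers.js, and the sketch assumes the caller re-runs transcription on each snapshot.

```typescript
// Hypothetical helper (not a Transformers.js API): keeps the most recent
// `maxSamples` samples from a stream of incoming audio frames.
class RollingAudioBuffer {
  private buffer: Float32Array;
  private maxSamples: number;
  private length: number;

  constructor(maxSamples: number) {
    this.maxSamples = maxSamples;
    this.buffer = new Float32Array(maxSamples);
    this.length = 0;
  }

  // Append a frame, discarding the oldest samples if the buffer would overflow.
  push(frame: Float32Array): void {
    if (frame.length >= this.maxSamples) {
      // The frame alone fills the buffer: keep only its tail.
      this.buffer.set(frame.subarray(frame.length - this.maxSamples));
      this.length = this.maxSamples;
      return;
    }
    const overflow = this.length + frame.length - this.maxSamples;
    if (overflow > 0) {
      // Shift the retained samples to the front to make room.
      this.buffer.copyWithin(0, overflow, this.length);
      this.length -= overflow;
    }
    this.buffer.set(frame, this.length);
    this.length += frame.length;
  }

  // Return a copy of the samples currently held, oldest first.
  snapshot(): Float32Array {
    return this.buffer.slice(0, this.length);
  }
}
```

Each snapshot is an ordinary `Float32Array`, so it could be handed to an interface that only accepts complete arrays, at the cost of re-processing the whole window every time.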
The major bottleneck at the moment is the encoder, which can take a few seconds to process ~30 seconds of audio. Ideally, processing shorter audio sequences would take much less time; however, this is a hard constraint of the architecture. The initial transformation into log-mel spectrogram space produces 30-second chunks that are fed into the encoder. See here for more discussion on this.
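The 30-second constraint can be illustrated with a small sketch of the padding step. The constants below (16 kHz sample rate, 30 s window, hop length of 160 samples) are the values published for the original Whisper models; `padOrTrim` is an illustrative helper written for this thread, not a Transformers.js function.

```typescript
// Values published for the original Whisper models (assumed here).
const SAMPLE_RATE = 16000;  // samples per second
const CHUNK_SECONDS = 30;   // fixed encoder window
const HOP_LENGTH = 160;     // samples between spectrogram frames

const N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS; // 480000 samples per window
const N_FRAMES = N_SAMPLES / HOP_LENGTH;       // 3000 spectrogram frames

// Illustrative helper: zero-pad or truncate audio to exactly one window.
function padOrTrim(audio: Float32Array, length: number = N_SAMPLES): Float32Array {
  if (audio.length >= length) {
    return audio.slice(0, length);
  }
  const out = new Float32Array(length); // zero-filled by default
  out.set(audio);
  return out;
}
```

Under these assumptions a 1-second clip and a 30-second clip both become 480000 samples (3000 frames) before reaching the encoder, which is why shorter input does not make the encoder pass cheaper.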
Sorry for the super late reply. That makes sense. Thanks for the link to the discussions. Let me bring more visibility to this issue and see if someone is interested in contributing.
It's not real time, but it might give someone some inspiration for chunked processing.
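For anyone exploring chunked processing, one common approach is to split long audio into fixed-size windows that overlap slightly, so that words cut at a boundary appear whole in at least one window. The sketch below only computes the window boundaries; `overlappingWindows` and `SampleWindow` are hypothetical names, and this is not necessarily how any existing demo does it.

```typescript
// Hypothetical type: a half-open sample range [start, end).
interface SampleWindow {
  start: number;
  end: number;
}

// Split `totalSamples` into windows of `windowSamples`, each sharing
// `overlapSamples` with the previous one. The last window may be shorter.
function overlappingWindows(
  totalSamples: number,
  windowSamples: number,
  overlapSamples: number,
): SampleWindow[] {
  if (windowSamples <= 0 || overlapSamples < 0 || overlapSamples >= windowSamples) {
    throw new RangeError("require 0 <= overlapSamples < windowSamples");
  }
  const step = windowSamples - overlapSamples;
  const windows: SampleWindow[] = [];
  for (let start = 0; start < totalSamples; start += step) {
    const end = Math.min(start + windowSamples, totalSamples);
    windows.push({ start, end });
    if (end === totalSamples) break; // the audio is fully covered
  }
  return windows;
}
```

Each range can then be cut out with `audio.subarray(start, end)` and transcribed separately; merging the overlapping text is a separate problem that this sketch does not address.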
Does ONNX deprecate the WebGL backend?
Hi @xenova,
This is now possible with Transformers.js v3: https://x.com/xenovacom/status/1799110540700078422 🥳 (video: whisper-realtime.mp4) I'll close this issue once Transformers.js v3 is officially out and #545 is merged 🚀
@xenova I tried the demo, and the latency is still poor... What can be done to improve this? Smaller models? Custom GPUs? See a video preview here: https://streamable.com/m7oyq1
It has been showing "loading model" for a while.
Real time whisper transcription
Right now the demo works on a recording, but it transcribes it in one shot. I'd love to be able to transcribe as I speak. Sadly, the interface seems to accept only a Float32Array (or an array of them), with no way to keep feeding it Float32Arrays as we receive them from the audio source.
It would be great to be able to do this in a streaming fashion.
Reason for request
I want to build a tool to help with recording voice-overs, and I want a real-time transcription to overlay on top of the existing one to help get a sense of progress.
Thanks <3