Variable audio_ctx consistently produces ~3x speedup for short audio clips #1855
Comments
Nice! The idea to compute the Encoder with partial context was introduced and discussed to some extent here: #137. I often use it for short audio segments on low-end devices such as Raspberry Pis. It's surprising to see the WER being lower with partial context though - I've never measured it, but I expected that the quality would be worse when using smaller
Yes, that was a great thread, and it motivated some of my experiments to better understand the impact of using partial context with the Encoder. To your point, choosing a value for
Since this approach seems generally useful, I created a quick PR (#1857) to make it easier to use the
When transcribing short audio clips (e.g., <30 seconds, and usually between 5-10 seconds), I've noticed that the `audio_ctx` parameter can greatly increase performance when set appropriately. After some experimentation, it seems that using the length of the audio clip to scale the value of `audio_ctx` works quite well. I have been using `audio_ctx = (audio length in seconds/30 seconds)*1500 + 128` somewhat arbitrarily.

To confirm these observations I did some comparisons using clips from the Common Voice dataset, the
`base.en` model, and this hardware configuration:

CPU: Intel i7-11700K
RAM: DDR4 - 3200 MHz

`whisper.cpp` build details:

These are the results I'm seeing for 200 random clips from Common Voice (average length ~5.7 seconds):
I also see similar results with the `tiny.en` model.

Overall, this seems like a great way to get a ~3-3.5x speedup on CPU with no significant penalty to accuracy when working with shorter audio clips. Might there be some other side effects or downsides that I'm not considering when using `audio_ctx` in this way?
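
For reference, the scaling rule from the issue, `audio_ctx = (audio length in seconds/30 seconds)*1500 + 128`, can be sketched in a few lines of Python. This is only an illustration: the function name `scaled_audio_ctx` and its parameter names are hypothetical, and two details are assumptions not stated in the issue, namely rounding to the nearest integer and capping the result at the full 1500-position context that corresponds to a 30-second window.

```python
def scaled_audio_ctx(duration_s: float,
                     window_s: float = 30.0,
                     full_ctx: int = 1500,
                     margin: int = 128) -> int:
    """Scale the encoder context to the clip length, plus a fixed margin.

    Mirrors the rule from the issue:
        audio_ctx = (duration / 30 s) * 1500 + 128
    Rounding and the upper cap are assumptions made for this sketch.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Fraction of the 30-second window that the clip occupies.
    fraction = duration_s / window_s
    # Round to the nearest integer (assumption: the issue does not say how).
    ctx = round(fraction * full_ctx) + margin
    # Assumption: never request more than the full context.
    return min(ctx, full_ctx)


if __name__ == "__main__":
    # Average clip length reported in the issue (~5.7 s): 285 + 128
    print(scaled_audio_ctx(5.7))   # 413
    # A full 30-second clip would exceed 1500 with the margin, so it is capped.
    print(scaled_audio_ctx(30.0))  # 1500
```

For the ~5.7-second average clip in the comparison above, this gives a context of roughly 413 positions instead of 1500, which is in line with the direction of the reported speedup, though the sketch itself says nothing about accuracy.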