[Distil-Whisper] Add support for Distil-Whisper #1423

Open
patrickvonplaten opened this issue Nov 3, 2023 · 9 comments
Labels
high priority Very important issue

Comments

@patrickvonplaten

Hey,

We've recently released two Distil-Whisper checkpoints:

  • Large-v2-32-2 which is a 32-encoder layer, 2-decoder layer distilled large-v2 checkpoint
  • Medium-24-2.en which is a 24-encoder layer, 2-decoder layer distilled medium.en checkpoint

On GPU, we achieve speed-ups of up to 6x compared to the teacher models, with relatively minimal degradation in performance.
More information here: https://twitter.com/sanchitgandhi99/status/1719409022246220184

Using your conversion scripts, we've already converted the checkpoints to .cpp format see:

We'd love to collaborate on supporting the checkpoints in this repository, as we're really excited to see the potential speed-ups that can be achieved with optimized C++ code.

It looks like some changes to whisper.cpp will be necessary for such a change (e.g. we should probably define a new model type here?)

@ggerganov would you be interested in adding Distil-Whisper?

@patrickvonplaten
Author

Linking for visibility: #1414

@bobqianic added the "high priority" (Very important issue) label on Nov 3, 2023
@ggerganov
Owner

Hi @patrickvonplaten - congrats on the release!

I believe I have successfully added initial support for the distilled models in the following PR: #1424

However, I'm worried that for optimal quality, AFAICT these models require an alternative decoding strategy with overlapping chunks for long-form transcriptions. This can take more time to implement and I am not sure yet how to fit it in the existing implementation.

Could you point me to the reference implementation?

I will give it some thought and see if I can come up with a solution in the following days.
For the moment, #1424 should hopefully work as an initial version.

@patrickvonplaten
Author

Hey @ggerganov,

The implementation we're using in Transformers actually uses overlapping chunks. We overlap each chunk by 2.5 seconds. Essentially, we follow the strategy described here: https://huggingface.co/blog/asr-chunking using a chunk length of 15 seconds and a chunk_stride of 2.5 seconds (default).

It's all implemented here: https://github.com/huggingface/transformers/blob/ac5d4cf6de24b4f7fa92996e92d1d71dd5411a6a/src/transformers/pipelines/automatic_speech_recognition.py#L135 and the code to run in inference for debugging should be this one: https://github.com/huggingface/distil-whisper/tree/main#long-form-transcription
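For readers who want the window arithmetic without reading the pipeline code, here is a minimal sketch. It assumes the stride is applied on both sides of each chunk, as the linked blog post describes, so consecutive windows advance by `chunk_s - 2 * stride_s`. The function name `chunk_windows` and its return layout are hypothetical, not part of Transformers or whisper.cpp, and the real pipeline merges the decoded text rather than simply cutting at fixed times.

```python
from typing import Iterator, Tuple


def chunk_windows(
    total_s: float, chunk_s: float = 15.0, stride_s: float = 2.5
) -> Iterator[Tuple[float, float, float, float]]:
    """Yield (start, end, keep_from, keep_to) in seconds for each window.

    Hypothetical helper: [start, end] is the audio fed to the model and
    [keep_from, keep_to] is the inner region whose output would be kept.
    Assumes the stride applies to both sides of every chunk.
    """
    step = chunk_s - 2 * stride_s
    if step <= 0:
        raise ValueError("chunk_s must be larger than 2 * stride_s")

    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        is_first = start == 0.0
        is_last = end >= total_s
        # The first and last windows have no neighbour on one side,
        # so nothing is discarded at the outer edges of the audio.
        keep_from = start if is_first else start + stride_s
        keep_to = end if is_last else end - stride_s
        yield (start, end, keep_from, keep_to)
        if is_last:
            break
        start += step


if __name__ == "__main__":
    for window in chunk_windows(40.0):
        print(window)
```

With the values quoted above (15 second chunks, 2.5 second stride), the kept regions tile the audio without gaps, while each window still sees 2.5 seconds of context beyond the region it contributes.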

The other option is to just use openai's codebase: https://github.com/openai/whisper using distil-whisper checkpoints converted into the original format: https://huggingface.co/distil-whisper/distil-large-v2/blob/main/original-model.fp32.bin

Does this help? I'm also working on adding OAI's approach natively to Transformers for easier debugging, but this might take until next week.

@ggerganov
Owner

Thanks for the links. Will probably look into chunking after I make the v1.5.0 release of whisper.cpp.

@rawwerks

rawwerks commented Dec 13, 2023

i would like to weigh in from the "end user peanut gallery" that i believe the full implementation of the chunking for distil-whisper would be a major inflection point for the widespread adoption of whisper.cpp. qualitatively, the recent speed improvements were able to help products like MacWhisper get to a point where consumer hardware (M1) can now transcribe short audio faster than you can upload/transcribe/download via a cloud service like Otter or Happyscribe. if we can get the extra 5-6x from distil-whisper, then even hours long transcriptions of meetings, podcasts, etc, could be transcribed in minutes to tens of minutes on consumer hardware (with respectable accuracy (medium or large))

of course everyone would rather transcribe locally for privacy and cost reasons. you have the power to make this practical. everyone will have their own private transcriptionist. we don't need another 10x to make this a UX inflection, just another 5x will seriously change the game.

thank you for the important work that you do!

@PoignardAzur

I haven't managed to run the conversion scripts myself (see #1711).

Is there any chance you could release additional versions, using the GGUF format with the recent quantization options?

@ciekawy

ciekawy commented Jan 10, 2024

Any chance this could also support https://huggingface.co/Aspik101/distil-whisper-large-v3-pl ?

@johnmccombs1

I'd love to see this as well. The distil models run so much faster but unfortunately for anything longer than 10-20 seconds, it starts cutting out words/phrases. I tested against a distil model using regular Whisper here https://huggingface.co/spaces/distil-whisper/whisper-vs-distil-whisper with the same audio file and it works nearly flawlessly. But for some reason using it through whisper.cpp creates a large number of errors and words that are cut off or misspelled (I'm assuming it's because it's chunking oddly). Would love to see this fixed.

@hlevring

hlevring commented Apr 2, 2024

@patrickvonplaten with the latest release, distil-large-v3, my understanding is that the distilled model is no longer exclusively tied to the chunked algorithm:
https://huggingface.co/distil-whisper/distil-large-v3
https://huggingface.co/distil-whisper/distil-large-v3-ggml

So maybe this ticket could be closed? I suppose it mainly remained open to address the chunking?

8 participants