CrisperWhisper takes way longer than large-v3 when running transcribe task #29

Hongshen2010 · 2025-02-11T08:49:40Z

Hey there! I recently wanted to try CrisperWhisper for its awesome support of filler words. One frustrating thing, though, is that it takes much longer than the large-v3 model when running the transcribe task.

I use WhisperX, which uses FasterWhisper as its backend, to load and run the CrisperWhisper model. It runs in local mode:

    ...
    model_name = "nyrahealth/faster_CrisperWhisper"
    model = whisperx.load_model(
        model_name,
        device,
        compute_type=compute_type,
        language=language,
        asr_options=default_asr_options,
        vad_method=vad_method,
        vad_options=vad_options,
        local_files_only=True,
    )
    print(f'[load transcribe model] Model: {model_name} Time cost: {timer.tok():0.4f} seconds')

    audio = whisperx.load_audio(audio_file)
    print(f'[load audio] Model: {model_name} Time cost: {timer.tok():0.4f} seconds')

    print(f'[transcribe] chunk_size: {chunk_size} seconds')
    result = model.transcribe(
        audio, batch_size=batch_size, task=task, chunk_size=chunk_size
    )
    print(f'[transcribe] File: {audio_file}: Model: {model_name} Time cost: {timer.tok():0.4f} seconds')
    ...

Run logs for reference:

INFO:speechbrain.utils.quirks:Applied quirks (see `speechbrain.utils.quirks`): [allow_tf32, disable_jit_profiling]
INFO:speechbrain.utils.quirks:Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
whisperx version: 3.3.1 cuda:True. language: en. transLang: None. vad_onset: 0.1. vad_offset: 0.05
>>Performing voice activity detection using Pyannote...
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint venv/lib/python3.10/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1+cu124. Bad things might happen unless you revert torch to 1.x.
[load transcribe model] Model: nyrahealth/faster_CrisperWhisper Time cost: 3.5804 seconds
[load audio] Model: nyrahealth/faster_CrisperWhisper Time cost: 3.9981 seconds
[transcribe] chunk_size: 30 seconds
/app/venv/lib/python3.10/site-packages/pyannote/audio/utils/reproducibility.py:74: ReproducibilityWarning: TensorFloat-32 (TF32) has been disabled as it might lead to reproducibility issues and lower accuracy.
It can be re-enabled by calling
   >>> import torch
   >>> torch.backends.cuda.matmul.allow_tf32 = True
   >>> torch.backends.cudnn.allow_tf32 = True
See https://github.com/pyannote/pyannote-audio/issues/1370 for more details.

  warnings.warn(
Suppressing numeral and symbol tokens
[transcribe] File: audio-raw.audio: Model: nyrahealth/faster_CrisperWhisper Time cost: 124.0263 seconds
[transcribe] before alignment: audio-raw.audio: Time cost: 124.0269 seconds
[align] Model: nyrahealth/faster_CrisperWhisper Time cost: 134.8052 seconds
[align] after alignment: audio-raw.audio: Time cost: 134.8109 seconds

For a 3-minute audio file, large-v3 takes about 45 seconds to finish, while CrisperWhisper takes about 124 seconds.
Both run on a VM with these specs:

Machine type: n1-standard-8 (8 vCPUs, 30 GB Memory)
CPU platform: Intel Haswell
GPUs: 1 x NVIDIA T4

Both CrisperWhisper and large-v3 use FasterWhisper as the backend, so the longer transcription time doesn't make sense to me.
Any idea what causes such a large difference in time cost?

bruno-hays · 2025-02-27T15:27:44Z

Hello,
You should take a look at the time per token for each model.
CrisperWhisper transcribes everything verbatim, which means more tokens and therefore a longer time to generate the transcription.
~3x more tokens seems a bit excessive, though.
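
One rough way to check this is to normalize each run's wall-clock time by the amount of text it produced. The sketch below is only an illustration: it assumes the transcription result is a dict with a `"segments"` list whose items carry a `"text"` field, and it counts whitespace-separated words as a stand-in for decoder tokens, since the real token count depends on the tokenizer. The helper names and the two sample transcripts are hypothetical; only the 124.0263 s and ~45 s timings come from the report above.

```python
def count_words(result):
    """Count whitespace-separated words across all segments.

    This is only a proxy for generated tokens (assumption): the real
    token count depends on the model's tokenizer.
    """
    return sum(len(seg["text"].split()) for seg in result["segments"])


def seconds_per_word(result, elapsed_seconds):
    """Wall-clock seconds spent per transcribed word."""
    words = count_words(result)
    if words == 0:
        raise ValueError("transcript is empty")
    return elapsed_seconds / words


# Hypothetical transcripts, made up for illustration only.
verbatim = {"segments": [{"text": "um so I I think uh it works"}]}
clean = {"segments": [{"text": "So I think it works."}]}

# Timings reported in this issue: CrisperWhisper vs. large-v3.
crisper_seconds = 124.0263
large_v3_seconds = 45.0  # "about 45 seconds" in the report

# Overall slowdown, independent of transcript length.
slowdown = crisper_seconds / large_v3_seconds

# How much more text the verbatim transcript contains.
word_ratio = count_words(verbatim) / count_words(clean)

print(f"slowdown: {slowdown:.2f}x, word ratio: {word_ratio:.2f}x")
```

If the word (or token) ratio between the two real transcripts is much smaller than the slowdown, the extra output alone probably does not explain the gap, and decoding options or model configuration would be the next thing to compare.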
