Hey there! I recently wanted to try CrisperWhisper for its awesome support of filler words. But one frustrating thing is that it takes much longer than the large-v3 model on the same transcription task.
I use WhisperX to load and run the CrisperWhisper model, which uses FasterWhisper as the backend; everything runs in local mode.
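For context, the setup is roughly the following (a minimal sketch rather than my exact script; the `compute_type` and `batch_size` values here are assumptions):

```python
import whisperx

device = "cuda"

# Load the CrisperWhisper checkpoint through WhisperX; under the hood
# this uses the FasterWhisper (CTranslate2) backend.
model = whisperx.load_model(
    "nyrahealth/faster_CrisperWhisper", device, compute_type="float16"
)

audio = whisperx.load_audio("audio-raw.audio")
result = model.transcribe(audio, batch_size=16)
print(result["segments"][0]["text"])
```

Running logs for reference: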
```
INFO:speechbrain.utils.quirks:Applied quirks (see `speechbrain.utils.quirks`): [allow_tf32, disable_jit_profiling]
INFO:speechbrain.utils.quirks:Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
whisperx version: 3.3.1 cuda:True. language: en. transLang: None. vad_onset: 0.1. vad_offset: 0.05
>>Performing voice activity detection using Pyannote...
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint venv/lib/python3.10/site-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.5.1+cu124. Bad things might happen unless you revert torch to 1.x.
[load transcribe model] Model: nyrahealth/faster_CrisperWhisper Time cost: 3.5804 seconds
[load audio] Model: nyrahealth/faster_CrisperWhisper Time cost: 3.9981 seconds
[transcribe] chunk_size: 30 seconds
/app/venv/lib/python3.10/site-packages/pyannote/audio/utils/reproducibility.py:74: ReproducibilityWarning: TensorFloat-32 (TF32) has been disabled as it might lead to reproducibility issues and lower accuracy.
It can be re-enabled by calling
   >>> import torch
   >>> torch.backends.cuda.matmul.allow_tf32 = True
   >>> torch.backends.cudnn.allow_tf32 = True
See https://github.com/pyannote/pyannote-audio/issues/1370 for more details.
  warnings.warn(
Suppressing numeral and symbol tokens
[transcribe] File: audio-raw.audio: Model: nyrahealth/faster_CrisperWhisper Time cost: 124.0263 seconds
[transcribe] before alignment: audio-raw.audio: Time cost: 124.0269 seconds
[align] Model: nyrahealth/faster_CrisperWhisper Time cost: 134.8052 seconds
[align] after alignment: audio-raw.audio: Time cost: 134.8109 seconds
```
For a 3-minute audio file, large-v3 takes about 45 seconds to finish, while CrisperWhisper takes about 124 seconds.
They both run on a VM with these specs:
Machine type: n1-standard-8 (8 vCPUs, 30 GB Memory)
CPU platform: Intel Haswell
GPUs: 1 x NVIDIA T4
Both CrisperWhisper and large-v3 use FasterWhisper as the backend, so the much longer transcription time doesn't make sense to me.
Any idea why there is such a huge time cost discrepancy?
Hello,
You should take a look at the time per token for each model. CrisperWhisper transcribes everything, including filler words, which means more tokens, which in turn means a longer time to generate the transcription.
That said, ~3x more tokens seems a bit excessive.
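If it helps isolate the effect, here is a rough way to compare the two models (a sketch only: `seconds_per_token` is a hypothetical helper, and it approximates token counts with whitespace-split words rather than each model's actual tokenizer):

```python
import time
import whisperx

def seconds_per_token(model_name: str, audio_path: str, device: str = "cuda") -> float:
    """Rough seconds-per-token estimate for a model loaded through WhisperX."""
    model = whisperx.load_model(model_name, device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)

    start = time.time()
    result = model.transcribe(audio, batch_size=16)
    elapsed = time.time() - start

    # Approximate the token count with whitespace-split words; exact counts
    # would need the model's tokenizer, but this is enough for a ratio.
    n_tokens = sum(len(seg["text"].split()) for seg in result["segments"])
    return elapsed / max(n_tokens, 1)

for name in ("large-v3", "nyrahealth/faster_CrisperWhisper"):
    print(name, seconds_per_token(name, "audio-raw.audio"))
```

If the seconds-per-token figures come out similar, the gap is mostly the extra tokens CrisperWhisper emits; if they differ a lot, something else (e.g. decoding settings) is slowing it down.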