# whisply 💬

Transcribe, translate, annotate and subtitle audio and video files with OpenAI's Whisper ... fast!

`whisply` combines `faster-whisper` and `insanely-fast-whisper` to offer an easy-to-use solution for batch processing files. It also enables word-level speaker annotation by integrating `whisperX` and `pyannote`.
## Features
- 🚴‍♂️ **Performance**: Depending on your hardware, `whisply` will use the fastest `Whisper` implementation:
    - CPU: `faster-whisper` or `whisperX`
    - GPU (Nvidia CUDA) and MPS (Metal Performance Shaders, Apple M1-M3): `insanely-fast-whisper` or `whisperX`
- ✅ **Auto device selection**: When performing transcription or translation tasks without speaker annotation or subtitling, `faster-whisper` (CPU) or `insanely-fast-whisper` (MPS, Nvidia GPUs) will be selected automatically based on your hardware if you do not provide a device via the `--device` option (see the example after this list).
- 🗣️ **Word-level annotations**: If you choose to `--subtitle` or `--annotate`, `whisperX` will be used, as it supports word-level segmentation and speaker annotation. Depending on your hardware, `whisperX` can run either on CPU or Nvidia GPU (but not on Apple MPS). Out of the box, `whisperX` will not provide timestamps for words containing only numbers (e.g. "1.5" or "2024"); `whisply` fixes those instances through timestamp approximation.
- 💬 **Subtitles**: Subtitle generation is customizable: you can specify the number of words per subtitle block (e.g., choosing "5" will generate `.srt` and `.webvtt` files in which each subtitle block contains exactly 5 words with the corresponding timestamps).
- 🧺 **Batch processing**: `whisply` can process single files, whole folders, URLs or any combination of these by collecting paths in a `.list` document. See the Batch processing section for more information.
- ⚙️ **Supported output formats**: `.json`, `.txt`, `.txt` (annotated), `.srt`, `.webvtt`, `.rttm`
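
For example, a minimal call that overrides the automatic device selection (the file name is a placeholder):

```bash
# Transcribe a single file, forcing CPU processing instead of auto device selection
whisply --files audio.mp3 --device cpu
```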
## Requirements
- FFmpeg
- python3.11
- GPU processing requires an Nvidia GPU (CUDA) or Apple Metal Performance Shaders (MPS) (Mac M1-M3)
- Speaker annotation requires a HuggingFace Access Token
## Installation
1. Install `ffmpeg`:
```bash
# --- macOS ---
brew install ffmpeg

# --- Linux ---
sudo apt-get update
sudo apt-get install ffmpeg

# --- Windows ---
# see https://ffmpeg.org/download.html
```
2. Clone this repository and change to the project folder:
```bash
git clone https://github.com/tsmdt/whisply.git
cd whisply
```
3. Create a Python virtual environment:
```bash
python3.11 -m venv venv
```
4. Activate the Python virtual environment:
```bash
source venv/bin/activate
```
5. Install `whisply` with `pip`:
```bash
pip install .
```
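
To verify the installation, you can print the CLI help or list the supported file types:

```bash
whisply --help
whisply --filetypes
```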
## Usage
```
Usage: whisply [OPTIONS]

  WHISPLY 💬 Transcribe, translate, annotate and subtitle audio and video
  files with OpenAI's Whisper ... fast!

Options:
  --files PATH                 Path to file, folder, URL or .list to process.
  --output_dir DIRECTORY       Folder where transcripts should be saved.
                               Default: "./transcriptions".
  --device [auto|cpu|gpu|mps]  Select the computation device: auto (default),
                               CPU, GPU (NVIDIA CUDA), or MPS (Mac M1-M3).
  --model TEXT                 Select the whisper model to use (Default:
                               large-v2). Refers to whisper model size:
                               https://huggingface.co/collections/openai
  --lang TEXT                  Specify the language of the file(s) you provide
                               (en, de, fr ... Default: auto-detection).
  --annotate                   Enable speaker detection to identify and
                               annotate different speakers. Creates .rttm
                               file.
  --hf_token TEXT              HuggingFace Access token required for speaker
                               detection.
  --translate                  Translate transcription to English.
  --subtitle                   Create .srt and .webvtt subtitles from the
                               transcription.
  --sub_length INTEGER         Subtitle length in words for each subtitle
                               block (Default: 5); e.g. "10" produces
                               subtitles where each individual subtitle block
                               covers exactly 10 words.
  --config FILE                Path to configuration file.
  --filetypes                  List supported audio and video file types.
  --verbose                    Print text chunks during transcription.
  --help                       Show this message and exit.
```
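
For instance, a run combining several of these options (paths and token are placeholders) might look like this:

```bash
# Transcribe a whole folder, annotate speakers and write subtitles
# with 10 words per subtitle block
whisply --files ./my_files/ --model large-v2 --lang en \
        --annotate --hf_token hf_... \
        --subtitle --sub_length 10 \
        --output_dir ./transcriptions
```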
## Speaker annotation
In order to annotate speakers using `--annotate` you need to provide a valid HuggingFace access token with the `--hf_token` option. Additionally, you must accept the terms and conditions for both version 3.0 and version 3.1 of the `pyannote` segmentation model. For detailed instructions, refer to the Requirements section on the pyannote model page on HuggingFace.
Without passing the `--hf_token` option, `whisply` will try to automatically read an existing HuggingFace access token from your shell environment that you have previously exported like this:

```bash
export HF_TOKEN=hf_abcieo...
```
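
With the token exported this way, the `--hf_token` option can be omitted; for example (the file name is a placeholder):

```bash
# HF_TOKEN is read from the shell environment
whisply --files interview.mp4 --annotate
```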
`whisply` uses `whisperX` for speaker diarization and annotation. Instead of returning chunk-level timestamps like the standard `Whisper` implementation, `whisperX` is able to return word-level timestamps as well as annotating speakers word by word, thus returning much more precise annotations. Out of the box, `whisperX` will not provide timestamps for words containing only numbers (e.g. "1.5" or "2024"); `whisply` fixes those instances through timestamp approximation. Other known limitations of `whisperX` include:

- inaccurate speaker diarization if multiple speakers talk at the same time
- to provide word-level timestamps and annotations, `whisperX` uses language-specific alignment models; out of the box, `whisperX` supports these languages: `en, fr, de, es, it, ja, zh, nl, uk, pt`

Refer to the whisperX GitHub page for more information.
## Batch processing
Instead of providing a file, folder or URL via the `--files` option you can pass a `.list` file containing a mix of files, folders and URLs for processing.

Example:

```bash
$ cat my_files.list
video_01.mp4
video_02.mp4
./my_files/
https://youtu.be/KtOayYXEsN4?si=-0MS6KXbEWXA7dqo
```
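
The list can then be passed directly to the `--files` option:

```bash
whisply --files my_files.list
```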
You can provide a `.json` config file via the `--config` option, which makes batch processing easy. An example config looks like this (the `#` comments are explanatory only and must be removed from an actual JSON file):

```
{
  "files": "./files/my_files.list",        # Path to your files
  "output_dir": "./transcriptions",        # Output folder where transcriptions are saved
  "device": "auto",                        # auto, gpu, mps or cpu
  "model": "large-v2",                     # Whisper model to use
  "lang": null,                            # null for auto-detection or language codes ("en", "de", ...)
  "annotate": false,                       # Annotate speakers
  "hf_token": "HuggingFace Access Token",  # Your HuggingFace Access Token (needed for annotations)
  "translate": false,                      # Translate to English
  "subtitle": false,                       # Subtitle file(s)
  "sub_length": 10,                        # Length of each subtitle block in number of words
  "verbose": false                         # Print transcription segments while processing
}
```
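
Assuming the config above is saved as `my_config.json` (a placeholder name), a batch run then reduces to:

```bash
whisply --config my_config.json
```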