A lightweight, transparent overlay application that displays real-time transcriptions of your speech using Whisper AI models on Linux.
The application is currently in very early development and may be unstable, buggy, or prone to crashing.
Contributions are welcome. There are no formal guidelines yet: just check the planned features and known issues, and make sure your changes work on NixOS as well as on other distros!
- Real-Time Transcription: Transcribes your speech in real-time using OpenAI's Whisper models
- Voice Activity Detection: Uses Silero VAD for accurate speech detection
- Transparent Overlay: Non-intrusive overlay that sits at the bottom of your screen
- Audio Visualization: Visual feedback when speaking with a spectrogram display
- Copy/Paste Functionality: Easily copy transcribed text to clipboard
- Pause/Resume Recording: Pause and resume recording at any time via the overlay button or the Space shortcut
- Auto-Start Recording: Begins recording as soon as the application launches
- Scroll Controls: Navigate through longer transcripts
- Configurable: Configure the model, language, and other settings like keyboard shortcuts in the config file (config.json)
- Automatic Model Download: Both Whisper and Silero VAD models are downloaded automatically
- Better error handling: Handle errors gracefully and provide useful error messages
- Improve performance: Lower CPU usage, lower latency, better multi-threaded code
- Better UI: A more polished UI with a focus on usability
- VSYNC: Add VSYNC support for optionally reducing rendered frames
- Input field detection: Automatically detect input fields and transcribe text into them (might be a bit tricky to implement)
- CUDA support: Add support for CUDA to speed up inference on supported GPUs
- Other backends: I want to add other optional backends like Whisper.cpp or even an API (which would greatly increase speed/accuracy at the cost of some latency and maybe your privacy)
- Using a GUI framework: I want to learn more about wgpu and wgsl and think a GUI written from scratch is perfectly fine for this application
- Support for Windows/macOS: Not planned by me personally but if anyone wants to give it a shot feel free
DISCLAIMER: Building from source, installing dependencies, and running the application has only been tested on NixOS; I'm unsure whether it will work on other distributions.
For Debian/Ubuntu-based distributions:
sudo apt install build-essential portaudio19-dev libclang-dev pkg-config wl-clipboard \
libxkbcommon-dev libwayland-dev libx11-dev libxcursor-dev libxi-dev libxrandr-dev \
libasound2-dev libssl-dev libfftw3-dev curl cmake libvulkan-dev
For Fedora/RHEL-based distributions:
sudo dnf install gcc gcc-c++ portaudio-devel clang-devel pkg-config wl-clipboard \
libxkbcommon-devel wayland-devel libX11-devel libXcursor-devel libXi-devel libXrandr-devel \
alsa-lib-devel openssl-devel fftw-devel curl cmake vulkan-loader-devel
For Arch-based distributions:
sudo pacman -S base-devel portaudio clang pkgconf wl-clipboard \
libxkbcommon wayland libx11 libxcursor libxi libxrandr alsa-lib openssl fftw curl cmake \
vulkan-headers vulkan-icd-loader vulkan-tools
For NixOS:
Simply use the provided flake.nix by running
nix develop
while in the root directory of the repository. The flake includes all necessary dependencies including vulkan-loader.
Sonori needs two types of models to function properly:
- Whisper Model - Configured in the config.json file and downloaded automatically on first run
- Silero VAD Model - Also downloaded automatically on first run

Note: If you need to download the Silero model manually for any reason, head to the repo and download the model yourself: https://github.com/snakers4/silero-vad/
Then place it in ~/.cache/sonori/models/
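A manual fetch can be sketched as below. The file name silero_vad.onnx and its path inside the silero-vad repository are assumptions based on that repo's current layout; check the repo (and Sonori's error output, if the model isn't picked up) before relying on them:

```shell
# Hypothetical manual download of the Silero VAD model.
# Verify the file's current location in the silero-vad repo before running.
MODEL_DIR="$HOME/.cache/sonori/models"
mkdir -p "$MODEL_DIR"
curl -fL -o "$MODEL_DIR/silero_vad.onnx" \
  "https://github.com/snakers4/silero-vad/raw/master/src/silero_vad/data/silero_vad.onnx" \
  || echo "download failed: fetch the model manually from the repo"
```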
- ONNX Runtime: Required for the Silero VAD model.
- CTranslate2: Used for Whisper model inference.
- Vulkan: Required for WGPU rendering. Your system must have a working Vulkan installation.
- Install Rust and Cargo (https://rustup.rs/) and make sure the dependencies are installed
- Clone this repository
- Build the application:
cargo build --release
- The executable will be in target/release/sonori
- Launch the application:
./target/release/sonori
- A transparent overlay will appear at the bottom of your screen
- Recording starts automatically
- Speak naturally - your speech will be transcribed in real-time or near real-time (based on the model and hardware)
- Use the buttons on the overlay to:
- Pause/Resume recording
- Copy text to clipboard
- Clear transcript history
- Exit the application
Sonori uses a config.json file in the same directory as the executable. If not present, a default configuration is used.
Example configuration:
{
"model": "openai/whisper-base.en",
"language": "en",
"compute_type": "INT8",
"log_stats_enabled": false,
"buffer_size": 1024,
"sample_rate": 16000,
"whisper_options": {
"beam_size": 5,
"patience": 1.0,
"repetition_penalty": 1.25
},
"vad_config": {
"threshold": 0.2,
"hangbefore_frames": 1,
"hangover_frames": 15,
"max_buffer_duration_sec": 30.0,
"max_segment_count": 20
},
"audio_processor_config": {
"max_vis_samples": 1024
},
"keyboard_shortcuts": {
"copy_transcript": "KeyC",
"reset_transcript": "KeyR",
"quit_application": "KeyQ",
"toggle_recording": "Space",
"exit_application": "Escape"
}
}
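If you edit config.json by hand, a quick validity check before launching saves a round-trip. The sketch below writes a minimal config and validates it with python3 (any JSON validator works). Whether Sonori fills in defaults for omitted fields is an assumption here, so include every field you care about:

```shell
# Write a minimal config.json next to the sonori binary.
# Assumption: fields omitted here fall back to Sonori's built-in defaults.
cat > config.json <<'EOF'
{
  "model": "openai/whisper-tiny.en",
  "language": "en"
}
EOF

# json.tool exits non-zero on syntax errors, catching typos before launch
python3 -m json.tool config.json > /dev/null && echo "config.json is valid JSON"
```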
You can customize the keyboard shortcuts used in the application by editing the keyboard_shortcuts section in the config.json file. The default shortcuts are:
- copy_transcript: KeyC (Ctrl+C) - Copy the transcription to clipboard
- reset_transcript: KeyR (Ctrl+R) - Clear the current transcript
- toggle_recording: Space - Toggle recording on/off
- exit_application: Escape - Exit the application
When specifying keys, use the key names from the KeyCode enum in winit, such as:
- Letter keys: KeyA, KeyB, KeyC, etc.
- Number keys: Digit0, Digit1, etc.
- Function keys: F1, F2, etc.
- Special keys: Space, Escape, Enter, Tab, etc.
Note: The Ctrl modifier is automatically applied to the copy_transcript and reset_transcript shortcuts.
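For example, to copy with Ctrl+Y and toggle recording with Enter, the section could look like this (the remapped keys are purely illustrative; any winit KeyCode name works):

```json
"keyboard_shortcuts": {
  "copy_transcript": "KeyY",
  "reset_transcript": "KeyR",
  "quit_application": "KeyQ",
  "toggle_recording": "Enter",
  "exit_application": "Escape"
}
```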
Recommended local Whisper models:
- openai/whisper-tiny.en - Tiny model, English only (for low-end CPUs)
- openai/whisper-base.en - Base model, English only (default, for low to mid-range CPUs)
- distil-whisper/distil-small.en - Small model, English only (for mid to high-range CPUs)
- distil-whisper/distil-medium.en - Medium model, English only (for high-end CPUs only)
- Any other, larger Whisper model - probably too slow for real-time use on CPU alone

For non-English languages, use the multilingual models (without the .en suffix) and set the appropriate language code in the configuration.
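For instance, a German setup would drop the .en suffix and set the language code (assuming Sonori takes ISO 639-1 codes, as the "en" default suggests; only the relevant fields are shown):

```json
{
  "model": "openai/whisper-base",
  "language": "de"
}
```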
- The application might not work with all Wayland compositors (I only tested it with KDE Plasma and KWin).
- The transcriptions are not 100% accurate and may contain errors; accuracy depends largely on the Whisper model in use.
- Sometimes the last word of a "segment" is cut off. This is probably an issue with processing the audio data.
- CPU usage is higher than it should be, even when idle. This may stem from inefficiencies in my code or from model overhead; adjusting the buffer size can help (or make things worse).
Sonori uses layer shell protocol for Wayland compositors. If you experience issues:
- Make sure you are in a Wayland session and that your compositor supports the layer shell protocol
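The first condition can be checked from a terminal; the environment variables below are standard in Wayland sessions, though layer shell support still has to be confirmed against your compositor's documentation:

```shell
# In a Wayland session, both variables should be non-empty
echo "session type:    ${XDG_SESSION_TYPE:-unset}"    # expect "wayland"
echo "wayland display: ${WAYLAND_DISPLAY:-unset}"     # e.g. "wayland-0"
```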
Sonori uses WGPU for rendering, which requires Vulkan support. If you encounter errors related to adapter detection or Vulkan:
- Ensure you have the Vulkan libraries installed for your distribution (see Dependencies section)
- Verify that your GPU supports Vulkan and that drivers are properly installed
- On some systems, you may need to install additional vendor-specific Vulkan packages (e.g., mesa-vulkan-drivers on Ubuntu/Debian)
- You can test Vulkan support by running vulkaninfo or vkcube if available on your system
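A defensive check can be scripted as below; the --summary flag is supported by recent vulkaninfo builds, so drop it if your version predates it:

```shell
# Quick Vulkan sanity check; vulkaninfo ships with vulkan-tools on most distros
if command -v vulkaninfo > /dev/null 2>&1; then
    vulkaninfo --summary | head -n 10
else
    echo "vulkaninfo not found: install vulkan-tools (or your distro's equivalent)"
fi
```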
If you encounter issues with automatic model conversion:
For NixOS:
nix-shell model-conversion/shell.nix
ct2-transformers-converter --model your-model --output_dir ~/.cache/whisper/your-model --copy_files preprocessor_config.json tokenizer.json
For other distributions:
pip install -U ctranslate2 huggingface_hub torch transformers
ct2-transformers-converter --model your-model --output_dir ~/.cache/whisper/your-model --copy_files preprocessor_config.json tokenizer.json
- Linux: Supported (tested on Wayland using KDE Plasma and KWin)
- Windows/macOS: Not officially supported or tested