A lightweight, transparent overlay application that displays real-time transcriptions of your speech using Whisper AI models on Linux.
The application is currently in very early development and may be unstable, buggy, or prone to crashing.
Contributions are welcome. There are no formal guidelines yet: just check the planned features and known issues, and make sure your changes work on NixOS as well as on other distros!
- Real-Time Transcription: Transcribes your speech in real-time using OpenAI's Whisper models
- Voice Activity Detection: Uses Silero VAD for accurate speech detection
- Transparent Overlay: Non-intrusive overlay that sits at the bottom of your screen
- Audio Visualization: Visual feedback when speaking with a spectrogram display
- Copy/Paste Functionality: Easily copy transcribed text to clipboard
- Pause/Resume Recording: Pause and resume recording at any time via the overlay button or the Space shortcut
- Auto-Start Recording: Begins recording as soon as the application launches
- Scroll Controls: Navigate through longer transcripts
- Configurable: Configure the model, language, and other settings like keyboard shortcuts in the config file (config.json)
- Automatic Model Download: Both Whisper and Silero VAD models are downloaded automatically
- Better error handling: Handle errors gracefully and provide useful error messages
- Improve performance: Lower CPU usage, lower latency, better multi-threaded code
- Better UI: A more polished UI with a focus on usability
- VSYNC: Add VSYNC support for optionally reducing rendered frames
- Input field detection: Automatically detect input fields and transcribe text into them (might be a bit tricky to implement)
- CUDA support: Add support for CUDA to speed up inference on supported GPUs
- Other backends: I want to add other optional backends like Whisper.cpp or even an API (which would greatly increase speed/accuracy at the cost of some latency and maybe your privacy)
- Using a GUI framework: I want to learn more about wgpu and wgsl and think a GUI written from scratch is perfectly fine for this application
- Support for Windows/macOS: Not planned by me personally but if anyone wants to give it a shot feel free
DISCLAIMER: Building from source, installing dependencies, and running the application has only been tested on NixOS; I'm unsure whether it will work on other distributions.
For Debian/Ubuntu-based distributions:
sudo apt install build-essential portaudio19-dev libclang-dev pkg-config wl-clipboard \
libxkbcommon-dev libwayland-dev libx11-dev libxcursor-dev libxi-dev libxrandr-dev \
libasound2-dev libssl-dev libfftw3-dev curl cmake libvulkan-dev
For Fedora/RHEL-based distributions:
sudo dnf install gcc gcc-c++ portaudio-devel clang-devel pkg-config wl-clipboard \
libxkbcommon-devel wayland-devel libX11-devel libXcursor-devel libXi-devel libXrandr-devel \
alsa-lib-devel openssl-devel fftw-devel curl cmake vulkan-loader-devel
For Arch-based distributions:
sudo pacman -S base-devel portaudio clang pkgconf wl-clipboard \
libxkbcommon wayland libx11 libxcursor libxi libxrandr alsa-lib openssl fftw curl cmake \
vulkan-headers vulkan-icd-loader vulkan-tools
For NixOS:
Simply use the provided flake.nix by running
nix develop
while in the root directory of the repository. The flake includes all necessary dependencies including vulkan-loader.
Sonori needs two types of models to function properly:
- Whisper Model - Configured in the config.json file and downloaded automatically on first run
- Silero VAD Model - Also downloaded automatically on first run

Note: If you need to download the Silero model manually for any reason, head to the repo and download the model yourself: https://github.com/snakers4/silero-vad/
Then place it in ~/.cache/sonori/models/
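A manual fetch can be sketched as below. The file name silero_vad.onnx and its path inside the silero-vad repository are assumptions based on that repo's current layout; check the repo (and Sonori's error output, if the model isn't picked up) before relying on them:

```shell
# Hypothetical manual download of the Silero VAD model.
# Verify the file's current location in the silero-vad repo before running.
MODEL_DIR="$HOME/.cache/sonori/models"
mkdir -p "$MODEL_DIR"
curl -fL -o "$MODEL_DIR/silero_vad.onnx" \
  "https://github.com/snakers4/silero-vad/raw/master/src/silero_vad/data/silero_vad.onnx" \
  || echo "download failed: fetch the model manually from the repo"
```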
- ONNX Runtime: Required for the Silero VAD model.
- CTranslate2: Used for Whisper model inference.
- Vulkan: Required for WGPU rendering. Your system must have a working Vulkan installation.
- Install Rust and Cargo (https://rustup.rs/) and make sure the dependencies are installed
- Clone this repository
- Build the application:
cargo build --release
- The executable will be in target/release/sonori
- Launch the application:
./target/release/sonori
- A transparent overlay will appear at the bottom of your screen
- Recording starts automatically
- Speak naturally - your speech will be transcribed in real-time or near real-time (based on the model and hardware)
- Use the buttons on the overlay to:
- Pause/Resume recording
- Copy text to clipboard
- Clear transcript history
- Exit the application
Sonori uses a config.json file in the same directory as the executable. If not present, a default configuration is used.
Example configuration:
{
"model": "openai/whisper-base.en",
"language": "en",
"compute_type": "INT8",
"log_stats_enabled": false,
"buffer_size": 1024,
"sample_rate": 16000,
"whisper_options": {
"beam_size": 5,
"patience": 1.0,
"repetition_penalty": 1.25
},
"vad_config": {
"threshold": 0.2,
"hangbefore_frames": 1,
"hangover_frames": 15,
"max_buffer_duration_sec": 30.0,
"max_segment_count": 20
},
"audio_processor_config": {
"max_vis_samples": 1024
},
"keyboard_shortcuts": {
"copy_transcript": "KeyC",
"reset_transcript": "KeyR",
"quit_application": "KeyQ",
"toggle_recording": "Space",
"exit_application": "Escape"
}
}
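If you edit config.json by hand, a quick validity check before launching saves a round-trip. The sketch below writes a minimal config and validates it with python3 (any JSON validator works). Whether Sonori fills in defaults for omitted fields is an assumption here, so include every field you care about:

```shell
# Write a minimal config.json next to the sonori binary.
# Assumption: fields omitted here fall back to Sonori's built-in defaults.
cat > config.json <<'EOF'
{
  "model": "openai/whisper-tiny.en",
  "language": "en"
}
EOF

# json.tool exits non-zero on syntax errors, catching typos before launch
python3 -m json.tool config.json > /dev/null && echo "config.json is valid JSON"
```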
You can customize the keyboard shortcuts used in the application by editing the keyboard_shortcuts section in the config.json file. The default shortcuts are:
- copy_transcript: KeyC (Ctrl+C) - Copy the transcription to clipboard
- reset_transcript: KeyR (Ctrl+R) - Clear the current transcript
- toggle_recording: Space - Toggle recording on/off
- exit_application: Escape - Exit the application
When specifying keys, use the key names from the KeyCode enum in winit, such as:
- Letter keys: KeyA, KeyB, KeyC, etc.
- Number keys: Digit0, Digit1, etc.
- Function keys: F1, F2, etc.
- Special keys: Space, Escape, Enter, Tab, etc.
Note: The Ctrl modifier is automatically applied to the copy_transcript and reset_transcript shortcuts.
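For example, to copy with Ctrl+Y and toggle recording with Enter, the section could look like this (the remapped keys are purely illustrative; any winit KeyCode name works):

```json
"keyboard_shortcuts": {
  "copy_transcript": "KeyY",
  "reset_transcript": "KeyR",
  "quit_application": "KeyQ",
  "toggle_recording": "Enter",
  "exit_application": "Escape"
}
```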
Recommended local Whisper models:
- openai/whisper-tiny.en - Tiny model, English only (for low-end CPUs)
- openai/whisper-base.en - Base model, English only (default, for low to mid-range CPUs)
- distil-whisper/distil-small.en - Small model, English only (for mid to high-range CPUs)
- distil-whisper/distil-medium.en - Medium model, English only (for high-end CPUs only)
- Any other, larger Whisper model - probably too slow for real-time use on CPU alone

For non-English languages, use the multilingual models (without the .en suffix) and set the appropriate language code in the configuration.
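For instance, a German setup would drop the .en suffix and set the language code (assuming Sonori takes ISO 639-1 codes, as the "en" default suggests; only the relevant fields are shown):

```json
{
  "model": "openai/whisper-base",
  "language": "de"
}
```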
- The application might not work with all Wayland compositors (I only tested it with KDE Plasma and KWin).
- The transcriptions are not 100% accurate and may contain errors; accuracy depends largely on the Whisper model in use.
- Sometimes the last word of a "segment" is cut off. This is probably an issue with processing the audio data.
- CPU usage is higher than it should be, even when idle. This may stem from inefficiencies in my code or from model overhead; adjusting the buffer size can help (or make things worse).
Sonori uses layer shell protocol for Wayland compositors. If you experience issues:
- Make sure you are in a Wayland session and that your compositor supports the layer shell protocol
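The first condition can be checked from a terminal; the environment variables below are standard in Wayland sessions, though layer shell support still has to be confirmed against your compositor's documentation:

```shell
# In a Wayland session, both variables should be non-empty
echo "session type:    ${XDG_SESSION_TYPE:-unset}"    # expect "wayland"
echo "wayland display: ${WAYLAND_DISPLAY:-unset}"     # e.g. "wayland-0"
```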
Sonori uses WGPU for rendering, which requires Vulkan support. If you encounter errors related to adapter detection or Vulkan:
- Ensure you have the Vulkan libraries installed for your distribution (see Dependencies section)
- Verify that your GPU supports Vulkan and that drivers are properly installed
- On some systems, you may need to install additional vendor-specific Vulkan packages (e.g., mesa-vulkan-drivers on Ubuntu/Debian)
- You can test Vulkan support by running vulkaninfo or vkcube if available on your system
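A defensive check can be scripted as below; the --summary flag is supported by recent vulkaninfo builds, so drop it if your version predates it:

```shell
# Quick Vulkan sanity check; vulkaninfo ships with vulkan-tools on most distros
if command -v vulkaninfo > /dev/null 2>&1; then
    vulkaninfo --summary | head -n 10
else
    echo "vulkaninfo not found: install vulkan-tools (or your distro's equivalent)"
fi
```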
If you encounter issues with automatic model conversion:
For NixOS:
nix-shell model-conversion/shell.nix
ct2-transformers-converter --model your-model --output_dir ~/.cache/whisper/your-model --copy_files preprocessor_config.json tokenizer.json
For other distributions:
pip install -U ctranslate2 huggingface_hub torch transformers
ct2-transformers-converter --model your-model --output_dir ~/.cache/whisper/your-model --copy_files preprocessor_config.json tokenizer.json
- Linux: Supported (tested on Wayland using KDE Plasma and KWin)
- Windows/macOS: Not officially supported or tested