Chiron - Part 5: Local Speech to Text
slug: chiron-stt
description: Getting Chiron to listen
published: 2024-11-21
As I mentioned previously, sherpa-onnx's TTS capabilities were enough to satisfy that requirement. Its speech-to-text, while it did give me some insight into how the ONNX versions of the whisper and distil-whisper models performed at different sizes (tiny and small) on my machine, was edged out by a different library that did just a bit better.
CTranslate2
CTranslate2 emerged as a promising solution, primarily because it's a production-ready inference engine originally designed for neural machine translation models. What caught my attention was its ability to run optimized Transformer models efficiently on CPU. The library provides several key advantages:
Optimized inference on CPU with INT8/INT16 quantization
Significantly reduced memory usage compared to PyTorch models
Fast model loading and efficient batch processing
Support for multiple model formats, including Whisper
The library's focus on CPU optimization aligned perfectly with my project's requirements, making it a strong candidate for the STT component.
Faster-Whisper
Building on CTranslate2's foundation, Faster-Whisper provides a streamlined implementation specifically for Whisper models. It brought to the table:
Optimized implementation of OpenAI's Whisper model
4x faster inference than OpenAI's implementation (at the time)
Significantly lower memory usage
Easy integration with CTranslate2's optimizations
Faster-Whisper-Server
I initially explored Faster-Whisper-Server as a sidecar approach, where the Python-based server would run alongside the main application. However, this approach presented its own hurdles. The requirement to bundle a Python environment and interpreter with the application would have drastically increased the distribution size. I attempted to mitigate this by containerizing the service with Docker, but this introduced new problems - namely, unacceptably long startup and shutdown times as the application struggled to manage the Docker container lifecycle.
These practical deployment concerns, combined with the need for tight integration with the application's audio pipeline, led me to seek a more direct solution.
CTranslate2-rs
I ultimately settled on CTranslate2-rs, a Rust binding for CTranslate2 that gave me native access to the inference engine. I structured my implementation around a multi-stage pipeline designed to handle the complexities of real-time speech recognition.
For the first stage, I focused on audio preprocessing. I found that raw audio input needed careful handling - I added silence padding at the beginning of recordings to prevent word clipping, and implemented proper resampling to ensure the audio met the model's expected format. These preprocessing steps dramatically improved my transcription accuracy, especially for the first few words of each recording.
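A rough sketch of this stage, in plain Rust, looks something like the following. The helper names, the 300 ms padding length, and the naive linear resampler are illustrative choices rather than the exact ones in the app; the 16 kHz mono target is simply what Whisper-family models expect.

```rust
/// Prepend `pad_ms` of silence so the model doesn't clip the first word.
fn pad_with_silence(samples: &[f32], sample_rate: u32, pad_ms: u32) -> Vec<f32> {
    let pad_len = (sample_rate as u64 * pad_ms as u64 / 1000) as usize;
    let mut out = vec![0.0f32; pad_len];
    out.extend_from_slice(samples);
    out
}

/// Naive linear-interpolation resampler from `from_rate` to `to_rate` (mono).
fn resample(samples: &[f32], from_rate: u32, to_rate: u32) -> Vec<f32> {
    if from_rate == to_rate || samples.is_empty() {
        return samples.to_vec();
    }
    let ratio = from_rate as f64 / to_rate as f64;
    let out_len = (samples.len() as f64 / ratio).floor() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = samples[idx];
            let b = samples[(idx + 1).min(samples.len() - 1)];
            a + (b - a) * frac
        })
        .collect()
}

/// Full preprocessing pass: pad leading silence, then convert to 16 kHz mono.
fn preprocess(raw: &[f32], input_rate: u32) -> Vec<f32> {
    let padded = pad_with_silence(raw, input_rate, 300);
    resample(&padded, input_rate, 16_000)
}
```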
I handled the transcription stage by running it in a dedicated thread to prevent UI blocking. This asynchronous approach was crucial for maintaining application responsiveness during longer transcriptions. I designed the system using a message-passing architecture to communicate between the audio recording, transcription, and UI components, which gave me clean separation of concerns while maintaining efficient data flow.
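In sketch form, that architecture boils down to standard-library channels and a worker thread. The message variants and the transcribe stand-in below are placeholders for the real CTranslate2-rs call, not the names used in Chiron.

```rust
use std::sync::mpsc;
use std::thread;

/// Messages flowing from the transcription worker back to the UI layer.
enum TranscriptionEvent {
    Started,
    Finished(String),
    Failed(String),
}

/// Spawn a dedicated worker so transcription never blocks the UI thread.
fn spawn_transcriber(
    audio_rx: mpsc::Receiver<Vec<f32>>,
    ui_tx: mpsc::Sender<TranscriptionEvent>,
) {
    thread::spawn(move || {
        // Each received buffer is one finished recording, already preprocessed.
        for samples in audio_rx {
            let _ = ui_tx.send(TranscriptionEvent::Started);
            match transcribe(&samples) {
                Ok(text) => {
                    let _ = ui_tx.send(TranscriptionEvent::Finished(text));
                }
                Err(err) => {
                    let _ = ui_tx.send(TranscriptionEvent::Failed(err));
                }
            }
        }
    });
}

/// Stand-in for the actual CTranslate2-rs inference call.
fn transcribe(_samples: &[f32]) -> Result<String, String> {
    Ok(String::from("(transcribed text)"))
}
```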
I also needed to consider several practical aspects for production use. I implemented automatic model management to handle downloading and initialization, along with a cleanup system, similar to the one I built for the text-to-speech functionality, that removes audio files both after processing and during app shutdown.
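A simplified version of that housekeeping might look like the sketch below. The directory layout and file extension are assumptions for illustration; the `model.bin` presence check reflects how CTranslate2 lays out converted model directories.

```rust
use std::fs;
use std::path::Path;

/// Sweep leftover .wav recordings out of the app's temporary audio directory.
/// Run after each transcription and again during shutdown.
fn cleanup_audio_dir(dir: &Path) -> std::io::Result<()> {
    if !dir.exists() {
        return Ok(());
    }
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|ext| ext.to_str()) == Some("wav") {
            // Ignore per-file failures so one locked file can't abort the sweep.
            let _ = fs::remove_file(&path);
        }
    }
    Ok(())
}

/// Decide whether the converted Whisper model still needs to be downloaded:
/// a CTranslate2 model directory contains `model.bin` alongside its config
/// and vocabulary files, so its presence is a cheap readiness check.
fn model_is_ready(model_dir: &Path) -> bool {
    model_dir.join("model.bin").exists()
}
```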
One of the more interesting aspects of my implementation was the post-processing pipeline for keyword standardization. I added a simple but effective text replacement layer to ensure certain keywords, particularly the assistant's name, would always be spelled consistently regardless of the model's output. This lightweight solution fit well with my goal of maintaining minimal resource usage while ensuring a polished user experience.
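In sketch form, that layer is just a case-insensitive pass over a small table of known misrecognitions. The variant list below is a made-up example rather than the one shipped in the app.

```rust
/// Normalize common misrecognitions of a keyword to its canonical spelling.
/// Matching is case-insensitive and purely textual.
fn standardize_keyword(text: &str, variants: &[&str], canonical: &str) -> String {
    let mut result = String::with_capacity(text.len());
    for word in text.split_whitespace() {
        // Strip trailing punctuation so "kyron," still matches "kyron".
        let trimmed = word.trim_end_matches(|c: char| c.is_ascii_punctuation());
        let suffix = &word[trimmed.len()..];
        if variants.iter().any(|v| v.eq_ignore_ascii_case(trimmed)) {
            result.push_str(canonical);
            result.push_str(suffix);
        } else {
            result.push_str(word);
        }
        result.push(' ');
    }
    result.trim_end().to_string()
}

fn main() {
    let fixed = standardize_keyword(
        "hey kyron, what's the weather?",
        &["kyron", "chyron", "kiron"],
        "Chiron",
    );
    assert_eq!(fixed, "hey Chiron, what's the weather?");
}
```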
The final result is a robust speech-to-text solution that runs entirely locally on CPU, providing fast and accurate transcription while maintaining a small resource footprint. My direct integration through Rust bindings achieved the original goals of efficiency and maintainability, without the complexity of managing external services or environments.