Hi Nicolay, all
That's not an issue, just a question/discussion for you/everyone about the proposed architecture.
Preamble about latencies
Vosk decoding latencies are very fast! On my PC, for short (few-word) utterance transcripts I got:
Using a grammar-capable model (e.g. the pretrained model-small-en-us-0.15):
If I DO NOT specify any grammar, I get a latency of ~500-600 msecs
If I DO specify a grammar (even a pretty long one), I get a few tens of msecs (<< 100 msecs)
Using a large / static-graph model (e.g. vosk-model-en-us-aspire-0.2), I got a latency of ~400-500 msecs (with better accuracy for open-domain utterances).
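Just to clarify how such figures can be measured, here is a minimal timing sketch using the Python binding (the model path, the test.wav file and the grammar phrases are only placeholders I made up), decoding one short utterance with and without a grammar:

```python
import json
import time
import wave

from vosk import Model, KaldiRecognizer

model = Model("model-small-en-us-0.15")

# Placeholder test file: one short utterance, 16 kHz mono PCM WAV
wf = wave.open("test.wav", "rb")
audio = wf.readframes(wf.getnframes())

def timed_decode(rec, data):
    # Feed the whole utterance and time the full decode
    start = time.perf_counter()
    rec.AcceptWaveform(data)
    text = json.loads(rec.FinalResult()).get("text", "")
    return (time.perf_counter() - start) * 1000, text

# Open vocabulary (no grammar): ~500-600 msecs on my PC
rec_free = KaldiRecognizer(model, wf.getframerate())
print("no grammar  :", timed_decode(rec_free, audio))

# Grammar-restricted: a few tens of msecs
grammar = json.dumps(["turn on the light", "turn off the light", "[unk]"])
rec_grammar = KaldiRecognizer(model, wf.getframerate(), grammar)
print("with grammar:", timed_decode(rec_grammar, audio))
```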
Proposed architecture
Now, considering a stateful (task-oriented, closed-domain) voice-assistant platform, I want to experiment with how much I can reduce latencies with a stateful ASR. My idea is to connect the Vosk ASR with a state-based dialog manager (such as my own open-source NaifJs).
Workflow:
Initialization phase:
to load a model that allows grammars (e.g. model-small-en-us-0.15)
to prepare/create N different Vosk Recognizers, one per grammar(N) (one grammar for each state(N))
Run-time phase (decoding time):
a "Decode Manager" decides which Recognizer is to be used, depending on the state injected by the dialog manager
the Decode Manager could use a fallback Recognizer, based on the original model without a specified grammar, for a final decision (see the sketch just below)
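In code, the idea would be something along these lines (a rough Python sketch; the state names, grammars and model path are invented placeholders, and the real dialog states would come from NaifJs):

```python
import json

from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000

# Placeholder per-state grammars; in the real system they would be
# generated from the dialog manager's state definitions
STATE_GRAMMARS = {
    "ask_confirmation": ["yes", "no", "maybe", "[unk]"],
    "ask_quantity": ["one", "two", "three", "four", "five", "[unk]"],
}

class DecodeManager:
    """Pre-creates one grammar-restricted Recognizer per dialog state,
    plus an open-vocabulary fallback, and picks one at decode time."""

    def __init__(self, model_path="model-small-en-us-0.15"):
        model = Model(model_path)
        # Initialization phase: one Recognizer per state(N) / grammar(N)
        self.recognizers = {
            state: KaldiRecognizer(model, SAMPLE_RATE, json.dumps(phrases))
            for state, phrases in STATE_GRAMMARS.items()
        }
        # Fallback: same model, no grammar, for a final open-domain decision
        self.fallback = KaldiRecognizer(model, SAMPLE_RATE)

    def decode(self, state, audio_bytes):
        # Run-time phase: the dialog manager injects the current state
        rec = self.recognizers.get(state, self.fallback)
        rec.AcceptWaveform(audio_bytes)
        return json.loads(rec.FinalResult()).get("text", "")
```

Then, at each dialog turn, the dialog manager would simply call decode(current_state, audio), falling back to the open-vocabulary recognizer when the state has no grammar.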
Questions:
Is there any drawback in creating all the Vosk Recognizers at init time (and deleting them when the server program exits)?
BTW, that approach would minimize the new Recognizer creation time, even if I noticed this partial latency is really low (a few msecs) when a grammar is specified, whereas it increases to many tens of msecs if a grammar is NOT specified.
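That creation time can be checked with a trivial sketch like this (again Python, placeholder grammar; the comments just reflect the numbers I observed above):

```python
import json
import time

from vosk import Model, KaldiRecognizer

model = Model("model-small-en-us-0.15")

t0 = time.perf_counter()
KaldiRecognizer(model, 16000)  # no grammar: many tens of msecs in my tests
t1 = time.perf_counter()
KaldiRecognizer(model, 16000, json.dumps(["yes", "no", "[unk]"]))  # a few msecs
t2 = time.perf_counter()

print(f"no grammar:   {(t1 - t0) * 1000:.1f} ms")
print(f"with grammar: {(t2 - t1) * 1000:.1f} ms")
```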
Does this architecture make sense? Any suggestion / alternative approach is very welcome.
Thanks
Giorgio
The architecture is ok, it is not a problem to pre-init the recognizers.
In general we don't recommend using voice for something that requires immediate feedback; much more reasonable ways exist for that. You'd better use voice for something that is more free-form. So I think a latency of 0.5 s is reasonable. Improvement requires debugging.
Yes, in my quick tests Vosk's speed (0.5 s) is almost double that of Coqui STT (~DeepSpeech), using a comparable large model I believe.
I would like to experiment with / debug this kind of architecture, as you suggest, considering that maybe it is not just about the "realtime" latency as an absolute metric. It's also about setting up a multiuser server architecture, considering the trade-off between maximizing "performance" and minimizing resources.