Hi Nicolay, all
That's not an issue, just a question/discussion for you/everyone about the proposed architecture.
Preamble about latencies
Vosk decoding latencies are very fast! On my PC, for short (few-word) utterance transcripts I got:
Using a grammar-capable model (e.g. the pretrained model-small-en-us-0.15):
If I DO NOT specify any grammar, I get a latency of ~500-600 msecs
If I DO specify a grammar (even a pretty long one), I get a few tens of msecs (<< 100 msecs)
Using a large / static-graph model (e.g. vosk-model-en-us-aspire-0.2), I got a latency of ~400-500 msecs (with better accuracy for open-domain utterances).
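Just to clarify how such figures can be measured, here is a minimal timing sketch using the Python binding (the model path, the test.wav file and the grammar phrases are only placeholders I made up), decoding one short utterance with and without a grammar:

```python
import json
import time
import wave

from vosk import Model, KaldiRecognizer

model = Model("model-small-en-us-0.15")

# Placeholder test file: one short utterance, 16 kHz mono PCM WAV
wf = wave.open("test.wav", "rb")
audio = wf.readframes(wf.getnframes())

def timed_decode(rec, data):
    # Feed the whole utterance and time the full decode
    start = time.perf_counter()
    rec.AcceptWaveform(data)
    text = json.loads(rec.FinalResult()).get("text", "")
    return (time.perf_counter() - start) * 1000, text

# Open vocabulary (no grammar): ~500-600 msecs on my PC
rec_free = KaldiRecognizer(model, wf.getframerate())
print("no grammar  :", timed_decode(rec_free, audio))

# Grammar-restricted: a few tens of msecs
grammar = json.dumps(["turn on the light", "turn off the light", "[unk]"])
rec_grammar = KaldiRecognizer(model, wf.getframerate(), grammar)
print("with grammar:", timed_decode(rec_grammar, audio))
```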
Proposed architecture
Now, considering a stateful (task-oriented, closed-domain) voice-assistant platform, I want to experiment with how much I can reduce latencies with a stateful ASR. My idea is to connect the Vosk ASR with a state-based dialog manager (such as my own open-source NaifJs).
Workflow:
Initialization phase:
to load a model that allows grammars (e.g. model-small-en-us-0.15)
to prepare/create N different Vosk Recognizers, one per grammar(N) (one grammar for each state(N))
Run-time phase (decoding time):
a "Decode Manager" decides which Recognizer is to be used, depending on the state injected by the dialog manager
the Decode Manager could use a fallback Recognizer, based on the original model without a specified grammar, for a final decision (see the sketch just below)
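In code, the idea would be something along these lines (a rough Python sketch; the state names, grammars and model path are invented placeholders, and the real dialog states would come from NaifJs):

```python
import json

from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000

# Placeholder per-state grammars; in the real system they would be
# generated from the dialog manager's state definitions
STATE_GRAMMARS = {
    "ask_confirmation": ["yes", "no", "maybe", "[unk]"],
    "ask_quantity": ["one", "two", "three", "four", "five", "[unk]"],
}

class DecodeManager:
    """Pre-creates one grammar-restricted Recognizer per dialog state,
    plus an open-vocabulary fallback, and picks one at decode time."""

    def __init__(self, model_path="model-small-en-us-0.15"):
        model = Model(model_path)
        # Initialization phase: one Recognizer per state(N) / grammar(N)
        self.recognizers = {
            state: KaldiRecognizer(model, SAMPLE_RATE, json.dumps(phrases))
            for state, phrases in STATE_GRAMMARS.items()
        }
        # Fallback: same model, no grammar, for a final open-domain decision
        self.fallback = KaldiRecognizer(model, SAMPLE_RATE)

    def decode(self, state, audio_bytes):
        # Run-time phase: the dialog manager injects the current state
        rec = self.recognizers.get(state, self.fallback)
        rec.AcceptWaveform(audio_bytes)
        return json.loads(rec.FinalResult()).get("text", "")
```

Then, at each dialog turn, the dialog manager would simply call decode(current_state, audio), falling back to the open-vocabulary recognizer when the state has no grammar.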
Questions:
Is there any drawback in creating all the Vosk Recognizers at init time (and deleting them when the server program exits)?
BTW, that approach would minimize the new Recognizer creation time, even if I noticed this partial latency is really low (a few msecs) when a grammar is specified, whereas it increases to many tens of msecs if a grammar is NOT specified.
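That creation time can be checked with a trivial sketch like this (again Python, placeholder grammar; the comments just reflect the numbers I observed above):

```python
import json
import time

from vosk import Model, KaldiRecognizer

model = Model("model-small-en-us-0.15")

t0 = time.perf_counter()
KaldiRecognizer(model, 16000)  # no grammar: many tens of msecs in my tests
t1 = time.perf_counter()
KaldiRecognizer(model, 16000, json.dumps(["yes", "no", "[unk]"]))  # a few msecs
t2 = time.perf_counter()

print(f"no grammar:   {(t1 - t0) * 1000:.1f} ms")
print(f"with grammar: {(t2 - t1) * 1000:.1f} ms")
```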
Does this architecture make sense? Any suggestion / alternative approach is very welcome.
Thanks
Giorgio
The architecture is ok, it is not a problem to pre-init the recognizers.
In general we don't recommend using voice for something that requires immediate feedback; much more reasonable ways exist for that. You'd better use voice for something that is more free-form. So I think a latency of 0.5 s is reasonable. Improvement requires debugging.
Yes, in my quick tests Vosk's speed (0.5 s) is almost double that of Coqui STT (~DeepSpeech), using a comparable large model I believe.
I would like to experiment with / debug this kind of architecture, as you suggest, considering that maybe it is not just about the "realtime" latency as an absolute metric. It's also about setting up a multiuser server architecture, considering the trade-off between maximizing "performance" and minimizing resources.