Stateful & low latency ASR architecture #553

Closed
solyarisoftware opened this issue May 24, 2021 · 2 comments

@solyarisoftware

solyarisoftware commented May 24, 2021

Hi Nicolay, all

This is not a bug report, just a question/discussion for you and everyone else about the architecture proposed below.

Preamble about latencies
Vosk decoding latencies are very low! On my PC, when transcribing short utterances (a few words), I measured:

  1. Using grammar-based models (e.g. the pretrained model model-small-en-us-0.15)
    • If I DO NOT specify any grammar, I get a latency of ~500-600 msecs
    • If I DO specify a grammar (even a fairly long one), I get a few tens of msecs (<< 100 msecs)
  2. Using a large / static graph model (e.g. vosk-model-en-us-aspire-0.2), I get ~400-500 msecs of latency (with better accuracy for open-domain utterances).
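
For anyone who wants to reproduce numbers like these, here is a minimal timing sketch. It is not the script I used for the figures above; `measure_latency_ms` and `make_vosk_decoder` are hypothetical helper names, and the 16 kHz sample rate is an assumption. Only `Model`, `KaldiRecognizer`, `AcceptWaveform` and `FinalResult` come from the Vosk Python API.

```python
import time
from statistics import median
from typing import Callable, List, Optional


def measure_latency_ms(decode: Callable[[bytes], str], audio: bytes, runs: int = 5) -> float:
    """Return the median wall-clock time (in ms) of decode(audio) over several runs."""
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        decode(audio)
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


def make_vosk_decoder(model_path: str, sample_rate: int = 16000,
                      grammar_json: Optional[str] = None) -> Callable[[bytes], str]:
    """Hypothetical helper: build a decode(audio) function on top of Vosk."""
    from vosk import Model, KaldiRecognizer  # imported lazily; requires `pip install vosk`

    model = Model(model_path)

    def decode(audio: bytes) -> str:
        # A new recognizer is created per call, so the measured time includes
        # recognizer creation. Build it outside decode() to time decoding alone.
        if grammar_json is None:
            rec = KaldiRecognizer(model, sample_rate)
        else:
            rec = KaldiRecognizer(model, sample_rate, grammar_json)
        rec.AcceptWaveform(audio)
        return rec.FinalResult()  # JSON string

    return decode
```

Usage would look like `measure_latency_ms(make_vosk_decoder("model-small-en-us-0.15", grammar_json='["yes", "no", "[unk]"]'), pcm_bytes)`, where `pcm_bytes` is 16-bit mono PCM audio.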

Proposed architecture
Now, considering a stateful (task-oriented, closed-domain) voice-assistant platform, I want to experiment with how much I can reduce latency by using a stateful ASR. My idea is to connect the Vosk ASR to a state-based dialog manager (such as my own open-source NaifJs).

Workflow:

  1. Initialization phase:

    • load a model that allows grammars (e.g. model-small-en-us-0.15)
    • create N different Vosk Recognizers, one per grammar(N) (one grammar for each state(N))
  2. Run-time (decoding time)

    • a "Decode Manager" decides which Recognizer is to be used, depending on the state injected by the dialog manager
    • the Decode Manager could use a fallback Recognizer, built on the same model without any grammar, for a final decision
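
A rough Python sketch of this workflow follows. It is only an illustration under my own assumptions: `DecodeManager` and `vosk_factory` are hypothetical names, the fallback is triggered when the grammar recognizer returns an empty text (one possible policy among many), and I assume a recognizer can be reused for the next utterance after `FinalResult()`. The recognizer is injected through a factory so the manager can be tried without a model on disk.

```python
import json
from typing import Any, Callable, Dict, List, Optional

# Factory signature: grammar phrases (or None for no grammar) -> recognizer object
RecognizerFactory = Callable[[Optional[List[str]]], Any]


class DecodeManager:
    """Holds one pre-built recognizer per dialog state plus a no-grammar fallback."""

    def __init__(self, make_recognizer: RecognizerFactory,
                 grammars: Dict[str, List[str]]) -> None:
        # Initialization phase: create every recognizer once, up front.
        self._by_state = {state: make_recognizer(phrases)
                          for state, phrases in grammars.items()}
        self._fallback = make_recognizer(None)

    def decode(self, state: str, audio: bytes) -> str:
        # Run-time: pick the recognizer for the state given by the dialog manager.
        rec = self._by_state.get(state, self._fallback)
        text = self._run(rec, audio)
        # Assumed policy: an empty result from a grammar recognizer triggers the fallback.
        if not text and rec is not self._fallback:
            text = self._run(self._fallback, audio)
        return text

    @staticmethod
    def _run(rec: Any, audio: bytes) -> str:
        rec.AcceptWaveform(audio)
        return json.loads(rec.FinalResult()).get("text", "")


def vosk_factory(model_path: str, sample_rate: int = 16000) -> RecognizerFactory:
    """Hypothetical helper: recognizers that all share one loaded Vosk model."""
    from vosk import Model, KaldiRecognizer  # imported lazily; requires `pip install vosk`

    model = Model(model_path)

    def make(phrases: Optional[List[str]]) -> Any:
        if phrases is None:
            return KaldiRecognizer(model, sample_rate)
        # Vosk takes the grammar as a JSON-encoded list of phrases.
        return KaldiRecognizer(model, sample_rate, json.dumps(phrases))

    return make
```

With a real model this would be used as `DecodeManager(vosk_factory("model-small-en-us-0.15"), {"confirm": ["yes", "no", "[unk]"]})`, followed by `manager.decode("confirm", pcm_bytes)`.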

See the diagram:

                       state(S-1) -> grammar(S-1)
                      ┌────────────────────────────────────────────────────────────┐
                      │                                                            │
                      │                                                            │
                      │                                                            │
                      │       (1)                                                  │
           ┌──────────▼─────────┐                                                  │
           │                    │                                                  │
           │                    │                                (2)               │
           │                    │   ┌──────────────┐   ┌───────────┐               │
           │                    │   │              │   │           │               │
           │                    │   │ Grammar 1    │   │           │               │
           │                    ◄───┤ Recognizer 1 ◄───┤           │               │
           │                    │   │              │   │           │               │   (3)
           │                    │   │              │   │           │         ┌─────┴─────┐
           │                    │   └──────────────┘   │           │         │           │
           │                    │                      │           │         │           │
           │                    │   ┌──────────────┐   │           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   │ Grammar 2    │   │           │         │           │
           │                    ◄───┤ Recognizer 2 ◄───┤           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   └──────────────┘   │           │         │           │
pcm audio  │       DECODER      │                      │  MODEL    │         │  DIALOG   │
───────────►       MANAGER      │   ┌──────────────┐   │  ALLOWING │         │  MANAGER  ├───────►
           │                    │   │              │   │  GRAMMARS │         │           │
           │                    │   │ Grammar N    │   │           │         │           │
           │                    ◄───┤ Recognizer N ◄───┤           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   └──────────────┘   │           │         │           │
           │                    │                      │           │         │           │
           │                    │                      │           │         │           │
           │                    │                      │           │         │           │
           │                    │   ┌──────────────┐   │           │         │           │
           │                    │   │              │   │           │         │           │
           │                    │   │ No-Grammar   │   │           │         └─────▲─────┘
           │                    ◄───┤ Recognizer 0 ◄───┤           │               │
           │                    │   │              │   │           │               │
           │                    │   │              │   │           │               │
           │ ┌────────────────┐ │   └──────────────┘   └───────────┘               │
           │ │ acceptWaveForm │ │                                                  │
           │ │                │ │                                                  │
           │ └───────┬────────┘ │                                                  │
           │         │          │                                                  │
           │         │          │                                                  │
           └─────────┼──────────┘                                                  │
                     │                                                             │
                     │                                                             │
                     │                                                             │
                     │                                                             │
                     └─────────────────────────────────────────────────────────────┘
                     decode result S
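
The feedback loop in the diagram (the previous state selects the grammar, the decode result drives the next state) can be sketched with a toy dialog. Everything here is hypothetical: the `TRANSITIONS` table, the state names and `run_dialog` are made up for illustration and are not the NaifJs API. `decode` stands for any function that takes a state and audio and returns the recognized text.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical toy dialog: state -> {recognized text -> next state}
TRANSITIONS: Dict[str, Dict[str, str]] = {
    "ask_action": {"turn on": "ask_device", "turn off": "ask_device"},
    "ask_device": {"lamp": "done", "radio": "done"},
}


def grammar_for(state: str) -> List[str]:
    """Phrases expected in a given state, i.e. the grammar for that state."""
    return sorted(TRANSITIONS.get(state, {}))


def run_dialog(decode: Callable[[str, bytes], str],
               utterances: Iterable[bytes],
               state: str = "ask_action") -> List[Tuple[str, str, str]]:
    """Feed utterances through the loop and return (state, text, next_state) steps."""
    trace: List[Tuple[str, str, str]] = []
    for audio in utterances:
        # (1) + (2): the recognizer is chosen from the state left by the previous turn.
        text = decode(state, audio)
        # (3): the dialog manager moves on, or stays put on unexpected input.
        next_state = TRANSITIONS.get(state, {}).get(text, state)
        trace.append((state, text, next_state))
        state = next_state
    return trace
```

Here `grammar_for(state)` is what would be handed to each grammar Recognizer at init time.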

Questions:

  1. Is there any drawback to creating the Vosk Recognizers at init time (and deleting them when the server program exits)?

    BTW, that approach would eliminate the time spent creating a new Recognizer at run-time. I noticed that this part of the latency is really low (a few msecs) when a grammar is specified, whereas it increases to many tens of msecs when a grammar is NOT specified.

  2. Does this architecture make sense? Any suggestions or alternative approaches are very welcome.

Thanks
Giorgio

@nshmyrev
Collaborator

The architecture is ok; it is not a problem to pre-init the recognizers.

In general, we don't recommend using voice for something that requires immediate feedback; much more reasonable ways exist for that. You'd better use voice for something that is more free-form. So I think a latency of 0.5 s is reasonable. Improving on it requires debugging.

@solyarisoftware
Author

solyarisoftware commented May 24, 2021

Yes, in my quick tests, Vosk speed (0.5 s) is almost double that of Coqui STT (~DeepSpeech), using what I believe is a comparably large model.

I would like to experiment with and debug this kind of architecture, as you suggest, considering that it may not be just about "realtime" latency as an absolute metric. It's also about setting up a multiuser server architecture, considering the trade-off between maximizing "performance" and minimizing resources.

Thanks
