Use Home-Assistant pipeline for TTS and STT #195
Currently Willow uses only the intent part of the HA assist pipeline. It would be nice if users could choose if they want to use the entire pipeline (i.e., use the TTS and STT provided by the pipeline). Would this be a feature you might take into consideration?

Comments
Thanks for the benchmark! Use of HA approaches for STT/TTS is likely not something we're going to support (for the reasons you highlight and more). That said, WIS in CPU mode on most x86_64 systems also dramatically outperforms the HA implementations.

We keep running into this issue, and my general position (as harsh/absolutist as it may be) is that the current HA approaches to voice, STT, TTS, etc. just aren't practical. In seven seconds you can find your phone in your house, unlock it, open an app, and just do it there - assuming the tiny model even gets the transcription right in the first place (it often doesn't). At that point you're looking at 14 seconds or more (total) to repeat yourself, and there are many speech segments it will never transcribe accurately.

The wisng README has Raspberry Pi benchmarks (as one example) for the implementation HA uses, and long story short, apples-to-apples (tiny to tiny) a GTX 1070 is 90x faster just for STT... When you get to models that can provide commercially competitive quality (medium) it's at least 112x faster. The Raspberry Pi takes a whopping 51 seconds to do STT on 3.8 seconds of speech with medium! It's not at all a fair fight, but it's a dramatic demonstration of just how impractical the Raspberry Pi approach is.

At the risk of sounding overly critical: HA is fantastic and Willow wouldn't do anything without it. But speech and ML tasks are just a completely different animal, one that Willow and WIS are highly targeted for. If anything we'd go the other direction - a WIS component for HA so HA can use WIS as an STT/TTS endpoint elsewhere within HA.
I totally see your concerns regarding the performance of on-device STT and TTS with Home Assistant. The beauty of HA though is that it allows you to customize the STT and TTS engines of the assist pipeline. I'm currently using Nabu Casa for both and the performance is very satisfactory. I realize that by using HA's full pipeline we would basically be bypassing the majority of the functionality of Willow, so it's totally fair if you think that's not the goal of the project. Since Willow + ESP S3 Box is a readily available setup with wake word capability, I think a lot of people are looking to be able to fully integrate it with the assist pipeline.
My last sentence from the prior reply: "If anything we'd go the other direction - a WIS component for HA so HA can use WIS as an STT/TTS endpoint elsewhere within HA." This is the use of WIS elsewhere in the HA pipeline (Willow related or not) for STT/TTS within HA.

In terms of what you're describing, none of it is impossible (or even close to it). That said, we cannot be solely responsible not only for Willow, WIS, and WAS but also for Home Assistant, openHAB, Hubitat, and the countless other native platforms and integrations people have asked for. That is better left to those communities. We brought the best voice interface at the best price point with the highest performance and accuracy available in open source. This is our focus, and if a community wants functionality like you describe, I don't think it's too much to ask them to put in a little effort to bring our base functionality to their platform of choice - in the way they want it. I would love to see a Willow component with everything you describe and more in HA, but it's not going to come from us.
That's totally fair considering the goal of this project. Thanks for taking the time to explain.
Looking at the code for Willow and WIS, I imagine I can stand up some sort of proxy server on HA that pipes audio from Willow to the assist pipeline. The TTS part would be a lot harder to integrate since it looks like Willow is doing it on-device for now. If/when the TTS output feature in the README is implemented, that part shouldn't be too difficult either. I guess there goes all of my free time 😆. Anyway, thank you and everyone else involved in making this project possible. Excited to see the future of local voice assistance.
This is generally the approach we are thinking of. The idea (essentially) is:
Willow Application Server is under development, and for the 1.0 release it will only support management and update functionality. We will then work on implementing command endpoint and audio proxy support in WAS. To maintain our goal of being platform agnostic, WAS itself can run standalone. However, it's relatively simple for just management, configuration, and proxying audio, so an HA component, add-on, or something similar could essentially emulate WAS for use by Willow devices and integrate natively within HA. It would be very light and straightforward, as HA already has intent, assist, TTS, STT, etc. frameworks, so the native HA WAS component would be dramatically simpler to implement than WAS itself - where we need to more-or-less duplicate and then expand the functionality HA has today.

As you can probably tell, this is all pretty early, and while it may seem convoluted for this discussion, it boils down to a Willow component in HA that not only emulates WAS for Willow management, audio, etc., but does so by exposing WIS as TTS and STT within HA for full assist pipeline support, so it becomes native and seamless. My continuing concern is that abstracting WIS TTS and STT via HA could lead to a reduction in user experience due to response time or other unforeseen issues. I think this will be perfectly acceptable to many HA users, but as @nikito has illustrated, the grammar, accuracy, speed, etc. of Willow enable a lot of speech and command flows that are very convoluted, difficult, or outright impossible with HA intents (and HA generally).

There are basic voice assistant tasks - setting timers, reminders, asking the time, setting an alarm, checking a calendar, etc. - that don't really make any sense for HA to be involved in. It can actually get fairly difficult - consider the timer scenario... You could probably do something with HA scripts, but with WAS there will just be a timer app you include in the grammar plan to activate. When the timer is up (or an alarm, whatever), WAS sends an event to the Willow device, which prints it on the display, plays audio, etc. Same for Google Calendar or any of the other things people are doing with Alexa today.

We plan on integrating Rasa into WAS so we can build extremely flexible grammar with excellent NLU/NLP capabilities, avoiding the awkward syntax you have with Alexa Skills today ("Alexa, ask My Ford to turn the car on") and some of the limitations of HA. Rasa also adds significant capabilities in terms of session awareness, context, turn-by-turn, etc. that enable all kinds of interesting agent possibilities, especially considering WIS also supports hosting LLMs.
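To make the timer example above concrete, here is a purely illustrative sketch. WAS apps are not released yet, so everything here (`TimerApp`, `send_event`, the event payload) is a hypothetical name, not a real WAS API: a command sets a timer, and on expiry the app pushes an event back to the originating Willow device.

```python
# Purely illustrative: WAS apps don't exist yet, so TimerApp, send_event,
# and the event payload below are hypothetical names, not real WAS APIs.
import asyncio


class TimerApp:
    """Hypothetical WAS app: set a timer, notify the Willow device on expiry."""

    def __init__(self, was) -> None:
        self.was = was  # handle to a (hypothetical) WAS event dispatcher

    async def handle_command(self, device_id: str, minutes: int) -> str:
        loop = asyncio.get_running_loop()
        # Schedule the notification; a real WAS would persist this.
        loop.call_later(
            minutes * 60,
            lambda: loop.create_task(self._fire(device_id, minutes)),
        )
        return f"Timer set for {minutes} minutes"

    async def _fire(self, device_id: str, minutes: int) -> None:
        # WAS would render this on the device display and play audio/a tone.
        await self.was.send_event(
            device_id,
            {"type": "timer_done", "text": f"Your {minutes} minute timer is up"},
        )
```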
I think @nikito's comparison is wrong.
@kristiankielhofner I almost forgot the shopping list! Right now I just tell my voice assistant to add something to my shopping list and in the store I check the boxes on my phone, wonderful. |
Can you share your component?
I don't think anything is wrong with what I put? The time HA shows does include TTS (I believe the HA devs mentioned that when they went over this in the launch party). The total time was around 6 seconds. The same example using WIS, including TTS, is 1.27 seconds. I also ensured I cleared the cache on both sides to eliminate that factor and make it a true test of both the STT and TTS. Note I also show that the time in HA is with the small model, while the time in WIS is with the medium model, which is more accurate on more complex sentences such as the examples I used. I'm also using the stock Whisper component in HA, as that is what the OP was referring to when mentioning using the assist pipeline's STT and TTS. 🙂
Yeah, I'll publish it tomorrow |
I wasn't comparing the TTS, and the comparison with different models seems strange. My point is that, with the same model, the complete STT cycle seems to take the same time for both the 'Willow -> WIS -> Willow -> HA' variant and the 'Willow -> HA -> SOME_STT -> HA' variant.
@A6blpka - FYI on your YouTube comparison video - remember voice activity detection. When you're using the HA interface you end recording of speech with an instant mouse click. With Willow we have to wait a reasonable amount of time to ensure the speaker has finished speaking before we stop recording. Check Advanced Settings -> VAD Timeout. The default is 300 ms, which is pretty conservative. If you lower it (I use 100 ms personally) it will detect the end of speech faster, and the overall response time drops by the difference in value.

Note that VAD is tricky - in your example video you are smoothly and clearly firing off a command you've probably repeated many times recently. In practice most voice commands aren't that smooth: users speak slowly, hesitate between words, etc., so really low VAD timeouts are usually only good for benchmarking.

In terms of alarm clocks, reminders, shopping lists, etc. - from the perspective of Willow, WIS, and even WAS the sky is the limit! WAS will have modules/apps/integrations (Python) to chain together any number of supported WIS features, APIs, WAS apps/integrations, and command endpoints (HA, openHAB, etc.).

Sorry, edit - would you be able to test our new WIS implementation? You can use it with WIS URL:
You also strike me as the kind of person that likes tweaking things. Make sure to check our WIS API documentation. |
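For anyone who wants to poke at it outside HA, here is a minimal sketch of calling the WIS speech-to-text endpoint directly. The URL is the one mentioned later in this thread; the request/response details (raw WAV body, JSON with a `text` field) are assumptions - verify them against the WIS API documentation.

```python
# Minimal sketch of a direct WIS STT call. Assumptions (check the WIS API
# docs): the endpoint accepts raw WAV audio in a POST body and returns JSON
# containing the transcript under a "text" key.
import requests

WIS_URL = "https://infer.tovera.io/api/willow"  # or your self-hosted WIS


def transcribe(wav_path: str) -> str:
    with open(wav_path, "rb") as f:
        audio = f.read()
    resp = requests.post(
        WIS_URL,
        data=audio,
        headers={"Content-Type": "audio/wav"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("text", "")


if __name__ == "__main__":
    print(transcribe("sample.wav"))
```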
@nikito - Would you be interested in testing our first release? We have WAS, dynamic configuration, and OTA working but we would like to see more testing - especially with WAS deployment. If so I'll add you to the repos and we can start issues and discussions across WIS/WAS/Willow. |
@kristiankielhofner absolutely! I have another unit on the way as well so I can test out multi deployment OTA 🙂 |
In the video, I only stop recording by clicking on the last attempt; recording on the first attempt is stopped by HA's VAD. I have published a component that implements WIS as an HA STT provider. You can try it through HACS.
I tested it on a local WIS installation and on https://wisng.tovera.io |
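For readers curious what such a component involves, here is a condensed sketch of an HA STT provider that pipes Assist audio to WIS. The `stt.Provider` interface is Home Assistant's, but the WIS request details (content type, response shape) are assumptions - see the published component for the real implementation.

```python
# Sketch of an HA STT provider forwarding Assist audio to WIS instead of
# Whisper. The HA interface is real; the WIS request/response details are
# assumptions.
from homeassistant.components.stt import (
    AudioBitRates,
    AudioChannels,
    AudioCodecs,
    AudioFormats,
    AudioSampleRates,
    Provider,
    SpeechMetadata,
    SpeechResult,
    SpeechResultState,
)
from homeassistant.helpers.aiohttp_client import async_get_clientsession


class WisSttProvider(Provider):
    """Forward Assist's audio stream to a WIS endpoint for transcription."""

    def __init__(self, hass, wis_url: str) -> None:
        self.hass = hass
        self._wis_url = wis_url  # e.g. https://infer.tovera.io/api/willow

    @property
    def supported_languages(self):
        return ["en"]

    @property
    def supported_formats(self):
        return [AudioFormats.WAV]

    @property
    def supported_codecs(self):
        return [AudioCodecs.PCM]

    @property
    def supported_bit_rates(self):
        return [AudioBitRates.BITRATE_16]

    @property
    def supported_sample_rates(self):
        return [AudioSampleRates.SAMPLERATE_16000]

    @property
    def supported_channels(self):
        return [AudioChannels.CHANNEL_MONO]

    async def async_process_audio_stream(
        self, metadata: SpeechMetadata, stream
    ) -> SpeechResult:
        # Buffer the whole utterance, then send it to WIS in one request.
        audio = b"".join([chunk async for chunk in stream])
        session = async_get_clientsession(self.hass)
        resp = await session.post(self._wis_url, data=audio)
        if resp.status != 200:
            return SpeechResult(None, SpeechResultState.ERROR)
        body = await resp.json()
        # Response shape is an assumption; adjust to the actual WIS output.
        return SpeechResult(body.get("text", ""), SpeechResultState.SUCCESS)
```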
How can I configure it to get the reply on my ESP BOX?
This component doesn't solve it |
If I understand Willow's work correctly, I see a flow like this: |
This is great work! In terms of flow, my sense is the cleanest and most robust implementation can only come from an HA integration done via WAS and/or the WAS protocol (in concept stage) -OR- abstracted via WAS itself. This would allow for abstracting all of the components and flow to the point where Willow doesn't know the difference. One example: audio in after wake -> STT -> HA pipeline -> potential TTS output to audio -> response to Willow, which shows the command output and plays the TTS audio because the response has audio. If there is no audio, play a success/failure tone (or custom tones as provided).
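To illustrate that flow - nothing below is real Willow/WAS/HA code; every function is a hypothetical stub standing in for the real call:

```python
# Conceptual sketch of the flow above. All functions are hypothetical stubs.
import asyncio
from dataclasses import dataclass


@dataclass
class PipelineResult:
    text: str                 # command output to show on the Willow display
    tts_audio: bytes | None   # TTS audio, if the pipeline produced any
    ok: bool


async def wis_transcribe(audio: bytes) -> str:
    return "turn on the kitchen lights"  # stand-in for the WIS STT call


async def ha_pipeline(text: str) -> PipelineResult:
    return PipelineResult("Turned on the kitchen lights", b"...", True)  # stub


async def handle_wake_audio(audio: bytes) -> None:
    text = await wis_transcribe(audio)   # audio in after wake -> STT
    result = await ha_pipeline(text)     # -> HA pipeline
    print(result.text)                   # show command output
    if result.tts_audio:
        print("playing TTS audio")       # response has audio
    else:
        print("playing success tone" if result.ok else "playing failure tone")


asyncio.run(handle_wake_audio(b"fake-pcm"))
```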
Where can we discuss this? |
I just sent you an invite to the (currently private) willow-application-server repo. |
Question: Is this component the ONLY way to make Willow send commands to HA, or are there alternative ways to do this? |
Willow natively supports sending commands to HA; the above has nothing to do with that. I think that component is for using WIS as an STT provider natively inside HA, which would really only be useful if you are trying to use the HA ESPHome implementation of assistants instead of Willow (which may have mixed results and isn't the optimal way to do it from a Willow perspective ;) )
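For context on that native path: Willow ships the transcript to Home Assistant's conversation API. Here is a minimal sketch of the equivalent call against HA's documented REST endpoint (`/api/conversation/process`); the host and token are placeholders, and Willow's own transport may differ from this illustration.

```python
# Sketch of sending a command transcript to Home Assistant's conversation
# REST endpoint. HA_URL and TOKEN are placeholders; create a long-lived
# access token in your HA profile to try this.
import requests

HA_URL = "http://homeassistant.local:8123"  # placeholder host
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # placeholder token


def send_command(text: str) -> dict:
    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"text": text, "language": "en"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    print(send_command("turn on the kitchen lights"))
```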
No, that's not what you need. |
WIS is an isolated ecosystem. It only works with Willow. Take a look at heywillow.io to get an idea of how things work :)
I think that this discussion does not follow the question I asked when I created this issue. The question was: can Willow (the software running on the S3 BOX) support the HA assist pipelines (effectively bypassing WIS) or not? From what I understood, Willow will not support this, so I think that this issue should be closed. |
With @A6blpka's help I found my error: during installation of the HA WIS integration from HACS, it automatically suggested a localhost URL on port 9000. I didn't understand that I had to replace this with https://infer.tovera.io/api/willow, but after doing so, it works perfectly! So yes, the integration allows HA to use WIS.
Just so you know, the HA WIS integration is a fork and not the official WIS. The fork has modifications to make it work in this way, but it is not the "officially" supported way to use WIS; my answer was in the context of WIS in our repo. 😉
My understanding of this is that the official WIS STT is used. The integration only pipes the voice stream from Assist to WIS instead of piping it to Whisper, right? |