TTS Does not handle numbers in text #78
Comments
This is a known limitation of the SpeechT5 model we're using. We have an open issue to switch to a completely different model that would be superior in every regard (including something as basic as speaking numbers). There is a wisng branch, to be merged shortly, that brings a new TTS engine. We're still evaluating engines, but this issue will be addressed then.
Saw the other issue and suspected that'd be the case, but figured I'd mention it just in case. Looking forward to trying out the new engine! 😃
While I've implemented many of them, I'm a little frozen by the "paradox of choice" at this point. There are so many options in terms of voices that I'm trying to determine what is most pleasing to the community generally. If you have any input on the various options and voices, I'd love to hear it!
Guess that depends on the engine 🤣 I find it interesting that Coqui seems to let you clone a voice based on an audio sample, which could be useful for letting the community forge their own voices. Barring that, I imagine the community would probably want something like Jarvis, i.e. a British-sounding male voice, or on the female side (I think studies show female voices tend to sound better due to the way their tonalities are processed) something akin to the Alexa/Google female voices. Personally I'm trying to find a voice with an Irish accent (something akin to F.R.I.D.A.Y. from the Marvel movies 😄), as my wife is Irish and likes the idea of our voice assistant having that kind of accent.

The other side of things is of course multi-language support, which makes the desire for choices even greater, as I imagine people will want the voice to sound good while also pronouncing words correctly in their native language. Are we trying to settle on a general voice for now, and later give people the ability to pick or even create their own?
Another option could be to choose several voices that sound good, then make a poll with samples and let the community vote for their choice? 😃 In terms of the systems out there, I admittedly haven't played with a lot of them; the last two I tried were Mimic3 and, recently, Piper. I don't think either of those leverages GPGPU though, so they may not be as performant as you're looking for. Other than that, I've been playing around with the Coqui.ai console with the different voices you can generate there, and it seems pretty neat. I may try to spin up a local instance to see what can be done with the different TTS voices they include with the distro, and try out the cloning to see how that works as well.
The good (and bad) news is SpeechT5 is so poor we almost can't go wrong with ANYTHING else. I selected it because the other contender projects are significantly more involved; SpeechT5 in total was a few lines for a Hugging Face Transformers model (see the sketch after this comment). The other projects are... not.

At this point I'd be fine with input from another engaged community member as well as the internal team on selecting a voice. It just needs to get done, and if we don't nail it this time around we'll revisit (or make it modular, etc.). My strong preference is for the highest-quality generic American female voice as the priority/default. As you note, this is a fairly well-researched field, and generally speaking studies show that people tend to prefer it.

Custom voices are actually fairly straightforward and even supported with SpeechT5 now. Look at /docs (or /api/docs with wisng) for the speaker management APIs. Be advised, though: much like the rest of SpeechT5, the results aren't fantastic.

We still primarily target GPGPU for all of the reasons noted in the README and elsewhere, with the ability to also run CPU-only with speed tradeoffs. Coqui is really easy to spin up locally - integrating it into WIS is another story, but it's certainly something we can do. If you want to install it locally (their Docker implementation is solid) and provide feedback, that would be great!
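For reference, a minimal sketch of that "few lines" SpeechT5 path, following the standard Hugging Face Transformers pattern. This mirrors the Transformers documentation example, not WIS's exact code; the x-vector dataset used here is one commonly used source of speaker embeddings:

```python
# Minimal SpeechT5 TTS via Hugging Face Transformers (generic pattern,
# not WIS's exact code). Speaker identity comes from a 512-dim x-vector.
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello from Willow!", return_tensors="pt")

# Any x-vector works as a "custom voice"; this dataset is a common example.
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("out.wav", speech.numpy(), samplerate=16000)
```

Swapping in a different x-vector is essentially all the speaker management APIs need to do, which is why custom voices are straightforward even if the output quality is limited.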
I actually played with some of the Coqui voices last night on Hugging Face, and found the Jenny TTS voice to be really good! I'm going to try to spin up a local instance and see how it does on my GTX 1070. 😃
Great! You can also try some of the Tortoise voices; they're generally pretty highly regarded.
Great to see! The good news is wisng has extremely performant caching of TTS responses via nginx (and we use Cloudflare tiered caching with Cache Reserve for our hosted instances), so speed is less of a concern; for the time being most of the TTS output is very repetitive.
TL;DR - There is initial support for ONNX export of VITS models. With the ONNX CUDA execution provider this speeds up VITS by 20% (or so), and on newer GPUs with tensor cores the ONNX TensorRT execution provider (likely) speeds it up dramatically, as tends to be the case with tensor cores. We would need to add onnxruntime (easy enough), and we already have code to detect tensor core availability, so it would be pretty straightforward to pull this off. A rough sketch of the provider selection is below.
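A minimal sketch of the execution-provider fallback idea with onnxruntime; "vits.onnx" is a placeholder path, and real VITS exports have model-specific input names:

```python
# Hedged sketch: prefer TensorRT when available, fall back to CUDA, then CPU.
# onnxruntime tries providers in order and skips any that fail to load.
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),  # needs TensorRT + tensor cores
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

session = ort.InferenceSession("vits.onnx", providers=providers)
print("Active providers:", session.get_providers())  # shows which ones actually loaded
```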
Is there somewhere to listen to the voice options? I'd be happy to add another vote to the mix. Just got my 1070 added to the server, so I should have a WIS instance spun up soon.
I personally just spun up their CUDA-based Docker container and played around with the voices on that. They detail how to do that here: https://tts.readthedocs.io/en/latest/docker_images.html Make sure to use the GPU version. 😊
I added support in wisng to convert numbers to words with SpeechT5, so that will at least get you support for numbers. You can use it with the wisng branch. (The general preprocessing idea is sketched after this comment.)

I've also used the TTS Docker containers extensively. My general take is that TTS would be fairly difficult to integrate directly into WIS; it has a TON of Python dependencies pinned to specific versions, and it would be tough to integrate them cleanly.

I spent some time getting CPU, CUDA, and TensorRT working with an ONNX export of the VITS TTS model. With TensorRT on my 3090 it can do TTS at approximately 10x realtime (as a first pass). That's a substantial boost with TensorRT and will only work on newer GPUs, but even with the CUDA runtime it's still a bit faster than the default GPU implementation with PyTorch. With an ONNX export it should also be fairly straightforward to extract the relevant onnxruntime support for VITS from TTS and use it directly with minimal additional dependencies. Still needs more work to even validate the approach, but I like what I'm seeing so far.
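For illustration, a hedged sketch of the number-to-words preprocessing idea using the num2words library (the general approach only; the actual wisng implementation may differ):

```python
# Replace digit runs with spelled-out words before handing text to SpeechT5.
# General approach only; the actual wisng preprocessing may differ.
import re
from num2words import num2words

def numbers_to_words(text: str) -> str:
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(numbers_to_words("the weather is 55 degrees"))
# -> "the weather is fifty-five degrees"
```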
Sounds promising! One other thing I noticed: the Coqui Docker server exposes a few API endpoints, one of which is /api/tts, which takes a text parameter. It appears to work pretty much the same as the WIS TTS endpoint currently exposed. I know that's not the same as a direct, WIS-integrated implementation, but figured it was worth mentioning. 🙂 (A quick sketch of calling it is below.)
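For anyone who wants to poke at it, a quick sketch of hitting that endpoint from Python (assuming the Coqui server is running locally on its default port 5002; the host, port, and output handling here are assumptions):

```python
# Fetch synthesized speech from a locally running Coqui TTS server.
# localhost:5002 is assumed to be the server's default; adjust as needed.
import requests

resp = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": "Currently, the weather is 55 degrees."},
)
resp.raise_for_status()
with open("tts.wav", "wb") as f:
    f.write(resp.content)  # the endpoint returns WAV audio
```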
wisng already uses nginx to cache TTS responses, so there is a specific route match for TTS requests. I'm just not very keen on managing yet another Docker container that exists completely outside of the current inference support in WIS today - it would have a different container base, potentially with different versions of CUDA that depend on different drivers. Their TTS today is also quite slow compared to even SpeechT5, and we'd have to either maintain a fork of TTS or attempt to upstream a bunch of Willow-specific changes that they'd likely be reluctant to accept (I know I would be).

We're also working towards a concept of full conversational flow management, with support for things like audio in -> STT -> do something -> TTS. This would get pretty messy (not to mention slower) with an additional API call to what would be an external endpoint for TTS.

I only just started on getting Coqui (in some form or fashion) into WIS. It's not impossible; it will just take a little creativity because, as I mentioned, the dependency management as-is would make WIS a mess (and a nightmare to maintain).
Fair points, totally get it. Between the dependency nightmare and the architectural concerns (risk of some API changing down the line, or some other shift) I can understand the difficulty there. I'll give Coqui a rest for now and play around with the ST5 stuff once my S3 box finally arrives in a couple weeks 😂 Thanks for the insight!

EDIT: I'll also try out the wisng branch for the numbers fix. 🙂 If you're looking for input on any other TTS stuff, I'm willing to experiment!
Any testing of wisng would be great! BTW, there is a WebRTC endpoint that loads a page to do ASR directly in your browser that you can test with.
Did some testing on the number handling - works now! One thing I noticed while testing: it has a slight pause when speaking compound numbers like 52. For instance, instead of saying "fifty two" it says "fifty [pause] two". Not sure if there's a way to improve that? Otherwise working great! 😃
SpeechT5 has pretty aggressive pausing on spaces, commas, hyphens, etc. I'll look into "speeding it up", potentially with a URI parameter.
Was just about to comment: I looked at the library being used (num2words) and I think the issue is that when it converts a number like 52 you end up with "fifty-two", and the hyphen seems to make T5 pause for a moment. I tested this directly on the TTS server and saw the same behavior. I then removed the hyphen so the text was "fiftytwo", and it spoke the number correctly without a pause. Maybe it's possible to do a string replace to remove the hyphen from the num2words output to fix that? (Rough sketch below.)
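Something like this tiny tweak to the earlier preprocessing sketch would do it (whether to drop the hyphen or swap it for something else is a judgment call for how SpeechT5 paces the result):

```python
from num2words import num2words

def number_to_speech(n: int) -> str:
    # Drop the hyphen num2words inserts ("fifty-two" -> "fiftytwo")
    # so SpeechT5 doesn't insert a pause mid-number.
    return num2words(n).replace("-", "")

print(number_to_speech(52))  # -> "fiftytwo"
```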
Awesome, thanks for the quick turnaround! :)
Great!
I tried testing the TTS using a generated text response from my HA instance as follows:
Currently, the weather is 55 degrees with partly cloudy skies. Under present weather conditions, the temperature feels like 55 degrees. In the next few hours you can expect more of the same, with a temperature of 55 degrees.
What I noticed is the TTS generated silence for the number 55 but spoke all the other text. It seems it does not know how to handle numeric values?
I also noticed it did the same when trying to report the time, such as 8:55AM. I haven't tried it yet, but I imagine it may have similar trouble handling date strings as well. Maybe there's a way to have it handle these specific numeric formats?
EDIT: just tried the string "Today is Thursday, June 01 2023." and it was silent on all the numbers. Also tried "Today is Thursday, June 1st 2023." and it says "st" for the "1st" part, with silence on all the other numbers.