Some thoughts #18
-
Yeah, I think many of these are great ideas for building onto this project. Like you said, I'm not really spending time adding new features anymore, but I'm more than happy to give my thoughts on STT, TTS, blank responses, text-only chat, and the additional models.
-
Hello, I'm back after a bit of testing. So far things still generally work. However, I have some thoughts on things that could be improved, or that could add to the flexibility of what you have so far, should you decide you wish to develop this more in the future.
Whisper:
For starters, I'm curious how hard it would be to swap out which Whisper model this uses. Whisper tiny is fast, but in my use it's often inaccurate, even on simple words and phrases from quite clear audio sources. It works, but a way to swap to one of the larger models would be a nice option for anyone willing and able to deal with the added latency. If it could be made into a setting in constants.py or wherever, that would be nice, I think.
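For illustration, a minimal sketch of what that setting could look like, assuming the project loads Whisper through the openai-whisper package (the setting name WHISPER_MODEL_SIZE is made up here):

```python
# constants.py -- hypothetical setting name for this sketch
WHISPER_MODEL_SIZE = "small"  # "tiny", "base", "small", "medium", or "large"
```

```python
# wherever the transcriber is created
import whisper  # openai-whisper

from constants import WHISPER_MODEL_SIZE

model = whisper.load_model(WHISPER_MODEL_SIZE)

def transcribe(audio_path: str) -> str:
    # transcribe() returns a dict; "text" holds the full transcript
    return model.transcribe(audio_path)["text"]
```

Nothing else would need to change; whisper.load_model() accepts any of the size names, so the speed/accuracy trade-off becomes one line in the config.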
TTS:
The audio-file-to-voice cloning done by XTTS is rather suboptimal. I'd say it's inconsistent at best, and it's often choppy in my tests, making it difficult to listen to. Is there any way to run RVC models with this? I'm pretty sure they can't be streamed in real time, so it would add a lot of latency, but I'm wondering whether it's a difficult option to add, or even possible with the frameworks being used. Assuming one has the hardware to run all this, it could still be quick enough, I think.
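As far as I know, RVC doesn't expose a stable Python API to call directly, so this is only a shape-of-the-pipeline sketch: generate with XTTS first, then re-voice the finished wav as a post-processing step before playback. It assumes XTTS is driven through Coqui's TTS package; voice_sample.wav is a stand-in reference clip, and convert_with_rvc() is a placeholder for however a given RVC install is actually invoked:

```python
from TTS.api import TTS  # Coqui TTS, which provides XTTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

def synthesize(text: str, out_path: str = "xtts_out.wav") -> str:
    # Normal XTTS generation against the reference clip
    tts.tts_to_file(text=text, speaker_wav="voice_sample.wav",
                    language="en", file_path=out_path)
    return out_path

def convert_with_rvc(in_wav: str, out_wav: str = "converted.wav") -> str:
    # Placeholder only: call RVC however a given install exposes it
    # (its CLI, a local web API, or Python bindings) to re-voice in_wav.
    raise NotImplementedError("wire up an RVC install here")

# Pipeline order: synthesize first, convert second, then play the result.
final_wav = convert_with_rvc(synthesize("Hello there."))
```

Since conversion can only start once the full clip exists, the latency cost is roughly the RVC inference time on top of XTTS generation, which is why I'd expect this to be an optional setting rather than the default.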
Blank Responses:
I have noticed that 'human responses' that are blank, where audio was detected by the mic but no speech made it into the transcript, still get forwarded, and the AI responds to them, which often means needless chatter and a confused AI. I'm not sure where in the chain of events this would best be fixed, but something preventing blank messages from being forwarded seems like a good idea.
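The cleanest fix might be a guard right at the hand-off between transcription and the model; a minimal sketch, with made-up names for the hand-off:

```python
from typing import Optional

def should_forward(transcript: Optional[str]) -> bool:
    # Drop messages that are empty or whitespace-only after transcription;
    # also covers the case where the STT step returned nothing at all.
    return bool(transcript and transcript.strip())

# hypothetical hand-off point:
# if should_forward(user_text):
#     send_to_model(user_text)
```

That way silence or non-speech noise that triggers the mic simply gets ignored instead of producing an empty turn.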
Text-Only Chat:
This might seem odd considering the base idea of how this was meant to be used, but I'd be curious about an option to chat with the AI in text only. My interest would mainly be to test it, to see what its behavior is like and how it responds without the overhead of TTS or audio transcription. I'm also interested in it as a way to organically populate the AI's memories through conversation more quickly, simulating the vocal chat at speed, if that makes sense. Maybe this can already be done through the Text WebUI? If so, I haven't found out how.
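If the model is served through the Text WebUI with its OpenAI-compatible API enabled, a throwaway loop like this could talk to the same backend with no STT or TTS in the path (the port is that extension's default; adjust to the actual setup):

```python
import requests

URL = "http://127.0.0.1:5000/v1/chat/completions"  # text-generation-webui API

history = []
while True:
    user = input("You: ")
    if user.strip().lower() in {"quit", "exit"}:
        break
    history.append({"role": "user", "content": user})
    resp = requests.post(URL, json={"messages": history, "max_tokens": 300})
    reply = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    print("AI:", reply)
```

This wouldn't go through the project's own memory handling, of course; it would just exercise the underlying model and prompts directly.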
Second active Model and 'Agents':
This idea broadens the scope from small changes to something more significant. I was wondering how feasible it would be to implement a second model that acts as a kind of "pre-processor" for human input before passing it to the main AI model for response generation.
I have some ideas on how to make the AI behave more dynamically, and I believe the most effective way to achieve this could involve the use of agents. Specifically, a smaller model, something like a 1-3B, could quickly analyze the human input and determine the most appropriate prompts to accompany it before sending it to the larger model for processing. This could be done with the larger model that already handles the chat, but that would be much slower than running the small one first.
That said, this is just an initial thought to explore whether such a feature is even worth considering. Can the system support using two models simultaneously? Based on how the Text WebUI currently operates, it seems like this might not be possible. If it is possible somehow, I'm happy to elaborate on how I think such a feature could be used here.
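To make the flow concrete, here's a rough sketch assuming the small model is reachable on a second OpenAI-compatible endpoint (e.g. a llama.cpp server or a second webui instance; both URLs and the prompt labels are made up for illustration):

```python
import requests

SMALL = "http://127.0.0.1:5001/v1/chat/completions"  # hypothetical 1-3B endpoint
MAIN = "http://127.0.0.1:5000/v1/chat/completions"   # main chat model

PROMPTS = {
    "question": "Answer factually and concisely.",
    "smalltalk": "Respond casually and stay in character.",
    "task": "Walk through the request step by step.",
}

def classify(user_text: str) -> str:
    # Ask the small model for a one-word label; fall back to smalltalk.
    r = requests.post(SMALL, json={
        "messages": [{
            "role": "user",
            "content": ("Label this message as question, smalltalk, or task. "
                        "Reply with one word only.\n\n" + user_text),
        }],
        "max_tokens": 4,
    })
    label = r.json()["choices"][0]["message"]["content"].strip().lower()
    return label if label in PROMPTS else "smalltalk"

def respond(user_text: str) -> str:
    # Prepend the selected prompt, then hand off to the main model.
    system = PROMPTS[classify(user_text)]
    r = requests.post(MAIN, json={
        "messages": [{"role": "system", "content": system},
                     {"role": "user", "content": user_text}],
        "max_tokens": 300,
    })
    return r.json()["choices"][0]["message"]["content"]
```

The classification call is tiny (a few tokens in and out), so on a small model it should add far less latency than making the big model do the same analysis first.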
I have some other ideas and thoughts, but these are the most straightforward and practical ones. You mentioned previously that you're busy, so none of these are really requests; I'm just posing the thoughts to see whether you think they can be done, or how hard they would be to do.