Some thoughts #18
-
Yeah, I think many of these are great ideas for building onto this project. Like you said, I'm not really spending time adding new features anymore, but I'm more than happy to give my thoughts on STT, TTS, blank responses, text-only chat, and the additional models.
-
Hello, I'm back after a bit of testing. So far things still generally work. However, I have some thoughts on things that could be improved, or that could add to the flexibility of what you have so far, should you decide you wish to develop this more in the future.
Whisper:
For starters, I'm curious how hard it would be to swap out which Whisper model this uses. Whisper tiny is fast, but in my use it's often inaccurate, even on simple words and phrases from quite clear audio sources. It works, but a way to swap to one of the larger models would be a nice option for anyone willing and able to deal with the added latency. If it could be made into a setting in constants.py or wherever, that would be nice, I think.
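For illustration, a minimal sketch of what that setting could look like, assuming the project loads Whisper through the openai-whisper package (the setting name WHISPER_MODEL_SIZE is made up here):

```python
# constants.py -- hypothetical setting name for this sketch
WHISPER_MODEL_SIZE = "small"  # "tiny", "base", "small", "medium", or "large"
```

```python
# wherever the transcriber is created
import whisper  # openai-whisper

from constants import WHISPER_MODEL_SIZE

model = whisper.load_model(WHISPER_MODEL_SIZE)

def transcribe(audio_path: str) -> str:
    # transcribe() returns a dict; "text" holds the full transcript
    return model.transcribe(audio_path)["text"]
```

Nothing else would need to change; whisper.load_model() accepts any of the size names, so the speed/accuracy trade-off becomes one line in the config.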
TTS:
The audio-file-to-voice cloning done by XTTS is rather suboptimal. I'd say it's inconsistent at best, and it's often choppy in my tests, making it difficult to listen to. Is there any way to run RVC models with this? I'm pretty sure they can't be streamed in real time, so it would add a lot of latency, but I'm wondering whether it's a difficult option to add, or even possible with the frameworks being used. Assuming one has the hardware to run all this, it could still be quick enough, I think.
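As far as I know, RVC doesn't expose a stable Python API to call directly, so this is only a shape-of-the-pipeline sketch: generate with XTTS first, then re-voice the finished wav as a post-processing step before playback. It assumes XTTS is driven through Coqui's TTS package; voice_sample.wav is a stand-in reference clip, and convert_with_rvc() is a placeholder for however a given RVC install is actually invoked:

```python
from TTS.api import TTS  # Coqui TTS, which provides XTTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

def synthesize(text: str, out_path: str = "xtts_out.wav") -> str:
    # Normal XTTS generation against the reference clip
    tts.tts_to_file(text=text, speaker_wav="voice_sample.wav",
                    language="en", file_path=out_path)
    return out_path

def convert_with_rvc(in_wav: str, out_wav: str = "converted.wav") -> str:
    # Placeholder only: call RVC however a given install exposes it
    # (its CLI, a local web API, or Python bindings) to re-voice in_wav.
    raise NotImplementedError("wire up an RVC install here")

# Pipeline order: synthesize first, convert second, then play the result.
final_wav = convert_with_rvc(synthesize("Hello there."))
```

Since conversion can only start once the full clip exists, the latency cost is roughly the RVC inference time on top of XTTS generation, which is why I'd expect this to be an optional setting rather than the default.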
Blank Responses:
I have noticed that 'human responses' that are blank, where audio was detected by the mic but no speech made it into the transcript, still get forwarded, and the AI responds to them, which often means needless chatter and a confused AI. I'm not sure where in the chain of events this would best be fixed, but something preventing blank messages from being forwarded seems like a good idea.
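The cleanest fix might be a guard right at the hand-off between transcription and the model; a minimal sketch, with made-up names for the hand-off:

```python
from typing import Optional

def should_forward(transcript: Optional[str]) -> bool:
    # Drop messages that are empty or whitespace-only after transcription;
    # also covers the case where the STT step returned nothing at all.
    return bool(transcript and transcript.strip())

# hypothetical hand-off point:
# if should_forward(user_text):
#     send_to_model(user_text)
```

That way silence or non-speech noise that triggers the mic simply gets ignored instead of producing an empty turn.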
Text-Only Chat:
This might seem odd considering the base idea of how this was meant to be used, but I'd be curious about an option to chat with the AI in text only. My interest would mainly be to test it, to see what its behavior is like and how it responds without the overhead of TTS or audio transcription. I'm also interested in it as a way to organically populate the AI's memories through conversation more quickly, simulating the vocal chat at speed, if that makes sense. Maybe this can already be done through the Text WebUI? If so, I haven't found out how.
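If the model is served through the Text WebUI with its OpenAI-compatible API enabled, a throwaway loop like this could talk to the same backend with no STT or TTS in the path (the port is that extension's default; adjust to the actual setup):

```python
import requests

URL = "http://127.0.0.1:5000/v1/chat/completions"  # text-generation-webui API

history = []
while True:
    user = input("You: ")
    if user.strip().lower() in {"quit", "exit"}:
        break
    history.append({"role": "user", "content": user})
    resp = requests.post(URL, json={"messages": history, "max_tokens": 300})
    reply = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    print("AI:", reply)
```

This wouldn't go through the project's own memory handling, of course; it would just exercise the underlying model and prompts directly.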
Second active Model and 'Agents':
This idea broadens the scope from small changes to something more significant. I was wondering how feasible it would be to implement a second model that acts as a kind of "pre-processor" for human input before passing it to the main AI model for response generation.
I have some ideas on how to make the AI behave more dynamically, and I believe the most effective way to achieve this could involve the use of agents. Specifically, a smaller model, something like a 1-3B, could quickly analyze the human input and determine the most appropriate prompts to accompany it before sending it to the larger model for processing. This could be done with the larger model that already handles the chat, but that would be much slower than running the small one first.
That said, this is just an initial thought to explore whether such a feature is even worth considering. Can the system support using two models simultaneously? Based on how the Text WebUI currently operates, it seems like this might not be possible. If it is possible somehow, I'm happy to elaborate on how I think such a feature could be used here.
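To make the flow concrete, here's a rough sketch assuming the small model is reachable on a second OpenAI-compatible endpoint (e.g. a llama.cpp server or a second webui instance; both URLs and the prompt labels are made up for illustration):

```python
import requests

SMALL = "http://127.0.0.1:5001/v1/chat/completions"  # hypothetical 1-3B endpoint
MAIN = "http://127.0.0.1:5000/v1/chat/completions"   # main chat model

PROMPTS = {
    "question": "Answer factually and concisely.",
    "smalltalk": "Respond casually and stay in character.",
    "task": "Walk through the request step by step.",
}

def classify(user_text: str) -> str:
    # Ask the small model for a one-word label; fall back to smalltalk.
    r = requests.post(SMALL, json={
        "messages": [{
            "role": "user",
            "content": ("Label this message as question, smalltalk, or task. "
                        "Reply with one word only.\n\n" + user_text),
        }],
        "max_tokens": 4,
    })
    label = r.json()["choices"][0]["message"]["content"].strip().lower()
    return label if label in PROMPTS else "smalltalk"

def respond(user_text: str) -> str:
    # Prepend the selected prompt, then hand off to the main model.
    system = PROMPTS[classify(user_text)]
    r = requests.post(MAIN, json={
        "messages": [{"role": "system", "content": system},
                     {"role": "user", "content": user_text}],
        "max_tokens": 300,
    })
    return r.json()["choices"][0]["message"]["content"]
```

The classification call is tiny (a few tokens in and out), so on a small model it should add far less latency than making the big model do the same analysis first.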
I have some other ideas and thoughts, but these are the most straightforward and practical ones. You mentioned previously that you're busy, so none of these are really requests; I'm just posing the thoughts to see whether you think they can be done, or how hard they would be to do.