Replies: 6 comments 11 replies
-
I think dividing them into separate trainers would be good, to avoid adding complexity to the trainer every time we have a different or specific use-case. It's already a pretty big file (1.2k lines). Plus it's easier to maintain without the if/else branches that arise from supporting all the different usages. (It's just my opinion; maybe option 2 would achieve the same goal.)
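To illustrate the point about if/else complexity, here is a minimal sketch. All class and method names are invented for illustration, not Coqui-TTS's actual API: a single trainer accumulates a branch per task, while separate trainers keep each use-case self-contained.

```python
# Hypothetical sketch, not the real Trainer: one class, one branch per task.
class MonolithicTrainer:
    def train_step(self, batch, task):
        if task == "tts":
            return {"loss": sum(batch)}      # placeholder TTS step
        elif task == "vocoder":
            return {"loss": max(batch)}      # placeholder vocoder step
        raise ValueError(f"unknown task: {task}")


# Versus separate trainers: each subclass owns its own step, no branching.
class BaseTrainer:
    def train_step(self, batch):
        raise NotImplementedError


class TTSTrainer(BaseTrainer):
    def train_step(self, batch):
        return {"loss": sum(batch)}


class VocoderTrainer(BaseTrainer):
    def train_step(self, batch):
        return {"loss": max(batch)}
```

Every new use-case then becomes a new subclass instead of another elif in a 1.2k-line file.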
-
I have been away for a while; could you explain what BaseSTT is? If that means STT being merged with TTS (even partially), I don't see the point of adding it to Coqui-TTS, since Coqui already has a separate STT module, coqui-STT. Coming to the point under discussion: in my humble opinion, why not stick with things the way they are? Why make architectural changes in every new update and reformat code, creating bugs that are not foreseeable and are hard to track down unless rigorously tested on all TTS and vocoder models? I think these types of changes facilitate only coders, not the community. Coqui-TTS has an expanding community, and many of its members are not coders or programmers, so to speak, but use Coqui for research and to build different applications. For example, look how much confusion coqpit alone created around configuration (though it was a necessary evil). To summarize: I think these changes should be decided once and then kept as they are.
-
It is not being merged into TTS. I'm just experimenting with new ideas about the integration between TTS and STT systems. For instance, we could use STT models to detect the best TTS model in your training. For such things, we need some STT work here.
🐸TTS has not reached a stable v1.0 yet. Until then, things are open to radical changes, API improvements, and maybe (I hope not) a complete rewrite. The good thing about this dynamic work is that contributors and users have more opportunities to influence the design of the library they like and use, instead of being forced to use something preset for them. So feel free to share your ideas and suggestions. Even better, send ✨PRs✨ to make 🐸TTS even more "you"-friendly. Trainer V2 is necessary to decouple the dependencies between the different tasks and the training cycle, so that we can separate the engineering effort from the research. It also opens the door to different models, architectures, and domains using the 🐸TTS backbone.
Confusion is a price worth paying for changes that make the library better in the long run. We can't stop developing the library just to avoid confusion.
Yes; after we release v1.0, but before the stable version, it is different.
-
I would like to put in my two cents. Concerning the three different ways to make the core trainer more agnostic, I am in favor of the first option.
Concerning other options, I have a personal preference for a modified TTS workchain. I suggest separating preprocessing from the training process and creating a clear interface between the two. To my understanding, all Coqui TTS models are trained on integer-based tensors. I think it would be useful to make this more transparent to users. Michael Hansen (@synesthesiam) opted for such an interface between larynx and gruut for his rhasspy project. The transformation process (preprocessing) that phonemizes text and converts characters, phonemes, symbols, or emojis into integer IDs for training has its own complexity. Preprocessing a dataset for TTS and training a TTS model require different skills. Creating a clear interface between the two worlds would allow users to specialise in one of these domains, and sharing datasets as integer IDs, instead of raw text, would facilitate the training of TTS models.
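To make the idea concrete, here is a minimal sketch of such an interface. The phoneme inventory and mapping below are invented for illustration; real tooling (gruut, phonemes2ids, espeak-ng) handles this far more carefully.

```python
# Hypothetical preprocessing/training boundary: text is phonemized and
# mapped to integer IDs *before* it ever reaches a trainer, so the trainer
# only sees integer sequences. The phoneme set here is made up.
PHONEME_TO_ID = {"<pad>": 0, "h": 1, "ə": 2, "l": 3, "oʊ": 4}


def phonemes_to_ids(phonemes):
    """Convert a phoneme sequence into the integer IDs a model trains on."""
    return [PHONEME_TO_ID[p] for p in phonemes]


ids = phonemes_to_ids(["h", "ə", "l", "oʊ"])
# A trainer would consume `ids` (e.g. as an integer tensor), never raw text,
# so a dataset shared as ID sequences is directly trainable.
```

With this split, someone who only knows phonemization can publish ID datasets, and someone who only knows model training can consume them.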
-
I started my TTS trials a few months ago with the csv2phonemeids script included in rhasspy/gruut. I recently checked out the phonemes2ids library when it was released in August this year. espeak-phonemizer is my favorite tool for converting datasets for multilingual TTS models, mainly with German, French, and Dutch; the delimited CSV input flag is very useful for these tasks. espeak-phonemizer was an incentive for me to resume my old experiments, started in 2014, on adding Luxembourgish as a language to espeak-ng. I now have a first prototype running on my local system, and my goal is to open a pull request in the near future to include Luxembourgish as an additional language in the espeak-ng package. Concerning my trials to train a Luxembourgish TTS voice (mono- and multispeaker) with very small datasets, I have achieved the best results so far with gruut and non-Coqui-TTS models, namely rhasspy/Glow-TTS for rhasspy/Larynx, Comprehensive-Tacotron2, and the original VITS. The licensing problem raised by @synesthesiam is a strong argument for implementing a clear interface between dataset preprocessing and model training in Coqui-TTS.
-
Hey, I'd like to add to the discussion because I'm facing an issue implementing a new model exactly because of the Trainer API. I implemented the model before the Trainer API was released, around 6-7 months ago; back then, each model had its own training script. I understand and appreciate the effort put into making this a more centralised API. However, as @WeberJulian mentioned, specific to my case (I'll open a separate discussion about it later today), I really have to mess with multiple optimizers, and there's already a pretty heavy
Do you mean here only that there will be a centralised generic Trainer, and afterwards three separate classes (TTS, STT, Vocoder) will inherit from it? I'm not sure if this is what you're asking, but I believe that, for the future agnostic capacity of new model implementations and experimentation, the Trainer API should be split one more level down, with a generic TTS trainer that is inherited by the different TTS models. I understand this may seem like a step backwards towards the per-model trainer files (albeit now with centralised trainer classes inheriting from each other), but to me personally it seems like a more 'open' invitation to implement new models. I find the current state of
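The two-level split suggested above could be sketched roughly like this. All names are hypothetical, chosen only to show the shape of the hierarchy, and the losses are placeholders:

```python
# Level 0: task-agnostic training cycle.
class GenericTrainer:
    def fit(self, batch):
        # a real trainer would loop over epochs/batches here
        return self.train_step(batch)

    def train_step(self, batch):
        raise NotImplementedError


# Level 1: TTS-wide concerns (text handling, audio processing, ...).
class GenericTTSTrainer(GenericTrainer):
    def train_step(self, batch):
        return self.model_step(batch)

    def model_step(self, batch):
        raise NotImplementedError


# Level 2: a model overrides only what it needs, e.g. a VITS-style model
# juggling generator and discriminator optimizers.
class VITSTrainer(GenericTTSTrainer):
    def model_step(self, batch):
        return {"gen_loss": 0.0, "disc_loss": 0.0}  # placeholder losses
```

A model with unusual needs (multiple optimizers, custom schedules) would then override one narrow method instead of fighting the shared training loop.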
-
We want to make the core trainer more agnostic than the way it is right now so that we can add new model categories easily.
I have 3 different ways to do it:
Feel free to share your opinions, and if you have something different in mind, shoot it out 🐸