Replies: 6 comments 11 replies
-
I think dividing them into separate trainers would be good, to avoid adding complexity to the trainer every time we have a different or specific use-case. It's already a pretty big file (1.2k lines). Plus it's easier to maintain without the if/else branches that arise from supporting all the different usages. (It's just my opinion; maybe option 2 would achieve the same goal.)
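To illustrate the point about if/else complexity, here is a minimal sketch. All class and method names are invented for illustration, not Coqui-TTS's actual API: a single trainer accumulates a branch per task, while separate trainers keep each use-case self-contained.

```python
# Hypothetical sketch, not the real Trainer: one class, one branch per task.
class MonolithicTrainer:
    def train_step(self, batch, task):
        if task == "tts":
            return {"loss": sum(batch)}      # placeholder TTS step
        elif task == "vocoder":
            return {"loss": max(batch)}      # placeholder vocoder step
        raise ValueError(f"unknown task: {task}")


# Versus separate trainers: each subclass owns its own step, no branching.
class BaseTrainer:
    def train_step(self, batch):
        raise NotImplementedError


class TTSTrainer(BaseTrainer):
    def train_step(self, batch):
        return {"loss": sum(batch)}


class VocoderTrainer(BaseTrainer):
    def train_step(self, batch):
        return {"loss": max(batch)}
```

Every new use-case then becomes a new subclass instead of another elif in a 1.2k-line file.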
-
I have been away for a while; could you explain what BaseSTT is? If that means STT being merged with TTS (even partially), I don't see the point of adding it to Coqui-TTS, since Coqui already has a separate STT module, coqui-STT. Coming to the point under discussion: in my humble opinion, why not stick with things the way they are? Why make architectural changes in every new update and reformat code, creating bugs that are not foreseeable and are hard to track down unless rigorously tested on all TTS and vocoder models? I think these types of changes facilitate only coders, not the community. Coqui-TTS has an expanding community, and many of its members are not coders or programmers, so to speak, but use Coqui for research and to build different applications. For example, look how much confusion coqpit alone created around configuration (though it was a necessary evil). To summarize: I think these changes should be decided once and then kept as they are.
-
It is not being merged into TTS. I'm just experimenting with new ideas about the integration between TTS and STT systems. For instance, we could use STT models to detect the best TTS model in your training. For such things, we need some STT work here.
🐸TTS has not reached a stable v1.0 yet. Until then, things are open to radical changes, API improvements, and maybe (I hope not) a complete rewrite. The good thing about this dynamic work is that contributors and users have more opportunities to influence the design of the library they like and use, instead of being forced to use something preset for them. So feel free to share your ideas and suggestions. Even better, send ✨PRs✨ to make 🐸TTS even more "you"-friendly. Trainer V2 is necessary to decouple the dependencies between the different tasks and the training cycle, so that we can separate the engineering effort from the research. It also opens the door to different models, architectures, and domains using the 🐸TTS backbone.
Confusion is a price worth paying for changes that make the library better in the long run. We can't stop developing the library just to avoid confusion.
Yes; after we release v1.0, but before the stable version, it is different.
-
I would like to put in my two cents. Concerning the three different ways to make the core trainer more agnostic, I am in favor of the first option.
Concerning other options, I have a personal preference for a modified TTS workchain. I suggest separating preprocessing from the training process and creating a clear interface between the two. To my understanding, all Coqui TTS models are trained on integer-based tensors. I think it would be useful to make this more transparent to users. Michael Hansen (@synesthesiam) opted for such an interface between larynx and gruut for his rhasspy project. The transformation process (preprocessing) that phonemizes text and converts characters, phonemes, symbols, or emojis into integer IDs for training has its own complexity. Preprocessing a dataset for TTS and training a TTS model require different skills. Creating a clear interface between the two worlds would allow users to specialise in one of these domains, and sharing datasets as integer IDs, instead of raw text, would facilitate the training of TTS models.
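To make the idea concrete, here is a minimal sketch of such an interface. The phoneme inventory and mapping below are invented for illustration; real tooling (gruut, phonemes2ids, espeak-ng) handles this far more carefully.

```python
# Hypothetical preprocessing/training boundary: text is phonemized and
# mapped to integer IDs *before* it ever reaches a trainer, so the trainer
# only sees integer sequences. The phoneme set here is made up.
PHONEME_TO_ID = {"<pad>": 0, "h": 1, "ə": 2, "l": 3, "oʊ": 4}


def phonemes_to_ids(phonemes):
    """Convert a phoneme sequence into the integer IDs a model trains on."""
    return [PHONEME_TO_ID[p] for p in phonemes]


ids = phonemes_to_ids(["h", "ə", "l", "oʊ"])
# A trainer would consume `ids` (e.g. as an integer tensor), never raw text,
# so a dataset shared as ID sequences is directly trainable.
```

With this split, someone who only knows phonemization can publish ID datasets, and someone who only knows model training can consume them.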
-
I started my TTS trials a few months ago with the csv2phonemeids script included in rhasspy/gruut. I recently checked out the phonemes2ids library when it was released in August this year. espeak-phonemizer is my favorite tool for converting datasets for multilingual TTS models, mainly with German, French, and Dutch; the delimited CSV input flag is very useful for these tasks. espeak-phonemizer was an incentive for me to resume my old experiments, started in 2014, on adding Luxembourgish as a language to espeak-ng. I now have a first prototype running on my local system, and my goal is to open a pull request in the near future to include Luxembourgish as an additional language in the espeak-ng package. Concerning my trials to train a Luxembourgish TTS voice (mono- and multispeaker) with very small datasets, I have achieved the best results so far with gruut and non-Coqui-TTS models, namely rhasspy/Glow-TTS for rhasspy/Larynx, Comprehensive-Tacotron2, and the original VITS. The licensing problem raised by @synesthesiam is a strong argument for implementing a clear interface between dataset preprocessing and model training in Coqui-TTS.
-
Hey, I'd like to add to the discussion because I'm facing an issue implementing a new model exactly because of the Trainer API. I implemented the model before the Trainer API was released, around 6-7 months ago; back then, each model had its own training script. I understand and appreciate the effort put into making this a more centralised API. However, as @WeberJulian mentioned, specific to my case (I'll open a separate discussion about it later today), I really have to mess with multiple optimizers, and there's already a pretty heavy
Do you mean here only that there will be a centralised generic Trainer, and afterwards three separate classes (TTS, STT, Vocoder) will inherit from it? I'm not sure if this is what you're asking, but I believe that, for the future agnostic capacity of new model implementations and experimentation, the Trainer API should be split one more level down, with a generic TTS trainer that is inherited by the different TTS models. I understand this may seem like a step backwards towards the per-model trainer files (albeit now with centralised trainer classes inheriting from each other), but to me personally it seems like a more 'open' invitation to implement new models. I find the current state of
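The two-level split suggested above could be sketched roughly like this. All names are hypothetical, chosen only to show the shape of the hierarchy, and the losses are placeholders:

```python
# Level 0: task-agnostic training cycle.
class GenericTrainer:
    def fit(self, batch):
        # a real trainer would loop over epochs/batches here
        return self.train_step(batch)

    def train_step(self, batch):
        raise NotImplementedError


# Level 1: TTS-wide concerns (text handling, audio processing, ...).
class GenericTTSTrainer(GenericTrainer):
    def train_step(self, batch):
        return self.model_step(batch)

    def model_step(self, batch):
        raise NotImplementedError


# Level 2: a model overrides only what it needs, e.g. a VITS-style model
# juggling generator and discriminator optimizers.
class VITSTrainer(GenericTTSTrainer):
    def model_step(self, batch):
        return {"gen_loss": 0.0, "disc_loss": 0.0}  # placeholder losses
```

A model with unusual needs (multiple optimizers, custom schedules) would then override one narrow method instead of fighting the shared training loop.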
-
We want to make the core trainer more agnostic than the way it is right now so that we can add new model categories easily.
I have 3 different ways to do it:
Feel free to share your opinions, and if you have something different in mind, shoot it out 🐸