Multilingual tokenizer #2229
I'll add tests once this approach/design is validated.
Great PR :)
It just needs tests for the multi-language phonemizer.
I approved it, but it is still a draft. Is there something more to come?
I just wanted to see if the tests were passing.
@WeberJulian This PR is not working for me. Training got stuck (froze) a few steps in. I tried multilingual, multi-speaker training (English, Tamil, Telugu). I checked that MultiPhonemizer works, but during training it froze after certain batches.
I think we should be able to set different cleaners per dataset.
What is the root cause of the stuck-training issue above? Is there a reason why we should set different cleaners per dataset, and how would that be implemented? I need some initial guidance. As I understand it, I would need to add a set of per-dataset cleaners in cleaners.py and have BaseDatasetConfig include text_cleaners. It would be better to move this conversation to an issue.
It would be nice, but I don't think @MuruganR96's issue is related to it. multilingual_cleaners is simple and should be compatible with all languages.
Thank you @WeberJulian @erogol @Edresson. The PR is working fine. espeak does not support the Telugu language, and espeak-ng caused this problem, so I uninstalled espeak-ng. Now I am able to train multilingual, multi-speaker TTS in English and Tamil.
Add ability to specify language and tokenizer for each dataset.
You set `language` and `phonemizer` for each dataset in the config/recipe. If `phonemizer` is not specified, the default phonemizer for that language is used.
Also added a recipe to train on mailabs with phonemes (espeak).
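The fallback rule described above can be illustrated with a minimal, self-contained sketch. It does not use the library's own classes: `DatasetEntry`, `DEFAULT_PHONEMIZER_BY_LANGUAGE`, `resolve_phonemizers`, the dataset names, and the `my_custom_phonemizer` value are all hypothetical names introduced for illustration, and the real defaults in the project may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical default table (an assumption for this sketch);
# the project keeps its own language-to-phonemizer mapping.
DEFAULT_PHONEMIZER_BY_LANGUAGE: Dict[str, str] = {
    "en": "espeak",
    "ta": "espeak",
}


@dataclass
class DatasetEntry:
    """Hypothetical stand-in for one dataset entry in a config/recipe."""

    name: str
    language: str
    phonemizer: Optional[str] = None  # None means "use the language default"


def resolve_phonemizers(datasets: List[DatasetEntry]) -> Dict[str, str]:
    """Return the phonemizer name chosen for each dataset, keyed by dataset name."""
    resolved: Dict[str, str] = {}
    for ds in datasets:
        if ds.phonemizer is not None:
            # An explicit per-dataset setting always wins.
            resolved[ds.name] = ds.phonemizer
        elif ds.language in DEFAULT_PHONEMIZER_BY_LANGUAGE:
            # Otherwise fall back to the default for the dataset's language.
            resolved[ds.name] = DEFAULT_PHONEMIZER_BY_LANGUAGE[ds.language]
        else:
            # Fail early rather than silently skipping phonemization.
            raise ValueError(
                f"No phonemizer given and no default known for language '{ds.language}'"
            )
    return resolved


datasets = [
    DatasetEntry(name="mailabs_en", language="en"),
    DatasetEntry(name="custom_ta", language="ta", phonemizer="my_custom_phonemizer"),
]
choices = resolve_phonemizers(datasets)
```

Raising an error for a language without a known default is a design choice of this sketch. It surfaces an unsupported language when the config is loaded instead of partway through training.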