Multilingual tokenizer #2229

Merged on Jan 2, 2023 (6 commits)

WeberJulian
Contributor

@WeberJulian WeberJulian commented Dec 20, 2022

Add ability to specify language and tokenizer for each dataset.

from TTS.tts.utils.text.phonemizers.multi_phonemizer import MultiPhonemizer

# One sample sentence per language code.
texts = {
    "tr": "Merhaba, bu Türkçe bir örnek!",
    "en-us": "Hello, this is an English example!",
    "de": "Hallo, das ist ein deutsches Beispiel!",
    "zh-cn": "这是中国的例子",
}
phonemes = {}
# Map each language to a phonemizer name; an empty string selects the default for that language.
ph = MultiPhonemizer({"tr": "espeak", "en-us": "", "de": "gruut", "zh-cn": ""})
for lang, text in texts.items():
    phoneme = ph.phonemize(text, lang)
    phonemes[lang] = phoneme
print(phonemes)

You set the language and phonemizer for each dataset in the config/recipe. If no phonemizer is specified, the default phonemizer for that language is used.
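As a rough illustration of the fallback described above, the sketch below fills in a phonemizer name for each language and uses a default when the entry is empty. The `DEFAULT_PHONEMIZERS` table and the `resolve_phonemizers` helper are hypothetical names made up for this example, and the default values are placeholders rather than the library's actual defaults.

```python
from typing import Dict

# Hypothetical defaults, for illustration only; the real per-language
# defaults live in the library and may differ.
DEFAULT_PHONEMIZERS: Dict[str, str] = {
    "en-us": "espeak",
    "de": "gruut",
    "zh-cn": "zh_cn_phonemizer",
}


def resolve_phonemizers(lang_to_name: Dict[str, str]) -> Dict[str, str]:
    """Return a language -> phonemizer-name map with empty entries filled in."""
    resolved = {}
    for lang, name in lang_to_name.items():
        if name:
            # An explicit choice always wins.
            resolved[lang] = name
        elif lang in DEFAULT_PHONEMIZERS:
            # Empty string: fall back to the default for this language.
            resolved[lang] = DEFAULT_PHONEMIZERS[lang]
        else:
            raise ValueError(f"No phonemizer set for '{lang}' and no default is known.")
    return resolved


resolved = resolve_phonemizers({"tr": "espeak", "en-us": "", "de": "gruut", "zh-cn": ""})
print(resolved)
```

Raising on a language with neither an explicit phonemizer nor a default is one possible design choice; it makes a misconfigured dataset fail at setup time instead of during training.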

Also added a recipe to train on mailabs with phonemes (espeak).

@WeberJulian
Contributor Author

I'll add tests when that approach/design is validated.

Contributor

@Edresson Edresson left a comment

Great PR :)

Member

@erogol erogol left a comment

Just need testing for multi-lang phonemizer

@erogol
Member

erogol commented Dec 22, 2022

I approved it but it is still Draft. Is there something more to come?

@WeberJulian WeberJulian marked this pull request as ready for review December 22, 2022 14:31
@WeberJulian
Contributor Author

> I approved it but it is still Draft. Is there something more to come?

I just wanted to see if the tests were passing.
I'm also wondering about text_cleaners: should we allow setting them per dataset, the same way we allow setting the phonemizer for each dataset?
Or do we consider that phonemizers should also do the work that multilingual_cleaners doesn't (like espeak does)?

@MuruganR96

MuruganR96 commented Dec 26, 2022

@WeberJulian This PR is not working for me. Training got stuck (froze) after a few steps.

I tried Multilingual-MultiSpeaker training (English, Tamil, Telugu).

I checked that MultiPhonemizer works on its own, but during training it froze after a certain number of batches.

@erogol
Member

erogol commented Dec 26, 2022

I think we should be able to set different cleaners per dataset.

@MuruganR96

MuruganR96 commented Dec 26, 2022

> I think we should be able to set different cleaners per dataset.

What is the root cause of the training freeze described above? Is there a reason why we should set different cleaners per dataset?

How would this be implemented? I need some initial guidance.

I would need to add a set of different cleaners per dataset in cleaners.py, include text_cleaners in BaseDatasetConfig, and implement MultiCleaners in a similar way in tokenizer.py init_from_config.

I had better move this conversation to an issue.
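One possible shape for the per-dataset cleaner idea discussed above is sketched below. It is only an illustration under stated assumptions: the `MultiCleaner` class, the two cleaner functions, and the dataset names are hypothetical and are not part of the library, and the sketch does not touch cleaners.py, BaseDatasetConfig, or tokenizer.py.

```python
import re
from typing import Callable, Dict


def basic_cleaner(text: str) -> str:
    """Collapse runs of whitespace and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase_cleaner(text: str) -> str:
    """Lowercase the text, then apply the basic whitespace cleanup."""
    return basic_cleaner(text.lower())


class MultiCleaner:
    """Dispatch to a cleaner chosen per dataset, with a shared fallback."""

    def __init__(
        self,
        dataset_to_cleaner: Dict[str, Callable[[str], str]],
        default: Callable[[str], str] = basic_cleaner,
    ) -> None:
        self.dataset_to_cleaner = dict(dataset_to_cleaner)
        self.default = default

    def clean(self, text: str, dataset_name: str) -> str:
        # Datasets without an explicit entry use the shared default cleaner.
        cleaner = self.dataset_to_cleaner.get(dataset_name, self.default)
        return cleaner(text)


# Hypothetical dataset names, for illustration only.
cleaner = MultiCleaner({"english_set": lowercase_cleaner})
print(cleaner.clean("  Hello   World ", "english_set"))
print(cleaner.clean("  Hello   World ", "other_set"))
```

Keying the lookup on the dataset name mirrors how the phonemizer is chosen per dataset in this PR; keying on the language code instead would be an equally plausible design.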

@Edresson
Contributor

> ...different cleaners per dataset in cleaners.py, and BaseDatasetConfig includes text_cleaners and a similar way of implementing MultiCleaners in tokenizer.py init_from_config.
>
> better I will move to issues for this conversation

It would be nice, but I don't think @MuruganR96's issue is related to it. multilingual_cleaners is simple and should be compatible with all languages.

@MuruganR96

Thank you @WeberJulian @erogol @Edresson

The PR is working fine. espeak does not support the Telugu language, and espeak-ng was causing the problem, so I uninstalled espeak-ng. Now I am able to train Multilingual-MultiSpeaker TTS in English and Tamil.
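The report above attributes the freeze to a language the phonemizer backend could not handle, so a check before training along these lines might surface that kind of problem earlier. This is a sketch under assumptions: `find_unsupported_languages` and the values passed to it are hypothetical, and the supported set would have to be obtained from the backend itself (for espeak-ng, for example, from the output of `espeak-ng --voices`).

```python
from typing import Iterable, List, Set


def find_unsupported_languages(requested: Iterable[str], supported: Set[str]) -> List[str]:
    """Return the requested language codes missing from the supported set, in order."""
    return [lang for lang in requested if lang not in supported]


# Illustrative values only: the supported set must come from the backend in use.
supported_by_backend = {"en", "ta"}
dataset_languages = ["en", "ta", "te"]

missing = find_unsupported_languages(dataset_languages, supported_by_backend)
if missing:
    print(f"Unsupported languages, fix before training: {missing}")
```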

@erogol erogol merged commit a073977 into dev Jan 2, 2023
@erogol erogol deleted the multilingual-tokenizer branch January 2, 2023 09:03
@Msahiitrpr

@erogol @Edresson
espeak-ng is not supporting the training language. Is there any specific reason? How can I solve that?

5 participants