New Tokenizer API #937

erogol · 2021-11-16T12:53:49Z

Tokenizer API

The Tokenizer API is defined by the TTSTokenizer class. It is intended to provide all the text processing functionality a TTS model needs. New tokenizers can be added by subclassing TTSTokenizer.
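
To make the idea concrete, here is a minimal, self-contained sketch of what a tokenizer of this kind does: clean the text, optionally phonemize it, then map each symbol to an ID. `SketchTokenizer` and its constructor arguments are hypothetical names chosen for illustration; this is not the TTSTokenizer implementation from this PR.

```python
from typing import Callable, Dict, List, Optional


class SketchTokenizer:
    """Hypothetical stand-in for a TTSTokenizer-like class (not the real API)."""

    def __init__(
        self,
        vocab: List[str],
        text_cleaner: Optional[Callable[[str], str]] = None,
        phonemizer: Optional[Callable[[str], str]] = None,
    ):
        # Two lookup tables: symbol -> ID and ID -> symbol.
        self.char_to_id: Dict[str, int] = {c: i for i, c in enumerate(vocab)}
        self.id_to_char: Dict[int, str] = {i: c for i, c in enumerate(vocab)}
        self.text_cleaner = text_cleaner
        self.phonemizer = phonemizer

    def text_to_ids(self, text: str) -> List[int]:
        # 1. normalize, 2. optionally phonemize, 3. map symbols to IDs.
        if self.text_cleaner is not None:
            text = self.text_cleaner(text)
        if self.phonemizer is not None:
            text = self.phonemizer(text)
        # This sketch raises KeyError on symbols that are not in the vocabulary.
        return [self.char_to_id[ch] for ch in text]

    def ids_to_text(self, ids: List[int]) -> str:
        return "".join(self.id_to_char[i] for i in ids)


tokenizer = SketchTokenizer(
    vocab=list("abcdefghijklmnopqrstuvwxyz ,.!?"),
    text_cleaner=str.lower,
)
ids = tokenizer.text_to_ids("Hello, world!")
decoded = tokenizer.ids_to_text(ids)
print(ids)      # [7, 4, 11, 11, 14, 27, 26, 22, 14, 17, 11, 3, 29]
print(decoded)  # hello, world!
```

Keeping the cleaner and the phonemizer as pluggable callables is one way to let a subclass or a config swap either stage without touching the ID mapping.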

Phonemizer API

The Phonemizer API is defined by the BasePhonemizer class and implemented by the ESpeak and Gruut wrappers and by the ZH_CH and JA_JP phonemizers. New phonemizers can be added by subclassing BasePhonemizer.
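
As a rough illustration of the subclassing pattern, the sketch below defines an abstract base with one method to implement and a toy lookup-based phonemizer on top of it. `SketchBasePhonemizer`, `ToyLookupPhonemizer`, the method names, and the tiny lexicon are all assumptions made for this example; the actual BasePhonemizer interface in this PR may differ.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchBasePhonemizer(ABC):
    """Hypothetical base class: subclasses only implement name() and _phonemize()."""

    def __init__(self, language: str):
        self.language = language

    @staticmethod
    @abstractmethod
    def name() -> str:
        """Short identifier of the backend."""

    @abstractmethod
    def _phonemize(self, text: str, separator: str) -> str:
        """Backend-specific conversion from text to a phoneme string."""

    def phonemize(self, text: str, separator: str = "|") -> str:
        # Shared pre-processing lives in the base class.
        return self._phonemize(text.strip(), separator)


class ToyLookupPhonemizer(SketchBasePhonemizer):
    # Illustrative two-word lexicon, not a real pronunciation dictionary.
    LEXICON: Dict[str, List[str]] = {
        "hello": ["h", "ə", "l", "oʊ"],
        "world": ["w", "ɜː", "l", "d"],
    }

    @staticmethod
    def name() -> str:
        return "toy_lookup"

    def _phonemize(self, text: str, separator: str) -> str:
        words = []
        for word in text.lower().split():
            # Fall back to spelling the word out when it is not in the lexicon.
            phones = self.LEXICON.get(word, list(word))
            words.append(separator.join(phones))
        return " ".join(words)


phonemizer = ToyLookupPhonemizer(language="en-us")
phonemes = phonemizer.phonemize("Hello world")
print(phonemes)  # h|ə|l|oʊ w|ɜː|l|d
```

A wrapper around an external tool would follow the same shape, with `_phonemize` calling the backend instead of a dictionary lookup.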

BaseCharacters

The BaseCharacters class provides an API to define the model vocabulary and the dictionaries that map characters to token IDs and back. Two pre-defined classes inherit from BaseCharacters: IPAPhonemes, which defines the IPA phoneme character set for models that use phonemes, and Graphemes, which defines the grapheme set for models that use raw characters.
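
The vocabulary idea can be sketched in a few lines: build an ordered symbol list, reject duplicates, and keep two lookup tables. `SketchCharacters`, its arguments, and the `<PAD>` token are hypothetical and only illustrate the mapping; they are not the BaseCharacters signature.

```python
from typing import Dict, List


class SketchCharacters:
    """Hypothetical vocabulary holder mapping symbols to token IDs and back."""

    def __init__(self, characters: str, punctuations: str = "", pad: str = "<PAD>"):
        self.vocab: List[str] = [pad] + list(characters) + list(punctuations)
        # A duplicated symbol would make the symbol -> ID mapping ambiguous.
        duplicates = sorted({s for s in self.vocab if self.vocab.count(s) > 1})
        if duplicates:
            raise ValueError(f"Duplicate symbols in vocabulary: {duplicates}")
        self._char_to_id: Dict[str, int] = {s: i for i, s in enumerate(self.vocab)}
        self._id_to_char: Dict[int, str] = {i: s for i, s in enumerate(self.vocab)}

    @property
    def num_chars(self) -> int:
        return len(self.vocab)

    def char_to_id(self, char: str) -> int:
        return self._char_to_id[char]

    def id_to_char(self, idx: int) -> str:
        return self._id_to_char[idx]


graphemes = SketchCharacters(characters="abc", punctuations="!? ")
print(graphemes.vocab)            # ['<PAD>', 'a', 'b', 'c', '!', '?', ' ']
print(graphemes.char_to_id("b"))  # 2
print(graphemes.id_to_char(2))    # b
```

A phoneme set and a grapheme set would then differ only in the symbols passed to the constructor.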

Punctuation class

The Punctuation class strips punctuation marks from the text and restores them when needed.
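
One possible way to implement strip-and-restore is to split the text on punctuation runs, process the text chunks, and then interleave the saved marks again. The sketch below assumes that approach; `SketchPunctuation` and its method names are hypothetical, and the class in this PR may work differently.

```python
import re
from typing import List, Tuple


class SketchPunctuation:
    """Hypothetical helper that removes punctuation and can put it back."""

    def __init__(self, puncs: str = "!?.,;:"):
        # The capturing group makes re.split() return the marks as well.
        self._pattern = re.compile(rf"(\s*[{re.escape(puncs)}]+\s*)")

    def strip_to_restore(self, text: str) -> Tuple[List[str], List[str]]:
        pieces = self._pattern.split(text)
        # Even indices hold text chunks, odd indices hold punctuation runs.
        return pieces[0::2], pieces[1::2]

    def strip(self, text: str) -> str:
        chunks, _ = self.strip_to_restore(text)
        return " ".join(chunk for chunk in chunks if chunk)

    @staticmethod
    def restore(chunks: List[str], marks: List[str]) -> str:
        # There is always exactly one more chunk than there are marks.
        out = [chunks[0]]
        for mark, chunk in zip(marks, chunks[1:]):
            out.append(mark)
            out.append(chunk)
        return "".join(out)


punct = SketchPunctuation()
sentence = "Well, hello there! How are you?"
chunks, marks = punct.strip_to_restore(sentence)
stripped = punct.strip(sentence)
# The chunks can be transformed (here: upper-cased) before the marks are restored.
restored = punct.restore([c.upper() for c in chunks], marks)
print(stripped)  # Well hello there How are you
print(restored)  # WELL, HELLO THERE! HOW ARE YOU?
```

This is useful when a phonemizer backend drops punctuation: the marks are set aside first and re-inserted around the phonemized chunks.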

Language-specific text normalization routines under TTS.tts.utils.text

Under TTS.tts.utils.text there is a folder for each language that holds the text normalization routines designed for that language.
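
A per-language layout like this pairs naturally with a small dispatch table. The sketch below is only an illustration of that idea: the function names, the language keys, and the two toy rules are assumptions, not the routines shipped in the package.

```python
import re
from typing import Callable, Dict


def normalize_english(text: str) -> str:
    # Toy rule: expand one abbreviation, then collapse whitespace.
    text = text.lower()
    text = re.sub(r"\bdr\.", "doctor", text)
    return re.sub(r"\s+", " ", text).strip()


def normalize_german(text: str) -> str:
    # Toy rule: expand "z.B." ("zum Beispiel"), then collapse whitespace.
    text = text.lower().replace("z.b.", "zum beispiel")
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical registry: one normalization routine per language code.
NORMALIZERS: Dict[str, Callable[[str], str]] = {
    "en": normalize_english,
    "de": normalize_german,
}


def normalize(text: str, language: str) -> str:
    try:
        routine = NORMALIZERS[language]
    except KeyError:
        raise ValueError(f"No normalization routine for language '{language}'") from None
    return routine(text)


english = normalize("Dr.  Smith is here", "en")
german = normalize("Das ist z.B. gut", "de")
print(english)  # doctor smith is here
print(german)   # das ist zum beispiel gut
```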

GlowTTS recipe and model using the new API

Other models are not yet compatible with the new API.

TODO:

synesthesiam · 2021-11-19T16:24:48Z

I'll take a look this weekend, @erogol 👍

stale · 2022-01-31T17:58:06Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

thorstenMueller · 2022-01-31T19:09:05Z

Still relevant.

stale · 2022-03-03T11:26:59Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

erogol · 2022-03-06T12:39:01Z

This is already merged after rebasing

erogol added 19 commits November 16, 2021 13:23

Implement ZH_CH phonemizer

5b3ae7e

Implement JA_JP phonemizer

d97e7ad

Implement gruut wrapper

c6e0c30

Implement espeak wrapper

7a1b51a

Implement multi-phonemizer

97c4fd8

Create text/english folder

da7a992

Implement BasePhonemizer

9ba0bb3

Create language folders under TTS.tts.utils.text

d3e5b01

Implement BaseCharacters, IPAPhonemes, Graphemes

ca85264

Implement TTSTokenizer

9f91efb

Fix imports in cleaners.py

12c2624

Implement Punctuation class

a40c126

Refactor Synthesizer class for TTSTokenizer

80f19ba

Add init_from_config to AudioProcessor

8f9fb62

Remove OLD TOKENIZATION ROUTINES

03b3cb7

Refactor TTSDataset to use TTSTokenizer

7182cba

Style fix

2a90d48

Refactor synthesis.py for TTSTokenizer

c82d67c

Refactor GlowTTS model and recipe for TTSTokenizer

cf5d91a

erogol requested a review from synesthesiam November 16, 2021 12:53

erogol added 9 commits November 17, 2021 12:43

Test character classes

6aae3a8

Update imports for symbols -> characters

826dd2f

Test Phonemizers

1bc7494

Add doc examples

1e3392e

Fix ja_jp_phonemizer

27c6acf

Fixup

d69bb29

Fix BasePhonemizer

d4e70f3

Fix Punctuation

6093b86

Test punctuations

a4147f9

erogol added 16 commits November 24, 2021 18:42

Print duplicate characters

b2e9f3c

Fix GlowTTS

492b6b9

Remove get_characters from BaseTTS

8d848f7

Update config fields for phonemizer

86f1dc9

Add init_from_config as an abstract class

d530e95

Update setup_model for TTS.tts models

21c1fdd

Make style

e9a4d29

Update EspeakWrapper for espeak-ng

913163a

Discard OOV chars in tokenizer

f0ec6a3

Discard but store OOV chars, with a warning message when an OOV char is first encountered

Fix IPAPhonemes init_from_config

a1f0315

Add OOV case to tokenizer tests

0fcd44a

Fix print_logs

34134dd

Fix espeak wrapper cmd call

1558ea8

Make style

93d9069

Fix the wrong default loss name for GAN models

ca085d9

Allow init_from_config from model or audio config

3949412

stale bot added the wontfix label Dec 30, 2021

coqui-ai deleted a comment from stale bot Jan 1, 2022

stale bot removed the wontfix label Jan 1, 2022

erogol mentioned this pull request Jan 7, 2022

[Bug] Document proper handling of digraphs #1072

Closed

stale bot added the wontfix label Jan 31, 2022

stale bot removed the wontfix label Jan 31, 2022

stale bot added the wontfix label Mar 3, 2022

erogol removed the wontfix label Mar 6, 2022

erogol closed this Mar 6, 2022

erogol deleted the TokenizerAPI branch May 9, 2022 08:13
