New Tokenizer API #937

Closed
wants to merge 54 commits into from

erogol
Member

@erogol erogol commented Nov 16, 2021

  • Tokenizer API

    The Tokenizer API is defined by the TTSTokenizer class. It provides all the text processing functionality that a TTS model needs. New tokenizers can be added by subclassing TTSTokenizer.

  • Phonemizer API

    The Phonemizer API is defined by the BasePhonemizer class and implemented by the ESpeak and Gruut wrappers and by the ZH_CH and
    JP_JA phonemizers. New phonemizers can be added by implementing the BasePhonemizer class.

  • BaseCharacters

    The BaseCharacters class provides an API to define the model vocabulary and the dictionary that maps characters to
    token IDs and back. Two predefined classes inherit from BaseCharacters: IPAPhonemes, which defines the IPA phoneme character set for models using phonemes, and Graphemes, which defines the grapheme set for models using raw characters.

  • Punctuations class

    The Punctuations class strips punctuation from the text and restores it when needed.

  • Language specific text normalization routines under TTS.tts.utils.text

    TTS.tts.utils.text contains one folder per language, each holding the text normalization routines
    designed for that language.

  • GlowTTS recipe and model using the new API

    Other models are not yet compatible with the new API.
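
To illustrate how the pieces described above fit together, here is a minimal, self-contained sketch. It is not the 🐸TTS implementation: ToyCharacters, ToyPunctuation, tokenize, and all their method names and parameters are hypothetical stand-ins, assumed only for illustration. It shows a vocabulary that maps characters to token IDs and back, a punctuation helper that strips marks and can restore them later, and a tokenizer function that combines the two.

```python
import re
from typing import List, Tuple


class ToyCharacters:
    """Hypothetical vocabulary class (assumption, not the real BaseCharacters)."""

    def __init__(self, characters: str, pad: str = "<PAD>", unk: str = "<UNK>") -> None:
        # Special tokens come first, followed by the sorted unique characters.
        self.vocab: List[str] = [pad, unk] + sorted(set(characters))
        self.char_to_id = {c: i for i, c in enumerate(self.vocab)}
        self.unk_id = self.char_to_id[unk]

    def encode(self, text: str) -> List[int]:
        # Characters outside the vocabulary map to the unknown-token ID.
        return [self.char_to_id.get(c, self.unk_id) for c in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(self.vocab[i] for i in ids)


class ToyPunctuation:
    """Hypothetical helper that strips punctuation and restores it later."""

    def __init__(self, puncs: str = "!?.,;:") -> None:
        # The capturing group makes re.split keep the matched punctuation.
        self.pattern = re.compile(r"(\s*[" + re.escape(puncs) + r"]+\s*)")

    def strip_to_restore(self, text: str) -> Tuple[List[str], List[str]]:
        # re.split returns [text, punct, text, punct, ..., text].
        parts = self.pattern.split(text)
        return parts[0::2], parts[1::2]

    def restore(self, chunks: List[str], marks: List[str]) -> str:
        out: List[str] = []
        for i, chunk in enumerate(chunks):
            out.append(chunk)
            if i < len(marks):
                out.append(marks[i])
        return "".join(out)


def tokenize(text: str, characters: ToyCharacters, punctuation: ToyPunctuation) -> List[int]:
    """Strip punctuation, lowercase the remaining text, and encode it as token IDs."""
    chunks, _ = punctuation.strip_to_restore(text)
    cleaned = " ".join(chunk for chunk in chunks if chunk)
    return characters.encode(cleaned.lower())
```

In this sketch, a phoneme-based model would differ only in the character set passed to the vocabulary and in an extra phonemization step before encoding; the real classes in this PR may structure these steps differently.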

TODO:

  • Add Tokenizer tests
  • Add Punctuation tests
  • Add Phonemizer tests
  • Add Characters tests
  • Update rest of 🐸TTS for the new APIs

@erogol erogol requested a review from synesthesiam November 16, 2021 12:53
@synesthesiam
Contributor

I'll take a look this weekend, @erogol 👍

@stale stale bot added the wontfix label Dec 30, 2021
@coqui-ai coqui-ai deleted a comment from stale bot Jan 1, 2022
@stale stale bot removed the wontfix label Jan 1, 2022
@stale

stale bot commented Jan 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

@stale stale bot added the wontfix label Jan 31, 2022
@thorstenMueller
Contributor

Still relevant.

@stale stale bot removed the wontfix label Jan 31, 2022
@stale

stale bot commented Mar 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

@stale stale bot added the wontfix label Mar 3, 2022
@erogol erogol removed the wontfix label Mar 6, 2022
@erogol
Member Author

erogol commented Mar 6, 2022

This was already merged after rebasing.

@erogol erogol closed this Mar 6, 2022
@erogol erogol deleted the TokenizerAPI branch May 9, 2022 08:13
3 participants