
[Bug] Document proper handling of digraphs #1072

Closed
roedoejet opened this issue Jan 6, 2022 · 8 comments
Labels
bug: Something isn't working
wontfix: This will not be worked on but feel free to help.

Comments

@roedoejet

roedoejet commented Jan 6, 2022

🐛 Description

I am trying to run 🐸TTS on new languages and came across a bug, or at least something I think could be improved in the documentation for new languages. If I have missed something, please point it out to me!

The language I am working with has digraphs in its character set; that is, ɡʷ is a separate character from ɡ. In TTS.tts.utils.text.text_to_sequence, the raw text is transformed into a sequence of character indices, but the function that turns the cleaned text into the sequence (TTS.tts.utils.text._symbols_to_sequence) simply iterates through the string character by character, so when processing ɡʷ, the index for ɡ is returned and ʷ is discarded by _should_keep_symbol.

It appears there is a way to handle Arpabet digraphs using curly braces. Another way of handling this would be to sort the list of characters by length in descending order, tokenize the raw text according to that sorted list, and pass the tokenized list to TTS.tts.utils.text._symbols_to_sequence. This is what I'm currently doing in a custom cleaner, but is this the "correct" or intended way of handling this?

I would also love it if the tensorboard log tracked text, for example by logging a comparison of the raw text with the text reconstructed from the sequence, as shown in the unittest below. That would have saved me from training a few models and only discovering the bug by listening to the audio.
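As a standalone illustration of that longest-match idea (a hypothetical helper sketch, not 🐸TTS code):

def tokenize_longest_match(text, symbols):
    # Try longer symbols first, so "ɡʷ" wins over its prefix "ɡ".
    ordered = sorted(symbols, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for sym in ordered:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            i += 1  # silently skip anything outside the symbol set
    return tokens

# tokenize_longest_match("ɡʷah", ["a", "h", "ɡ", "ɡʷ"]) == ["ɡʷ", "a", "h"]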

To Reproduce

from unittest import TestCase

from TTS.tts.utils.text import text_to_sequence, sequence_to_text
from TTS.tts.utils.text.symbols import make_symbols


class TestMohawkCharacterInputs(TestCase):
    def setUp(self) -> None:
        self.mohawk_characters = {
            "pad": "_",
            "eos": "~",
            "bos": "^",
            # Longest symbols first, so a longest-match tokenizer can match
            # digraphs like "ɡʷ" before their single-character prefixes.
            "characters": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '], key=len, reverse=True),
            "punctuations": "!(),-.;? ",
            "phonemes": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'], key=len, reverse=True),
            "unique": True,
        }
        self.custom_symbols = make_symbols(**self.mohawk_characters)[0]
        self.mohawk_test_text = ["ɡʷah"]
        self.cleaners = ["basic_cleaners"]

    def test_text_parity(self):
        # Round trip: text -> sequence -> text should be lossless.
        for utt in self.mohawk_test_text:
            seq = text_to_sequence(utt, cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False)
            text = sequence_to_text(seq, tp=self.mohawk_characters, add_blank=False, custom_symbols=self.custom_symbols)
            self.assertEqual(text, utt)

Expected behavior

With the above unittest, text_to_sequence("ɡʷah", cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False) returns [45, 26, 30] when it should return [24, 26, 30]. It would be nice if digraphs were handled by default, but barring that, how they should be handled ought to be documented somewhere (here?), and ideally tensorboard would run some sanity check on the input text.
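A minimal sketch of the kind of check that could be logged (hypothetical helper, not existing 🐸TTS behaviour; encode/decode stand in for text_to_sequence/sequence_to_text with their arguments already bound):

def check_round_trip(text, encode, decode):
    # A mismatch means symbols were silently dropped or merged during encoding.
    reconstructed = decode(encode(text))
    if reconstructed != text:
        print(f"[!] round-trip mismatch: {text!r} -> {reconstructed!r}")
    return reconstructed == text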

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla V100-SXM2-16GB"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.1+cu102",
        "TTS": "0.5.0",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021"
    }
}

Additional context

Many thanks to the authors for this excellent project!

@erogol
Member

erogol commented Jan 7, 2022

Can you somehow give me an example comparing the sound of g with g"?

I wonder if the difference can be learned by the model based on the context.

Currently, our tokenizer has no intended way to handle digraphs. However, you could check #937 and suggest a solution, since you are already working on the problem.

@roedoejet
Author

Unfortunately, I don't think it would be so easy to learn from context. First, the languages I am working with are low-resource, and second, these digraphs are often phonemic, so minimal pairs exist: for example, ɡi and ɡʷi are both valid words that would be encoded identically by this model but have different pronunciations.
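A toy illustration of that collision (the indices are made up; this mimics the character-by-character iteration in _symbols_to_sequence):

index = {"ɡ": 24, "i": 31}

def naive_encode(text):
    # The unknown "ʷ" is silently dropped, just as _should_keep_symbol drops it.
    return [index[ch] for ch in text if ch in index]

assert naive_encode("ɡi") == naive_encode("ɡʷi") == [24, 31]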

I definitely think that if the model can't handle digraphs yet, it would be a good idea to state that in the documentation section for adding a new language.

Here is the cleaner I wrote that solves the problem, though I haven't fixed the other issue yet (#1075). It's not very DRY to have to repeat the characters from the configuration, but I couldn't see a simple way to pass the config to all the places the cleaner is used. I will have a look at #937 and see whether there is a good way to integrate this functionality.

import re

from nltk.tokenize import RegexpTokenizer

from TTS.tts.utils.text.symbols import make_symbols


def mohawk_cleaners(text):
    mohawk_characters = {
        "pad": "_",
        "eos": "~",
        "bos": "^",
        "characters": ['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '],
        "punctuations": "!(),-.;? ",
        "phonemes": ['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'],
        "unique": True,
    }
    symbols = make_symbols(**mohawk_characters)[0]
    # Longest symbols first so the regex alternation matches "ɡʷ" before "ɡ";
    # re.escape guards against regex metacharacters such as "(" or "?" that
    # may end up in the pattern.
    pattern = "|".join(re.escape(s) for s in sorted(symbols, key=len, reverse=True))
    tokenizer = RegexpTokenizer(pattern)
    return tokenizer.tokenize(text)
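With the symbol set above, the cleaner should keep digraphs intact instead of dropping the modifier:

mohawk_cleaners("ɡʷah")  # -> ['ɡʷ', 'a', 'h'] rather than ['ɡ', 'a', 'h'] with "ʷ" lost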

@roedoejet
Author

Is this the relevant PR? I can make some suggestions and PR into that branch.

@erogol
Member

erogol commented Jan 10, 2022

Exactly. You can send your PR and start suggesting changes there.

@erogol
Member

erogol commented Jan 14, 2022

The reason that " in g" is ignored is that it is not defined in the default character or phoneme list.

How about appending " to the phoneme list by defining your own custom phoneme list? Would that work?
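For concreteness, a sketch of that suggestion, assuming the intended symbol is ʷ and reusing the dict from the report above:

# Register "ʷ" as its own symbol by appending it to the inventories, so it is
# no longer filtered out by _should_keep_symbol.
mohawk_characters["phonemes"].append("ʷ")
mohawk_characters["characters"].append("ʷ")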

@roedoejet
Author

I'm a bit confused, because you are typing g" (\u0067\u0022) whereas the example I gave is ɡʷ (\u0261\u02B7). Either way, just appending ʷ (\u02B7) to the phoneme list string would solve the problem in a way, but it would encode it as a separate input, which I don't think makes much sense. Theoretically a digraph is still a single time step, so encoding each graph as its own timestep irks me a bit, even if the model can recover given enough data. It's likely to adversely affect my experiments, though, given that most of my datasets are between 30 minutes and 4 hours; every little bit of help they can get counts!

The second reason I don't like that solution is that some of my experiments involve changing the input encoding from one-hot embeddings to multi-hot embeddings based on phonological feature vectors (see Gutkin 2017 or Wells & Richmond 2021). This technique normalizes the input space and makes transfer learning easier for low-resource TTS, and it requires IPA symbols to be tokenized properly (i.e. not character by character) so that each one can be converted into a multi-hot phonological feature vector. I've already implemented this in ming024's FastSpeech2 implementation, and I hope to eventually submit a PR adding a hyperparameter to 🐸TTS for switching between one-hot and multi-hot input encodings. The first step, though, is proper tokenization of a character set whose units are not single characters.
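To illustrate the idea, a toy sketch only; the feature inventory and values below are made up, not a real phonological analysis of Mohawk:

import numpy as np

FEATURES = ["consonantal", "voiced", "labial"]  # hypothetical feature inventory
FEATURE_TABLE = {
    "ɡ":  [1, 1, 0],
    "ɡʷ": [1, 1, 1],  # labialized: one token, one time step
    "a":  [0, 1, 0],
    "h":  [1, 0, 0],
}

def to_multi_hot(tokens):
    # One row per token (so the digraph "ɡʷ" is a single time step),
    # one column per phonological feature.
    return np.array([FEATURE_TABLE[t] for t in tokens])

# to_multi_hot(["ɡʷ", "a", "h"]).shape == (3, 3)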

@stale stale bot added the wontfix This will not be worked on but feel free to help. label Feb 13, 2022
@coqui-ai coqui-ai deleted a comment from stale bot Feb 14, 2022
@stale stale bot removed the wontfix This will not be worked on but feel free to help. label Feb 14, 2022
@erogol
Member

erogol commented Feb 14, 2022

That all makes sense, thanks for explaining. I'll keep the issue open, then.

@stale

stale bot commented Mar 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also have a look at our discussion channels.

@stale stale bot added the wontfix This will not be worked on but feel free to help. label Mar 16, 2022
@stale stale bot closed this as completed Mar 23, 2022