
[Bug] Document proper handling of digraphs #1072

Closed
roedoejet opened this issue Jan 6, 2022 · 8 comments
Labels
bug: Something isn't working
wontfix: This will not be worked on but feel free to help.

Comments

@roedoejet

roedoejet commented Jan 6, 2022

🐛 Description

I am trying to run 🐸TTS on new languages and came across a bug, or at least something I think could be improved in the documentation for new languages. If I have missed something, please point it out to me!

The language I am working with has digraphs in its character set; that is, ɡʷ is a separate character from ɡ. In TTS.tts.utils.text.text_to_sequence, the raw text is transformed into a sequence of character indices, but the function that turns the cleaned text into the sequence (TTS.tts.utils.text._symbols_to_sequence) simply iterates through the string character by character, so when processing ɡʷ, the index for ɡ is returned and ʷ is discarded by _should_keep_symbol.

It appears there is a way to handle Arpabet digraphs using curly braces. Another way of handling this would be to sort the list of characters by length in descending order, tokenize the raw text according to that sorted list, and pass the tokenized list to TTS.tts.utils.text._symbols_to_sequence. This is what I'm currently doing in a custom cleaner, but is this the "correct" or intended way of handling this?

I would also love it if the tensorboard log tracked text, for example by logging a comparison of the raw text with the text reconstructed from the sequence, as shown in the unittest below. That would have saved me from training a few models and only discovering the bug by listening to the audio.
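As a standalone illustration of that longest-match idea (a hypothetical helper sketch, not 🐸TTS code):

def tokenize_longest_match(text, symbols):
    # Try longer symbols first, so "ɡʷ" wins over its prefix "ɡ".
    ordered = sorted(symbols, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for sym in ordered:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            i += 1  # silently skip anything outside the symbol set
    return tokens

# tokenize_longest_match("ɡʷah", ["a", "h", "ɡ", "ɡʷ"]) == ["ɡʷ", "a", "h"]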

To Reproduce

from unittest import TestCase

from TTS.tts.utils.text import text_to_sequence, sequence_to_text
from TTS.tts.utils.text.symbols import make_symbols


class TestMohawkCharacterInputs(TestCase):
    def setUp(self) -> None:
        self.mohawk_characters = {
            "pad": "_",
            "eos": "~",
            "bos": "^",
            # Longest symbols first, so a longest-match tokenizer can match
            # digraphs like "ɡʷ" before their single-character prefixes.
            "characters": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '], key=len, reverse=True),
            "punctuations": "!(),-.;? ",
            "phonemes": sorted(['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'], key=len, reverse=True),
            "unique": True,
        }
        self.custom_symbols = make_symbols(**self.mohawk_characters)[0]
        self.mohawk_test_text = ["ɡʷah"]
        self.cleaners = ["basic_cleaners"]

    def test_text_parity(self):
        # Round trip: text -> sequence -> text should be lossless.
        for utt in self.mohawk_test_text:
            seq = text_to_sequence(utt, cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False)
            text = sequence_to_text(seq, tp=self.mohawk_characters, add_blank=False, custom_symbols=self.custom_symbols)
            self.assertEqual(text, utt)

Expected behavior

With the above unittest, text_to_sequence("ɡʷah", cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False) returns [45, 26, 30] when it should return [24, 26, 30]. It would be nice if digraphs were handled by default, but barring that, how they should be handled ought to be documented somewhere (here?), and ideally tensorboard would run some sanity check on the input text.
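A minimal sketch of the kind of check that could be logged (hypothetical helper, not existing 🐸TTS behaviour; encode/decode stand in for text_to_sequence/sequence_to_text with their arguments already bound):

def check_round_trip(text, encode, decode):
    # A mismatch means symbols were silently dropped or merged during encoding.
    reconstructed = decode(encode(text))
    if reconstructed != text:
        print(f"[!] round-trip mismatch: {text!r} -> {reconstructed!r}")
    return reconstructed == text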

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla V100-SXM2-16GB"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.1+cu102",
        "TTS": "0.5.0",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021"
    }
}

Additional context

Many thanks to the authors for this excellent project!

@erogol
Member

erogol commented Jan 7, 2022

Can you somehow give me an example comparing the sound of g with g"?

I wonder if the difference can be learned by the model based on the context.

Currently, our tokenizer has no intended way to handle digraphs. However, you could check #937 and suggest a solution, since you are already working on the problem.

@roedoejet
Author

Unfortunately, I don't think it would be so easy to learn from context. First, the languages I am working with are low-resource, and second, these digraphs are often phonemic, so minimal pairs exist: for example, ɡi and ɡʷi are both valid words that would be encoded identically by this model but have different pronunciations.
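A toy illustration of that collision (the indices are made up; this mimics the character-by-character iteration in _symbols_to_sequence):

index = {"ɡ": 24, "i": 31}

def naive_encode(text):
    # The unknown "ʷ" is silently dropped, just as _should_keep_symbol drops it.
    return [index[ch] for ch in text if ch in index]

assert naive_encode("ɡi") == naive_encode("ɡʷi") == [24, 31]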

I definitely think that if the model can't handle digraphs yet, it would be a good idea to state that in the documentation section for adding a new language.

Here is the cleaner I wrote that solves the problem, though I haven't fixed the other issue yet (#1075). It's not very DRY to have to repeat the characters from the configuration, but I couldn't see a simple way to pass the config to all the places the cleaner is used. I will have a look at #937 and see whether there is a good way to integrate this functionality.

import re

from nltk.tokenize import RegexpTokenizer

from TTS.tts.utils.text.symbols import make_symbols


def mohawk_cleaners(text):
    mohawk_characters = {
        "pad": "_",
        "eos": "~",
        "bos": "^",
        "characters": ['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ', ' '],
        "punctuations": "!(),-.;? ",
        "phonemes": ['a', 'aː', 'd', 'd͡ʒ', 'e', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰʷ', 'kʷ', 'n', 'o', 'r', 's', 't', 't͡ʃ', 'w', 'àː', 'á', 'áː', 'èː', 'é', 'éː', 'ìː', 'í', 'íː', 'òː', 'ó', 'óː', 'ũ', 'ũ̀ː', 'ṹ', 'ṹː', 'ɡ', 'ɡʷ', 'ʃ', 'ʌ̃', 'ʌ̃ː', 'ʌ̃̀ː', 'ʌ̃́', 'ʌ̃́ː', 'ʔ'],
        "unique": True,
    }
    symbols = make_symbols(**mohawk_characters)[0]
    # Longest symbols first so the regex alternation matches "ɡʷ" before "ɡ";
    # re.escape guards against regex metacharacters such as "(" or "?" that
    # may end up in the pattern.
    pattern = "|".join(re.escape(s) for s in sorted(symbols, key=len, reverse=True))
    tokenizer = RegexpTokenizer(pattern)
    return tokenizer.tokenize(text)
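With the symbol set above, the cleaner should keep digraphs intact instead of dropping the modifier:

mohawk_cleaners("ɡʷah")  # -> ['ɡʷ', 'a', 'h'] rather than ['ɡ', 'a', 'h'] with "ʷ" lost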

@roedoejet
Author

Is this the relevant PR? I can make some suggestions and PR into that branch.

@erogol
Member

erogol commented Jan 10, 2022

Exactly. You can send your PR and start suggesting changes there.

@erogol
Member

erogol commented Jan 14, 2022

The reason that " in g" is ignored is that it is not defined in the default character or phoneme list.

How about appending " to the phoneme list by defining your own custom phoneme list? Would that work?
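For concreteness, a sketch of that suggestion, assuming the intended symbol is ʷ and reusing the dict from the report above:

# Register "ʷ" as its own symbol by appending it to the inventories, so it is
# no longer filtered out by _should_keep_symbol.
mohawk_characters["phonemes"].append("ʷ")
mohawk_characters["characters"].append("ʷ")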

@roedoejet
Author

I'm a bit confused, because you are typing g" (\u0067\u0022) whereas the example I gave is ɡʷ (\u0261\u02B7). Either way, just appending ʷ (\u02B7) to the phoneme list string would solve the problem in a way, but it would encode it as a separate input, which I don't think makes much sense. Theoretically a digraph is still a single time step, so encoding each graph as its own timestep irks me a bit, even if the model can recover given enough data. It's likely to adversely affect my experiments, though, given that most of my datasets are between 30 minutes and 4 hours; every little bit of help they can get counts!

The second reason I don't like that solution is that some of my experiments involve changing the input encoding from one-hot embeddings to multi-hot embeddings based on phonological feature vectors (see Gutkin 2017 or Wells & Richmond 2021). This technique normalizes the input space and makes transfer learning easier for low-resource TTS, and it requires IPA symbols to be tokenized properly (i.e. not character by character) so that each one can be converted into a multi-hot phonological feature vector. I've already implemented this in ming024's FastSpeech2 implementation, and I hope to eventually submit a PR adding a hyperparameter to 🐸TTS for switching between one-hot and multi-hot input encodings. The first step, though, is proper tokenization of a character set whose units are not single characters.
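To illustrate the idea, a toy sketch only; the feature inventory and values below are made up, not a real phonological analysis of Mohawk:

import numpy as np

FEATURES = ["consonantal", "voiced", "labial"]  # hypothetical feature inventory
FEATURE_TABLE = {
    "ɡ":  [1, 1, 0],
    "ɡʷ": [1, 1, 1],  # labialized: one token, one time step
    "a":  [0, 1, 0],
    "h":  [1, 0, 0],
}

def to_multi_hot(tokens):
    # One row per token (so the digraph "ɡʷ" is a single time step),
    # one column per phonological feature.
    return np.array([FEATURE_TABLE[t] for t in tokens])

# to_multi_hot(["ɡʷ", "a", "h"]).shape == (3, 3)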

@stale stale bot added the wontfix This will not be worked on but feel free to help. label Feb 13, 2022
@coqui-ai coqui-ai deleted a comment from stale bot Feb 14, 2022
@stale stale bot removed the wontfix This will not be worked on but feel free to help. label Feb 14, 2022
@erogol
Member

erogol commented Feb 14, 2022

That all makes sense, thanks for explaining. I'll keep the issue open, then.

@stale

stale bot commented Mar 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also have a look at our discussion channels.

@stale stale bot added the wontfix This will not be worked on but feel free to help. label Mar 16, 2022
@stale stale bot closed this as completed Mar 23, 2022