[Bug] Document proper handling of digraphs #1072
Can you somehow give me an example comparing the sound of the two? I wonder if the difference can be learned by the model based on the context. Currently, our tokenizer has no intended way to handle digraphs.
Unfortunately, I don't think it would be so easy to learn from context. For starters, the languages I am working with are low-resource, and second, these digraphs are often phonemic, so minimal pairs exist. I definitely think that if the model can't handle digraphs yet, it would be a good idea to state that in the documentation section for adding a new language.

Here is the cleaner I wrote that solves the problem, but I haven't fixed the other issue yet (#1075). It's not very DRY to have to include the characters from the configuration again, but I couldn't see a simple way to pass the config to all the places the cleaner is used. I will have a look at #937 and see if there would be a good way to integrate this functionality.
Is this the relevant PR? I can make some suggestions and PR into that branch.
Exactly. You can send your PR and start suggesting changes there.
The reason is that […]. How about appending […]?
I'm a bit confused because you are typing […].

The second reason I don't like that solution is that some of my experiments involve changing the method of encoding inputs from one-hot embeddings to multi-hot embeddings based on phonological feature vectors (see Gutkin 2017 or Wells & Richmond 2021). This is a technique for normalizing the input space and making transfer learning easier for low-resource TTS, and in that case IPA symbols must be tokenized properly (i.e. not in a character-by-character way) in order to be converted into multi-hot phonological feature vectors. I've implemented this already in ming024's FastSpeech2 implementation, and am hoping to eventually submit a PR adding a hyperparameter to 🐸TTS to switch between one-hot and multi-hot input encodings. The first step, though, is proper tokenization of a character set that is not limited to single-character units.
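The multi-hot encoding described in the comment above could be sketched roughly as follows. This is a hypothetical illustration: the feature names, the feature sets, and the `multi_hot` helper are all invented for demonstration and are not taken from the cited papers or from any existing implementation.

```python
# Hypothetical multi-hot phonological-feature encoding. Each phoneme maps
# to a binary vector over a fixed feature inventory; crucially, "ɡʷ" must
# be tokenized as one unit to receive its own feature vector.
FEATURES = ["consonant", "voiced", "labialized", "low", "vowel"]

# Illustrative feature assignments (not real phonological analyses).
PHONEME_FEATURES = {
    "ɡ":  {"consonant", "voiced"},
    "ɡʷ": {"consonant", "voiced", "labialized"},
    "a":  {"vowel", "low"},
}

def multi_hot(phoneme):
    # Build the binary vector in the fixed FEATURES order.
    feats = PHONEME_FEATURES[phoneme]
    return [1 if f in feats else 0 for f in FEATURES]

print(multi_hot("ɡʷ"))  # [1, 1, 1, 0, 0]
print(multi_hot("ɡ"))   # [1, 1, 0, 0, 0]
```

If `ɡʷ` were split character-by-character, it could never receive the `labialized` bit that distinguishes it from plain `ɡ`.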
That all makes sense. Thanks for explaining. Then I'll keep the issue open.
🐛 Description
I am trying to run 🐸TTS on new languages and came across a bug - or at least something that I think could be improved in the documentation for new languages, unless I missed something, in which case please point it out to me!
The language I am working with has digraphs in its character set; that is, `ɡʷ` is a separate character from `ɡ`. In `TTS.tts.utils.text.text_to_sequence`, the raw text is transformed into a sequence of character indices, but the function that turns the cleaned text into the sequence (`TTS.tts.utils.text._symbols_to_sequence`) just iterates through the string, so when processing `ɡʷ`, the index for `ɡ` is returned and `ʷ` is discarded by `_should_keep_symbol`. It appears there is a way to handle ARPABET digraphs using curly braces. Another way of handling this could be to reverse-sort the list of characters by length, then tokenize the raw text according to that sorted list and pass the tokenized list to `TTS.tts.utils.text._symbols_to_sequence`
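The greedy longest-match tokenization described above could be sketched as follows. The function name and symbol list are illustrative, not part of the 🐸TTS API:

```python
# Sketch of greedy longest-match tokenization: sort the symbol inventory
# by length (longest first) so multi-character symbols like "ɡʷ" are
# matched before their single-character prefixes like "ɡ".
def tokenize_with_digraphs(text, symbols):
    sorted_symbols = sorted(symbols, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in sorted_symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # Unknown character: skip it (mirrors _should_keep_symbol).
            i += 1
    return tokens

print(tokenize_with_digraphs("ɡʷah", ["a", "h", "ɡ", "ɡʷ"]))
# ['ɡʷ', 'a', 'h']
```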
. This is what I'm currently doing in a custom cleaner, but is this the "correct" or intended way of handling it? I would also love it if the TensorBoard log tracked text, for example by logging a comparison of the raw text with the text reconstructed from the sequence, as shown in the unittest below. That would have saved me training a few models and only figuring out the bug by listening to the audio.

To Reproduce
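The original unittest is not preserved on this page. A minimal, self-contained sketch of the kind of round-trip sanity check described above might look like the following; the function names and symbol list are stand-ins, not the real 🐸TTS API:

```python
# Round-trip sanity check: encode text to symbol indices, decode back,
# and compare with the input. A per-character encoder (like the current
# _symbols_to_sequence behaviour) fails the check on digraph input.
symbols = ["a", "h", "ɡ", "ɡʷ"]  # "ɡʷ" is a distinct symbol

def encode_per_char(text):
    # Iterate one character at a time, dropping anything not in the
    # inventory -- this is where "ʷ" is silently lost.
    return [symbols.index(ch) for ch in text if ch in symbols]

def decode(sequence):
    return "".join(symbols[i] for i in sequence)

text = "ɡʷah"
reconstructed = decode(encode_per_char(text))
print(reconstructed)          # "ɡah" -- the "ʷ" was dropped
print(reconstructed == text)  # False
```

Logging exactly this comparison (raw text vs. text reconstructed from the sequence) would surface the bug at training time instead of at listening time.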
Expected behavior
With the above unittest, `text_to_sequence("ɡʷah", cleaner_names=self.cleaners, custom_symbols=self.custom_symbols, tp=self.mohawk_characters['characters'], add_blank=False)` returns `[45, 26, 30]` when it should return `[24, 26, 30]`. It would be nice if digraphs were handled by default, but barring that, it should be documented somewhere (here?) how they should be handled, and ideally TensorBoard would perform some input-text sanity check.

Environment
Additional context
Many thanks to the authors for this excellent project!