[Bug] BCELoss should not be masked #1192
Nice catch, and thanks for sharing the code to reproduce; it made it really easy to test your arguments. Masking really does seem to break the loss computation. If you want, you can send a ✨PR✨. We could also add the cases above to the tests. Changing `stop_tokens` to something else is not something I'd like to do, even though I agree with you, since it is referred to in the docs and discussions.
I have rerun training on LJSpeech without masking, with … BUT importantly, it solved one major problem. Every sample output I have checked using the masked loss suffers from repetition (a well-known problem in NLP). Without masking, the model instead produces trailing silence, which can later be trimmed out, so it is OK; it just makes decoding slower.
And for how many steps did you train the model, and with what config? Maybe longer training would make it better.
I used the same config as …
r=1 is really hard to train with Tacotron2, especially if the dataset has samples with long silences. Also, you might want to try activating dropout during inference; I remember it improved my inference a lot when using DCA.
@iamanigeeit thanks for sharing your updates. Do you intend to send a ✨PR✨ that fixes the BCELoss? We could also handle it, but I guess you deserve the attribution for this, which is why it'd be nice if you could send the PR.
Finding the right dB value for trimming the silence would help. I agree that training with r=1 is a hard task when the dataset is not processed well for TTS.
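As a hedged illustration of where that trimming threshold typically lives in a 🐸 TTS audio config; the class and field names below are my assumption based on common configs of that era, so verify them against your installed version:

```python
from TTS.config.shared_configs import BaseAudioConfig

# Trim leading/trailing silence below this dB threshold during preprocessing.
# A threshold that is too lenient leaves long silences, which makes r=1 training harder.
audio_config = BaseAudioConfig(do_trim_silence=True, trim_db=45)
```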
Haha, I don't need the attribution... but I need help using Git, as I haven't used it on multi-person projects before. I now have a version cloned from the 33aa27 commit with all my commits pushed to my own repo, including commits on other files. How do I fix this? Here are the files to update: updates.zip
Follow these steps:
1. Fork the 🐸 TTS repository (use the "Fork" button at the top of the page).
2. Clone your fork (dev branch); the commands for this and the later steps are sketched below.
3. Change the files that you need.
4. Commit the changes (note: change the commit message :)).
5. Push the commits to your fork.
6. Go to your fork (https://github.com/iamanigeeit/TTS). GitHub will see that you've made changes and suggest a pull request; a pull request button will appear below the "Go to file", "Add file" and "Code" buttons. Click it and send a pull request from your dev branch to Coqui's dev branch :).
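A rough sketch of those commands (the fork URL is taken from this thread; the file path and commit message are just examples, not the actual changes):

```bash
# 2) Clone your fork and switch to the dev branch
git clone https://github.com/iamanigeeit/TTS.git
cd TTS
git checkout dev

# 3) Edit the files you need, then 4) stage and commit them (use your own message)
git add TTS/tts/layers/losses.py   # example path; add whatever you changed
git commit -m "Fix BCELoss masking"

# 5) Push the commits to your fork
git push origin dev
```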
@erogol I think losses should not be masked at all. From the original Tacotron paper:
This is what I've found as well.
@Edresson thanks for the suggestion. I've made a lot of other changes in prior commits to my own repo, so I will probably create a new fork and update from there...
I've tried no masking, but masking worked better in all of my experiments, although it was buggy.
How did you manage to solve the non-terminating outputs problem?
I've not observed this problem with my models after training long enough.
Thanks, I was thinking 100k steps at batch size 32 would be enough, but it seems not... you haven't encountered inference frequently hitting `max_decoder_steps`?
You can change `max_decoder_steps` in the model config.
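For reference, a hedged sketch of raising that limit in a Tacotron2 training config; the class and field names are my assumption based on this thread, so verify them against your installed version:

```python
from TTS.tts.configs.tacotron2_config import Tacotron2Config

# Raise the cap on autoregressive decoder steps so long inputs are not cut off.
# This only hides the symptom: it does not teach the model to predict "stop".
config = Tacotron2Config(max_decoder_steps=10000)
```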
I set `max_decoder_steps` to a big number to test whether the model stops. I think it is a problem when the model doesn't stop generating, although limiting the decoder steps is a workaround.
Sorry I misunderstood your question.
Yes, I've not observed the model hitting `max_decoder_steps` after enough training.
@Edresson can you look at this when you have time?
(A cross-referenced merge changelog was posted here; the entries relevant to this issue are "Fix BCELoss adressing #1192" and "Fix BCE loss issue (#1872)".)
I have trained Tacotron2, but during eval / inference it often doesn't know when to stop decoding. This is a known issue in seq2seq models, and I was trying to solve it in TensorFlowTTS when I gave up due to TensorFlow problems.
Training with `enable_bos_eos=True` helps a bit, but the output is still 3x the ground-truth mel length for shorter audio: see length_data_eos.csv vs length_data_no_eos.csv. One reason is the `BCELossMasked` criterion: in its current form, it encourages the model never to stop decoding once it has passed `mel_length`. Some of the loss results don't quite make sense, as seen below.

For now I am passing `length=None` to avoid the mask and setting `pos_weight=0.2` to experiment. Will update the training results.
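To make the masking argument concrete, here is a minimal toy sketch (my own example, not the library's `BCELossMasked` code and not the loss values referred to above) comparing masked and unmasked BCE for a decoder that never predicts "stop":

```python
import torch
from torch.nn.functional import binary_cross_entropy_with_logits

T_max, mel_length = 10, 6
logits = torch.full((1, T_max), -4.0)       # the model says "keep decoding" at every frame
targets = torch.zeros(1, T_max)
targets[0, mel_length - 1:] = 1.0           # "stop" from the last real frame onward

# Unmasked: the padded frames past mel_length still penalize the "never stop" prediction.
loss_unmasked = binary_cross_entropy_with_logits(logits, targets)

# Masked: frames past mel_length are zeroed out, so only one frame asks the model to stop.
mask = (torch.arange(T_max).unsqueeze(0) < mel_length).float()
per_frame = binary_cross_entropy_with_logits(logits, targets, reduction="none")
loss_masked = (per_frame * mask).sum() / mask.sum()

print(loss_unmasked.item(), loss_masked.item())  # the unmasked loss is roughly 3x the masked one
```

With the mask, a model that never stops pays a much smaller penalty, which matches the non-terminating outputs described above.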
**Additional context**

I would also propose renaming `stop_tokens` to either `stop_probs` or `stop_logits`, depending on context. Currently, `inference()` produces `stop_tokens` that represent stop probabilities, while `forward()` produces the logits before the sigmoid. Confusingly, both are called `stop_tokens`.