
Glow TTS Avg Loss Not Decreasing - Spanish LJSpeech Dataset #1750

Closed
iprovalo opened this issue Jul 18, 2022 · 17 comments
Labels
bug (Something isn't working) · wontfix (This will not be worked on but feel free to help.)

Comments

@iprovalo
Contributor

iprovalo commented Jul 18, 2022

Describe the bug

When I train Glow TTS on the Spanish LJSpeech-format datasets (angelina or victor) from AI Labs, the avg loss stays constant.

Victor (4k steps, large batch size, tried 32, 64, 128):
trainer_0_log.txt
config.txt

Angelina (229K steps, batch size 32):
trainer_0_log (1).txt
config.txt

To Reproduce

Train the model for up to 230K steps; avg_loss does not change.
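
For context, here is a minimal sketch of a GlowTTS training script for an LJSpeech-formatted dataset, adapted from the LJSpeech GlowTTS recipe shipped with TTS around 0.7.x. The dataset path, output path, batch size, and Spanish phoneme settings below are placeholder assumptions, not the exact config attached above:

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "glow_tts_es"  # assumed output directory

# Assumed location of the LJSpeech-formatted Spanish data (metadata.csv + wavs/).
dataset_config = BaseDatasetConfig(
    name="ljspeech", meta_file_train="metadata.csv", path="/data/es_victor"
)

config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    run_eval=True,
    epochs=1000,
    use_phonemes=True,
    phoneme_language="es",  # assumed espeak language code for Spanish
    text_cleaner="phoneme_cleaners",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    mixed_precision=False,
    print_step=25,
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = GlowTTS(config, ap, tokenizer, speaker_manager=None)
trainer = Trainer(
    TrainerArgs(), config, output_path,
    model=model, train_samples=train_samples, eval_samples=eval_samples,
)
trainer.fit()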

Expected behavior

No response

Logs

Loss changes.

Environment

{
    "CUDA": {
        "GPU": [
            "A100-SXM4-40GB"
        ],
        "available": true,
        "version": "11.5"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu115",
        "TTS": "0.7.1",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.7.13",
        "version": "#77~18.04.1-Ubuntu SMP Thu Apr 7 21:38:47 UTC 2022"
    }
}

Additional context

No response

iprovalo added the bug label Jul 18, 2022
@iprovalo
Contributor Author

Setting mixed_precision=True seems to have a positive effect - the loss changes from epoch 0 to epoch 1, but after epoch 2 the training crashes with the error described in #1683:

File "/apps/tts/TTS/TTS/tts/layers/losses.py", line 494, in forward 
RuntimeError:  [!] NaN loss with loss.

glow_tts_es_9.log

@iprovalo
Contributor Author

iprovalo commented Jul 19, 2022

Added some logging to the exception to print the tensor with NaN value (mixed_precision=True):
loss:tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
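
For anyone reproducing this, a sketch of that kind of logging - a hypothetical helper, not the library's code or the exact change made here - that reports tensors which already contain NaN/Inf before the trainer raises the "[!] NaN loss" error:

import torch

def report_non_finite(name, tensor):
    # Hypothetical debug helper: print a tensor if it contains NaN or Inf values.
    if tensor is not None and not torch.isfinite(tensor).all():
        print(f"{name} contains NaN/Inf: {tensor}")

# Example usage inside the loss forward(), just before the total loss is returned:
# report_non_finite("loss", total_loss)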

@lexkoro
Collaborator

lexkoro commented Jul 19, 2022

| > current_lr: 0.00000

@thorstenMueller reported a similar problem regarding the learning rate recently on Matrix with Tacotron2 DDC.

Maybe there was a recent change to the coqui-ai Trainer which impacts the learning rate?

@thorstenMueller
Contributor

Thanks for tagging me, @lexkoro. In my DDC training the learning rate started way too low and was printed as 0.00000 in the command line output. So maybe, if @erogol is online and finds some time, he can check whether this problem stems from a recent Trainer code change.

DDC-lr-too-low

@iprovalo
Contributor Author

@lexkoro thank you!

@iprovalo
Contributor Author

iprovalo commented Jul 19, 2022

When I compare the starting learning rates of a healthy (green) and an unhealthy (orange) model, they are the same; I think the log just prints a truncated value, which is a formatting issue. What concerns me is the avg loss not changing for the problematic model:

Screen Shot 2022-07-19 at 12 59 04 PM

Screen Shot 2022-07-19 at 12 58 50 PM

BTW, @thorstenMueller I was able to get rid of the warning you mentioned on the chat channel:

glow_tts.py:517: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').

by replacing the glow_tts.py preprocess() line:

y_lengths = (y_lengths // self.num_squeeze) * self.num_squeeze

with:

y_lengths = torch.div(y_lengths, self.num_squeeze, rounding_mode='floor') * self.num_squeeze

PR for the warning fix.
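
For completeness, a tiny sanity check (with made-up lengths) that the replacement matches the old floor-division result for the non-negative lengths used here:

import torch

y_lengths = torch.tensor([100, 257, 64])  # made-up frame lengths
num_squeeze = 2

old = (y_lengths // num_squeeze) * num_squeeze                                 # triggers the deprecation warning
new = torch.div(y_lengths, num_squeeze, rounding_mode="floor") * num_squeeze   # warning-free equivalent
assert torch.equal(old, new)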

@erogol
Member

erogol commented Jul 20, 2022

Is the issue "not decreasing" or "getting NaN"? I'm confused.

@iprovalo
Contributor Author

iprovalo commented Jul 20, 2022

Is the issue "not decreasing" or "getting NaN"? I'm confused.

I am seeing both issues:

  1. not decreasing avg_loss with mixed_precision=False
  2. NaN loss with mixed_precision=True.

I encountered the first issue and started changing the parameters that affect the loss calculation; that is how I ran into the second issue. I believe it is related to #1683 when mixed_precision=True.

For this LJSpeech-format Spanish dataset, I debugged the NaN exception a bit more and found that both z and log_det are NaN, which causes the failure in the forward() function below:

class GlowTTSLoss(torch.nn.Module):
    ...
    def forward(self, z, means, scales, log_det, y_lengths, o_dur_log, o_attn_dur, x_lengths):
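
A hypothetical debug hook (not part of the library) for tracing which of these inputs already carries NaNs when the loss blows up:

import torch

def find_nan_inputs(**tensors):
    # Hypothetical helper: return the names of loss inputs that already contain NaNs,
    # so the failure can be traced upstream to the flow outputs (z, log_det) rather than the loss itself.
    return [name for name, t in tensors.items() if t is not None and torch.isnan(t).any()]

# e.g. at the top of GlowTTSLoss.forward():
# print(find_nan_inputs(z=z, means=means, scales=scales, log_det=log_det))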

@thorstenMueller
Contributor

BTW, @thorstenMueller I was able to get rid of the warning you mentioned on the chat channel:

Thanks for your analysis and fix PR. 👍

@thorstenMueller
Contributor

@thorstenMueller reported a similar problem regarding the learning rate recently on matrix with tacotron2 ddc.

Just to post the solution to my lr problem. @erogol helped me on that (thx!). According to config.json, the first 4,000 steps are warmup and the lr should adjust automatically after these steps. But the config value for scheduler_after_epoch was true instead of false, so the warmup phase would have taken 4,000 epochs instead of 4,000 steps. After setting the value to false and restarting training, everything works as expected.
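
For intuition, a minimal sketch of a Noam-style warmup schedule (an assumption about how the NoamLR scheduler behaves, not the exact Trainer code). It shows why stepping the scheduler per epoch instead of per step keeps the printed lr at 0.00000:

def noam_lr(step, base_lr=1e-3, warmup_steps=4000):
    # Noam warmup: lr grows roughly as base_lr * step / warmup_steps, then decays as step**-0.5.
    step = max(step, 1)
    return base_lr * warmup_steps**0.5 * min(step * warmup_steps**-1.5, step**-0.5)

print(noam_lr(2))     # ~5e-7: what you get if the scheduler has only advanced twice (e.g. once per epoch)
print(noam_lr(4000))  # ~1e-3: the nominal lr once warmup has actually completed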

@iprovalo
Contributor Author

Thank you for posting this, @thorstenMueller!

@erogol would you have a moment to take a look at the loss staying constant? I suspect it could be a config mistake on my part related to the character set or phonemes. I don't think it's a dataset issue, since I believe the released Spanish Tacotron model was trained on the same data.

@iprovalo
Contributor Author

I noticed that the training loss is increasing, so the best model is never updated, which keeps the eval loss constant:

Screen Shot 2022-07-24 at 6 50 59 AM

Thinking of @thorstenMueller's comments, I looked into the same property, scheduler_after_epoch: it is set to True in the Trainer code and is not exposed in the model configs (meaning I cannot configure it). As a workaround I tried lr_scheduler_params={"warmup_steps": 1} instead (sketched below).
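
For reference, a sketch of how that workaround can be passed, assuming GlowTTSConfig exposes lr_scheduler / lr_scheduler_params in this TTS version:

from TTS.tts.configs.glow_tts_config import GlowTTSConfig

# Workaround sketch: shrink the NoamLR warmup so the lr reaches its nominal value
# almost immediately, sidestepping the per-epoch scheduler stepping.
config = GlowTTSConfig(
    lr_scheduler="NoamLR",                    # assumed default scheduler for GlowTTS
    lr_scheduler_params={"warmup_steps": 1},  # the workaround described above
)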

Here is what I have after making that change; it looks like the lr is being adjusted, but the model is still not learning:
Screen Shot 2022-07-24 at 12 41 01 PM

@erogol this could be a dataset issue; is there any other way to confirm it?

@iprovalo
Contributor Author

I think it is a dataset issue. Here is a run 500 steps into training an Argentinian Spanish female voice (3,921 samples) with the same model and parameters (lr_scheduler_params={"warmup_steps": 1} removed):

Screen Shot 2022-07-24 at 5 09 19 PM

@erogol
Member

erogol commented Jul 26, 2022

@iprovalo I've tried the LJSpeech recipe with GlowTTS and could not replicate the "constant loss" issue. It might be about the dataset.

@stale

stale bot commented Aug 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

stale bot added the wontfix label Aug 31, 2022
@iprovalo
Contributor Author

@iprovalo I've tried the LJSpeech recipe with GlowTTS and could not replicate the "constant loss" issue. It might be about the dataset.

@erogol are you using these public datasets, and do you preprocess/clean them? I tried a few datasets in Spanish and I am still getting issues. Any advice is appreciated!

stale bot closed this as completed Sep 7, 2022
@erogol
Member

erogol commented Sep 8, 2022

@iprovalo I've tried the LJSpeech GlowTTS recipe
