
Glow TTS Avg Loss Not Decreasing - Spanish LJSpeech Dataset #1750

Closed
iprovalo opened this issue Jul 18, 2022 · 17 comments
Labels
bug (Something isn't working) · wontfix (This will not be worked on but feel free to help.)

Comments

@iprovalo
Contributor

iprovalo commented Jul 18, 2022

Describe the bug

When I train Glow TTS on the Spanish LJSpeech-format datasets (angelina or victor) from AI Labs, the avg loss stays constant.

Victor (4k steps, large batch size, tried 32, 64, 128):
trainer_0_log.txt
config.txt

Angelina (229K steps, batch size 32):
trainer_0_log (1).txt
config.txt

To Reproduce

Train the model for up to 230K steps; avg_loss does not change.
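
For context, here is a minimal sketch of a GlowTTS training script for an LJSpeech-formatted dataset, adapted from the LJSpeech GlowTTS recipe shipped with TTS around 0.7.x. The dataset path, output path, batch size, and Spanish phoneme settings below are placeholder assumptions, not the exact config attached above:

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "glow_tts_es"  # assumed output directory

# Assumed location of the LJSpeech-formatted Spanish data (metadata.csv + wavs/).
dataset_config = BaseDatasetConfig(
    name="ljspeech", meta_file_train="metadata.csv", path="/data/es_victor"
)

config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    run_eval=True,
    epochs=1000,
    use_phonemes=True,
    phoneme_language="es",  # assumed espeak language code for Spanish
    text_cleaner="phoneme_cleaners",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    mixed_precision=False,
    print_step=25,
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = GlowTTS(config, ap, tokenizer, speaker_manager=None)
trainer = Trainer(
    TrainerArgs(), config, output_path,
    model=model, train_samples=train_samples, eval_samples=eval_samples,
)
trainer.fit()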

Expected behavior

No response

Logs

Loss changes.

Environment

{
    "CUDA": {
        "GPU": [
            "A100-SXM4-40GB"
        ],
        "available": true,
        "version": "11.5"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu115",
        "TTS": "0.7.1",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.7.13",
        "version": "#77~18.04.1-Ubuntu SMP Thu Apr 7 21:38:47 UTC 2022"
    }
}

Additional context

No response

iprovalo added the bug label Jul 18, 2022
@iprovalo
Contributor Author

Setting mixed_precision=True seems to have a positive effect - the loss changes from epoch 0 to epoch 1, but after epoch 2 the training crashes with the error described in #1683:

File "/apps/tts/TTS/TTS/tts/layers/losses.py", line 494, in forward 
RuntimeError:  [!] NaN loss with loss.

glow_tts_es_9.log

@iprovalo
Contributor Author

iprovalo commented Jul 19, 2022

Added some logging to the exception to print the tensor with NaN value (mixed_precision=True):
loss:tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
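
For anyone reproducing this, a sketch of that kind of logging - a hypothetical helper, not the library's code or the exact change made here - that reports tensors which already contain NaN/Inf before the trainer raises the "[!] NaN loss" error:

import torch

def report_non_finite(name, tensor):
    # Hypothetical debug helper: print a tensor if it contains NaN or Inf values.
    if tensor is not None and not torch.isfinite(tensor).all():
        print(f"{name} contains NaN/Inf: {tensor}")

# Example usage inside the loss forward(), just before the total loss is returned:
# report_non_finite("loss", total_loss)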

@lexkoro
Collaborator

lexkoro commented Jul 19, 2022

| > current_lr: 0.00000

@thorstenMueller reported a similar problem regarding the learning rate recently on Matrix with Tacotron2 DDC.

Maybe there was a recent change to the coqui-ai Trainer which impacts the learning rate?

@thorstenMueller
Contributor

Thanks for tagging me, @lexkoro. In my DDC training the learning rate started way too low and was printed as 0.00000 in the command line output. So maybe, if @erogol is online and finds some time, he can check whether this problem stems from a recent Trainer code change.

DDC-lr-too-low

@iprovalo
Contributor Author

@lexkoro thank you!

@iprovalo
Contributor Author

iprovalo commented Jul 19, 2022

When I compare the starting learning rates of a healthy (green) and an unhealthy (orange) model, they are the same; I think the log just prints a truncated value, which is a formatting issue. What concerns me is the avg loss not changing for the problematic model:

Screen Shot 2022-07-19 at 12 59 04 PM

Screen Shot 2022-07-19 at 12 58 50 PM

BTW, @thorstenMueller I was able to get rid of the warning you mentioned on the chat channel:

glow_tts.py:517: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').

by replacing the glow_tts.py preprocess() line:

y_lengths = (y_lengths // self.num_squeeze) * self.num_squeeze

with:

y_lengths = torch.div(y_lengths, self.num_squeeze, rounding_mode='floor') * self.num_squeeze

PR for the warning fix.
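
For completeness, a tiny sanity check (with made-up lengths) that the replacement matches the old floor-division result for the non-negative lengths used here:

import torch

y_lengths = torch.tensor([100, 257, 64])  # made-up frame lengths
num_squeeze = 2

old = (y_lengths // num_squeeze) * num_squeeze                                 # triggers the deprecation warning
new = torch.div(y_lengths, num_squeeze, rounding_mode="floor") * num_squeeze   # warning-free equivalent
assert torch.equal(old, new)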

@erogol
Member

erogol commented Jul 20, 2022

Is the issue "not decreasing" or "getting NaN"? I'm confused.

@iprovalo
Contributor Author

iprovalo commented Jul 20, 2022

Is the issue "not decreasing" or "getting NaN"? I'm confused.

I am seeing both issues:

  1. not decreasing avg_loss with mixed_precision=False
  2. NaN loss with mixed_precision=True.

I encountered the first issue and started changing the parameters that affect the loss calculation; that is how I ran into the second issue. I believe it is related to #1683 when mixed_precision=True.

For this LJSpeech-format Spanish dataset, I debugged the NaN exception a bit more and found that both z and log_det are NaN, which causes the failure in the forward() function below:

class GlowTTSLoss(torch.nn.Module):
    ...
    def forward(self, z, means, scales, log_det, y_lengths, o_dur_log, o_attn_dur, x_lengths):
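
A hypothetical debug hook (not part of the library) for tracing which of these inputs already carries NaNs when the loss blows up:

import torch

def find_nan_inputs(**tensors):
    # Hypothetical helper: return the names of loss inputs that already contain NaNs,
    # so the failure can be traced upstream to the flow outputs (z, log_det) rather than the loss itself.
    return [name for name, t in tensors.items() if t is not None and torch.isnan(t).any()]

# e.g. at the top of GlowTTSLoss.forward():
# print(find_nan_inputs(z=z, means=means, scales=scales, log_det=log_det))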

@thorstenMueller
Contributor

BTW, @thorstenMueller I was able to get rid of the warning you mentioned on the chat channel:

Thanks for your analysis and fix PR. 👍

@thorstenMueller
Contributor

@thorstenMueller reported a similar problem regarding the learning rate recently on matrix with tacotron2 ddc.

Just to post the solution to my lr problem. @erogol helped me on that (thx!). According to config.json, the first 4,000 steps are warmup and the lr should adjust automatically after these steps. But the config value for scheduler_after_epoch was true instead of false, so the warmup phase would have taken 4,000 epochs instead of 4,000 steps. After setting the value to false and restarting training, everything works as expected.
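
For intuition, a minimal sketch of a Noam-style warmup schedule (an assumption about how the NoamLR scheduler behaves, not the exact Trainer code). It shows why stepping the scheduler per epoch instead of per step keeps the printed lr at 0.00000:

def noam_lr(step, base_lr=1e-3, warmup_steps=4000):
    # Noam warmup: lr grows roughly as base_lr * step / warmup_steps, then decays as step**-0.5.
    step = max(step, 1)
    return base_lr * warmup_steps**0.5 * min(step * warmup_steps**-1.5, step**-0.5)

print(noam_lr(2))     # ~5e-7: what you get if the scheduler has only advanced twice (e.g. once per epoch)
print(noam_lr(4000))  # ~1e-3: the nominal lr once warmup has actually completed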

@iprovalo
Contributor Author

Thank you for posting this, @thorstenMueller!

@erogol would you have a moment to take a look at the loss staying constant? I suspect it could be a config mistake on my part related to the character set or phonemes. I don't think it's a dataset issue, since I believe the released Spanish Tacotron model was trained on the same data.

@iprovalo
Contributor Author

I noticed that the training loss is increasing, so the best model is never updated, which keeps the eval loss constant:

Screen Shot 2022-07-24 at 6 50 59 AM

Thinking of @thorstenMueller's comments, I looked into the same property, scheduler_after_epoch: it is set to True in the Trainer code and is not exposed in the model configs (meaning I cannot configure it). As a workaround I tried lr_scheduler_params={"warmup_steps": 1} instead (sketched below).
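
For reference, a sketch of how that workaround can be passed, assuming GlowTTSConfig exposes lr_scheduler / lr_scheduler_params in this TTS version:

from TTS.tts.configs.glow_tts_config import GlowTTSConfig

# Workaround sketch: shrink the NoamLR warmup so the lr reaches its nominal value
# almost immediately, sidestepping the per-epoch scheduler stepping.
config = GlowTTSConfig(
    lr_scheduler="NoamLR",                    # assumed default scheduler for GlowTTS
    lr_scheduler_params={"warmup_steps": 1},  # the workaround described above
)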

Here is what I have after making that change; it looks like the lr is being adjusted, but the model is still not learning:
Screen Shot 2022-07-24 at 12 41 01 PM

@erogol this could be a dataset issue; is there any other way to confirm it?

@iprovalo
Contributor Author

I think it is a dataset issue. Here is a run 500 steps into training an Argentinian Spanish female voice (3,921 samples) with the same model and parameters (lr_scheduler_params={"warmup_steps": 1} removed):

Screen Shot 2022-07-24 at 5 09 19 PM

@erogol
Member

erogol commented Jul 26, 2022

@iprovalo I've tried the LJSpeech recipe with GlowTTS and could not replicate the "constant loss" issue. It might be about the dataset.

@stale

stale bot commented Aug 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

stale bot added the wontfix label Aug 31, 2022
@iprovalo
Contributor Author

@iprovalo I've tried the LJSpeech recipe with GlowTTS and could not replicate the "constant loss" issue. It might be about the dataset.

@erogol are you using these public datasets, and do you preprocess/clean them? I tried a few datasets in Spanish and I am still getting issues. Any advice is appreciated!

stale bot closed this as completed Sep 7, 2022
@erogol
Member

erogol commented Sep 8, 2022

@iprovalo I've tried the LJSpeech GlowTTS recipe
