The trained models seem overfitted to their training sets? #7

Closed
jpc opened this issue Jun 26, 2023 · 3 comments
Labels: question (Further information is requested)


jpc commented Jun 26, 2023

I compressed a few examples with the 24 kHz libritts_v1 model and they sounded great at very low bitrates, but when I downsampled some VCTK audio to 24 kHz and pushed it through the same model, the quality suffered a lot. I've seen the same problem when testing on some clean speech extracted from a YouTube video.

Since LibriTTS and VCTK are pretty small datasets, is it possible that the pretrained models are a bit too overfitted to them?
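For reference, this is roughly what I did (a minimal sketch; the file path and the `encode_decode` call are placeholders, only the torchaudio resampling is real code):

```python
import torchaudio

# Load a VCTK utterance (VCTK is recorded at 48 kHz) and downsample it to
# 24 kHz, the rate the libritts_v1 model was trained on.
wav, sr = torchaudio.load("vctk/p225_001.wav")  # hypothetical path
wav_24k = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=24000)

# `encode_decode` is a placeholder for running the pretrained 24 kHz
# libritts_v1 encoder + decoder; on the resampled VCTK (or YouTube) audio
# the reconstruction quality drops noticeably compared to LibriTTS input.
# reconstructed = encode_decode(wav_24k, sample_rate=24000)
```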

bigpon added the question (Further information is requested) label on Jun 26, 2023
bigpon (Contributor) commented Jun 26, 2023

Hi,
we have found the same issue: the models are vulnerable to unseen data, especially data from different corpora recorded with different microphones in different environments.

We think it is probably not an over-fitting issue, because the models actually work well for unseen speakers from the same datasets, and the LibriTTS corpus might be large enough for common speech synthesis tasks.
We believe the degradation is caused by the data-driven nature of neural networks.

Therefore, a possible workaround is to train the model with as much and as diverse data as possible, or to train it directly on data recorded in the same conditions as the data you will use for encoding/decoding.

jpc (Author) commented Jun 27, 2023

Do you think this problem comes from the encoder or the decoder?

bigpon (Contributor) commented Jun 28, 2023

I think it comes from both.

[Decoder]
Many previous neural vocoder works, which take a mel-spectrogram as input, report that if the input mel-spectrogram is distorted (e.g., predicted by another voice conversion network), the vocoder's output quality degrades significantly.
However, they also show that fine-tuning the vocoder eases the quality degradation, so updating the AudioDec decoder/vocoder might be helpful.
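For example (a minimal sketch in generic PyTorch, not the actual AudioDec training recipe; `model`, `new_data`, and `reconstruction_loss` are placeholders), decoder-only fine-tuning would freeze the encoder and update only the decoder:

```python
import torch

# Freeze the encoder so only the decoder/vocoder is updated.
for p in model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(model.decoder.parameters(), lr=1e-4)

for wav in new_data:                        # audio matching your target domain
    recon = model(wav)                      # encode -> quantize -> decode
    loss = reconstruction_loss(recon, wav)  # e.g. a multi-resolution STFT loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```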

[Encoder]
If the input is unseen data, the encoder might not be able to extract a good representation of it, and the errors will propagate to the decoder, resulting in quality degradation.

Therefore, I think fine-tuning the whole model on the new data might achieve the best performance. However, we have shown that fine-tuning only the encoder can still do denoising well, so fine-tuning the encoder might be more important than fine-tuning the decoder.
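The encoder-only variant just flips the freezing in the sketch above, e.g.:

```python
# Freeze the decoder and update only the encoder (same placeholders as above).
for p in model.decoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.encoder.parameters(), lr=1e-4)
# Training loop as above; for denoising, the loss compares the reconstruction
# of noisy input against the corresponding clean target.
```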
