The trained models seem overfitted to their training sets? #7

Closed
jpc opened this issue Jun 26, 2023 · 3 comments
Labels: question (Further information is requested)


jpc commented Jun 26, 2023

I compressed a few examples with the 24 kHz libritts_v1 model and they sounded great at very low bitrates, but when I downsampled some VCTK audio to 24 kHz and pushed it through the same model, the quality suffered a lot. I've seen the same problem when testing on some clean speech extracted from a YouTube video.

Since LibriTTS and VCTK are pretty small datasets, is it possible that the pretrained models are a bit too overfitted to them?
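For reference, this is roughly what I did (a minimal sketch; the file path and the `encode_decode` call are placeholders, only the torchaudio resampling is real code):

```python
import torchaudio

# Load a VCTK utterance (VCTK is recorded at 48 kHz) and downsample it to
# 24 kHz, the rate the libritts_v1 model was trained on.
wav, sr = torchaudio.load("vctk/p225_001.wav")  # hypothetical path
wav_24k = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=24000)

# `encode_decode` is a placeholder for running the pretrained 24 kHz
# libritts_v1 encoder + decoder; on the resampled VCTK (or YouTube) audio
# the reconstruction quality drops noticeably compared to LibriTTS input.
# reconstructed = encode_decode(wav_24k, sample_rate=24000)
```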

bigpon added the question (Further information is requested) label on Jun 26, 2023
bigpon (Contributor) commented Jun 26, 2023

Hi,
we have found the same issue: the models are vulnerable to unseen data, especially data from different corpora recorded with different microphones in different environments.

We think it is probably not an over-fitting issue, because the models actually work well for unseen speakers from the same datasets, and the LibriTTS corpus might be large enough for common speech synthesis tasks.
We believe the degradation is caused by the data-driven nature of neural networks.

Therefore, a possible workaround is to train the model with as much and as diverse data as possible, or to train it directly on data recorded in the same conditions as the data you will use for encoding/decoding.

jpc (Author) commented Jun 27, 2023

Do you think this problem comes from the encoder or the decoder?

bigpon (Contributor) commented Jun 28, 2023

I think it comes from both.

[Decoder]
Many previous neural vocoder works, which take a mel-spectrogram as input, report that if the input mel-spectrogram is distorted (e.g., predicted by another voice conversion network), the vocoder's output quality degrades significantly.
However, they also show that fine-tuning the vocoder eases the quality degradation, so updating the AudioDec decoder/vocoder might be helpful.
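For example (a minimal sketch in generic PyTorch, not the actual AudioDec training recipe; `model`, `new_data`, and `reconstruction_loss` are placeholders), decoder-only fine-tuning would freeze the encoder and update only the decoder:

```python
import torch

# Freeze the encoder so only the decoder/vocoder is updated.
for p in model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(model.decoder.parameters(), lr=1e-4)

for wav in new_data:                        # audio matching your target domain
    recon = model(wav)                      # encode -> quantize -> decode
    loss = reconstruction_loss(recon, wav)  # e.g. a multi-resolution STFT loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```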

[Encoder]
If the input is unseen data, the encoder might not be able to extract a good representation of it, and the errors will propagate to the decoder, resulting in quality degradation.

Therefore, I think fine-tuning the whole model on the new data might achieve the best performance. However, we have shown that fine-tuning only the encoder can still do denoising well, so fine-tuning the encoder might be more important than fine-tuning the decoder.
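The encoder-only variant just flips the freezing in the sketch above, e.g.:

```python
# Freeze the decoder and update only the encoder (same placeholders as above).
for p in model.decoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.encoder.parameters(), lr=1e-4)
# Training loop as above; for denoising, the loss compares the reconstruction
# of noisy input against the corresponding clean target.
```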
