Hey, thanks! The simple answer is that you can use the conversion code here to generate the MELs I trained my VQVAE on. Use the mel_norms file from Tortoise. The more descriptive answer is that it doesn't matter: a VQVAE will learn unnormalized or normalized MELs equally well. ("Well" is a pretty ambitious term here. It does a horrible job; it simply learns to cluster MEL frames. That's where the diffusion model comes in.) You might also refer to this issue for more details about how the VQVAE is trained: neonbjb/DL-Art-School#7
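For illustration, here is a minimal sketch of that conversion. The exact parameter values (sample rate, FFT size, hop length, number of MEL channels) and the mel_norms file should come from the conversion code linked above; the settings and the "speech.wav" filename below are assumptions, not the canonical pipeline.

```python
import torch
import torchaudio

# Assumed Tortoise-style settings -- take the real values from the
# linked conversion code rather than trusting these defaults.
SAMPLE_RATE = 22050
N_FFT = 1024
HOP_LENGTH = 256
N_MELS = 80

mel_stft = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS,
    power=2.0,
)

def wav_to_normalized_mel(wav: torch.Tensor, mel_norms: torch.Tensor) -> torch.Tensor:
    """Turn a mono waveform of shape (1, T) into a normalized log-MEL.

    `mel_norms` is the per-channel tensor of shape (N_MELS,) loaded from
    the mel_norms file shipped with Tortoise.
    """
    mel = mel_stft(wav)                          # (1, N_MELS, frames)
    mel = torch.log(torch.clamp(mel, min=1e-5))  # dynamic-range compression
    mel = mel / mel_norms.view(1, -1, 1)         # per-channel normalization
    return mel

wav, sr = torchaudio.load("speech.wav")  # placeholder input file
if sr != SAMPLE_RATE:
    wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
mel_norms = torch.load("mel_norms.pth")  # the file from Tortoise
mel = wav_to_normalized_mel(wav.mean(dim=0, keepdim=True), mel_norms)
```

Since the VQVAE only clusters MEL frames, whether you apply the final normalization step mostly changes the scale the codebook settles on, not whether training works.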
-
Hello,
Splendid work on the TTS! As a student, I have been trying to understand (and implement) how the VQ-VAE model is trained on audio. Specifically, are the inputs to the model images, raw mel spectrograms, or normalized MELs? How exactly should we transform the mel spectrograms fed to the model so that the VQ-VAE trains well?
Thanks a lot!