Hey, thanks! The simple answer is that you can use the conversion code here to generate the MELs I trained my VQVAE on. Use the mel_norms file from Tortoise. The more descriptive answer is that it doesn't matter: a VQVAE will learn unnormalized or normalized MELs equally well. ("Well" is a pretty ambitious term here. It does a horrible job; it simply learns to cluster MEL frames. That's where the diffusion model comes in.) You might also refer to this issue for more details about how the VQVAE is trained: neonbjb/DL-Art-School#7
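For illustration, here is a minimal sketch of that conversion. The exact parameter values (sample rate, FFT size, hop length, number of MEL channels) and the mel_norms file should come from the conversion code linked above; the settings and the "speech.wav" filename below are assumptions, not the canonical pipeline.

```python
import torch
import torchaudio

# Assumed Tortoise-style settings -- take the real values from the
# linked conversion code rather than trusting these defaults.
SAMPLE_RATE = 22050
N_FFT = 1024
HOP_LENGTH = 256
N_MELS = 80

mel_stft = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS,
    power=2.0,
)

def wav_to_normalized_mel(wav: torch.Tensor, mel_norms: torch.Tensor) -> torch.Tensor:
    """Turn a mono waveform of shape (1, T) into a normalized log-MEL.

    `mel_norms` is the per-channel tensor of shape (N_MELS,) loaded from
    the mel_norms file shipped with Tortoise.
    """
    mel = mel_stft(wav)                          # (1, N_MELS, frames)
    mel = torch.log(torch.clamp(mel, min=1e-5))  # dynamic-range compression
    mel = mel / mel_norms.view(1, -1, 1)         # per-channel normalization
    return mel

wav, sr = torchaudio.load("speech.wav")  # placeholder input file
if sr != SAMPLE_RATE:
    wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
mel_norms = torch.load("mel_norms.pth")  # the file from Tortoise
mel = wav_to_normalized_mel(wav.mean(dim=0, keepdim=True), mel_norms)
```

Since the VQVAE only clusters MEL frames, whether you apply the final normalization step mostly changes the scale the codebook settles on, not whether training works.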
-
Hello,
Splendid work on the TTS! As a student, I have been trying to understand (and implement) how the VQ-VAE model is trained on audio. Specifically, are the inputs to the model images, raw mel spectrograms, or normalized MELs? How exactly should we transform the mel spectrograms fed to the model so that the VQ-VAE trains well?
Thanks a lot!