Hi all, where would be a good place to ask technical questions related to the Glow-TTS paper?
I am trying to understand Glow-TTS and have a few doubts about the highlighted parts.
I understand that we are using an encoder -> decoder model.
DECODER:
I know normalising flows, which are used in the decoder.
The mel-spec can be seen as a collection of pixels (frames) indexed 1 to Tmel, but we do not know their probability density functions.
So, we want to train a decoder that takes z ~ N(z) and gives x ~ Pr(mel) as output.
However, we must not forget to condition on "c", since the output must also match the text input; then every pixel in the mel becomes conditionally independent. The dimensions of z and x are the same because we use flows. Let's call z the mel-spec latent pixel. Even in this latent space, we assume a normal distribution, but we do not have mu and sigma for z.
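To check my own understanding of that change-of-variables step, here is a tiny sketch of how I picture the decoder-side likelihood (the `flow` function and its signature are my own placeholders, not the actual Glow-TTS code):

```python
import math
import torch

def gaussian_log_prob(z, mu, log_sigma):
    # Diagonal-Gaussian log density, summed over all dimensions.
    return torch.sum(
        -0.5 * math.log(2 * math.pi)
        - log_sigma
        - 0.5 * (z - mu) ** 2 * torch.exp(-2 * log_sigma)
    )

def decoder_log_likelihood(x, mu, log_sigma, flow):
    # `flow` is a hypothetical invertible network returning the latent
    # z = f(x) together with log|det(dz/dx)|. Change of variables gives:
    #   log p(x|c) = log N(z; mu, sigma) + log|det(dz/dx)|
    z, log_det = flow(x)
    return gaussian_log_prob(z, mu, log_sigma) + log_det
```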
ENCODER:
Now, we know we want to use N(z), but we don't have the statistics (mu, sigma).
There are a total of Tmel distributions N(z_j) we need to know; the product of all those probabilities gives us the total Pr(z|c).
Then, to use the condition "c", i.e. the given text (individual characters), we create an alignment A which maps each mel-spec latent pixel (index j) to a text input index (i).
Naively we would need Tmel sets of statistics (mu_1, mu_2, ..., mu_Tmel, and the same for sigma). Somehow, we need to match the latent-space index j to a text input index i; then we only need Ttext sets of statistics instead of Tmel, since A(j) = i.
Knowing that several mel-spec pixels can correspond to the same character, A is surjective (many-to-one), and since we keep reading ahead (never turning back), it is also monotonic.
The total Pr(z|c; A, theta) is nothing but the multiplication of all the mel-spec latent pixel probabilities (a sum of logs in practice, for numerical stability with small numbers). If this goes up, then based on our "trained" decoder (imagine it's trained), we should get a reasonable mel-spec "x" as output given the text "c".
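In code, I imagine the total log-likelihood under a fixed alignment looking roughly like this (a minimal sketch with my own variable names, not the real implementation):

```python
import math
import torch

def prior_log_likelihood(z, mu, log_sigma, A):
    # z:         (T_mel, D)  latents produced by the decoder flow
    # mu:        (T_text, D) per-token means from the encoder
    # log_sigma: (T_text, D) per-token log std devs from the encoder
    # A:         (T_mel,)    alignment, A[j] = token index i for frame j
    mu_a = mu[A]                  # broadcast token stats onto frames
    log_sigma_a = log_sigma[A]
    log_probs = (
        -0.5 * math.log(2 * math.pi)
        - log_sigma_a
        - 0.5 * (z - mu_a) ** 2 * torch.exp(-2 * log_sigma_a)
    )
    # log P(z|c, A): sum of per-frame log densities
    # (i.e. the product of probabilities, taken in log space)
    return log_probs.sum()
```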
TRAINING:
To train this, we need the encoder to output some statistics mu and sigma. So maybe we first start from random mus and sigmas (Ttext of each). Once we have the mus and sigmas, we need to find the total P(z|c, A). Different alignments A give different P(z|c, A). The alignment A is a Tmel x Ttext matrix; we know Ttext from the input text "c" and Tmel from the decoder objective of generating the mel.
Then, using MAS (monotonic alignment search), we find the alignment that maximizes P(z|c, A) [Eq. 4].
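My mental picture of MAS is a Viterbi-style dynamic program over the Tmel x Ttext grid, where each frame either stays on the current token or advances by one. A toy version (the paper's actual implementation is an optimized Cython kernel, I believe):

```python
import numpy as np

def monotonic_alignment_search(log_p):
    # log_p[j, i] = log N(z_j; mu_i, sigma_i): frame-token log-likelihoods.
    # Returns A with A[j] = i, monotonic and surjective onto the tokens.
    T_mel, T_text = log_p.shape
    Q = np.full((T_mel, T_text), -np.inf)
    Q[0, 0] = log_p[0, 0]          # first frame must align to first token
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[j - 1, i]                        # same token as before
            move = Q[j - 1, i - 1] if i > 0 else -np.inf  # advance one token
            Q[j, i] = max(stay, move) + log_p[j, i]
    # Backtrack the best path from the last frame / last token.
    A = np.zeros(T_mel, dtype=int)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        A[j] = i
        if j > 0 and i > 0 and Q[j - 1, i - 1] >= Q[j - 1, i]:
            i -= 1
    return A
```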
Once the best alignment A* is found, we update the mus and sigmas to maximize P(z|c, A*) using gradient descent.
We sample z values from this distribution, and hopefully the decoder gives us the corresponding mel-spec; we use gradient descent here too.
While all of this is going on during training we know Tmel, a luxury we will not have during inference. So we also train a duration predictor with an MSE loss, which learns to predict the durations (and hence Tmel) given the Ttext tokens.
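If I'm reading it right, the duration targets come straight from the MAS alignment (count how many frames each token got), and the predictor is trained on the detached encoder hiddens so it doesn't disturb the MLE objective. A toy sketch (my linear layers are a placeholder; I believe the paper reuses FastSpeech's conv-based predictor):

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    # Placeholder architecture: a small MLP on the encoder hidden states.
    def __init__(self, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h):               # h: (T_text, hidden_dim)
        return self.net(h).squeeze(-1)  # predicted log-durations, (T_text,)

def duration_loss(pred_log_dur, A, T_text):
    # Target duration d_i = number of frames MAS assigned to token i.
    d = torch.bincount(A, minlength=T_text).float()
    # MSE in the log domain; the epsilon guards against log(0).
    # Call the predictor on h.detach() so this loss (stop-gradient)
    # does not leak into the encoder's MLE objective.
    return nn.functional.mse_loss(pred_log_dur, torch.log(d + 1e-8))
```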
INFERENCE:
If training goes well, the encoder hopefully knows how to generate A (via the duration predictor) and the corresponding mus and sigmas for z, and then the decoder translates z to x using the flow's inverse function.
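Putting the inference story together, this is how I picture it end to end (all the names here, `encoder`, `duration_predictor`, `flow.inverse`, are my placeholders, not the repo's API):

```python
import torch

def infer(text_tokens, encoder, duration_predictor, flow, temperature=1.0):
    # Hypothetical inference sketch, assuming the encoder returns hidden
    # states plus per-token prior statistics.
    h, mu, log_sigma = encoder(text_tokens)
    log_dur = duration_predictor(h)
    d = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
    # Expand per-token stats into per-frame stats using the predicted
    # durations; this plays the role of the alignment A at test time.
    mu_f = mu.repeat_interleave(d, dim=0)              # (T_mel, D)
    log_sigma_f = log_sigma.repeat_interleave(d, dim=0)
    # Sample z from the prior (temperature scales the std dev),
    # then run the flow backwards to get the mel-spectrogram.
    z = mu_f + torch.randn_like(mu_f) * torch.exp(log_sigma_f) * temperature
    return flow.inverse(z)
```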
Am I correctly understanding this so far? Thanks, and sorry for the long post; I'm new to this stuff.