Architecture description #73

e0xextazy · 2022-05-27T12:37:29Z

e0xextazy
May 27, 2022

Could you be more specific about the drawings that describe the architecture here?
https://nonint.com/2022/04/25/tortoise-architectural-design-doc/

neonbjb · 2022-05-27T17:51:43Z

neonbjb
May 27, 2022
Maintainer

I am midway through writing up a whitepaper that has more detail. If you have any specific questions, please ask.

0 replies

faad3 · 2022-05-30T11:52:41Z

faad3
May 30, 2022

@neonbjb Hello! As far as I understand, each of the models is trained separately. Could you clarify what the inputs and targets are during training for each model, and which loss functions are used?

3 replies

neonbjb May 30, 2022
Maintainer

Yes, they are trained separately.

VQVAE: inputs are the MEL, outputs are the same MEL, reconstructed.
Autoregressive: inputs are BPE-encoded text, a conditioning MEL from the same speaker, and the tokenized target MEL. Outputs are next-token predictions for both the text and MEL tokens.
CLVP: inputs are MEL and BPE-encoded text. The objective is a standard CLIP-style contrastive loss.
Diffusion decoder: two stages. In the first stage, inputs are a conditioning MEL from the same speaker and the tokenized MEL. In the second stage, inputs are the pre-logit outputs from the autoregressive model and the conditioning MEL. Targets for both stages are the target MEL. Staging is done here because computing the autoregressive outputs is expensive, so it is faster to train on the codes first.
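
For readers unfamiliar with the CLIP-style objective mentioned for CLVP, here is a minimal sketch in plain Python. It is not taken from the Tortoise code: the function names, the fixed `temperature` value, and the toy embeddings are assumptions for illustration (CLIP itself learns the temperature). It shows the general idea: matched (text, MEL) pairs in a batch should score higher than every mismatched pair, in both directions.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_style_loss(text_emb, mel_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired (text, MEL) embeddings.

    Row i of text_emb is assumed to belong with row i of mel_emb.
    The fixed temperature is an assumption made for this sketch.
    """
    t = [normalize(v) for v in text_emb]
    m = [normalize(v) for v in mel_emb]
    n = len(t)
    # logits[i][j] = similarity between text i and MEL j
    logits = [
        [sum(a * b for a, b in zip(t[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Text -> MEL direction: the correct class for row i is column i.
    loss_t2m = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # MEL -> text direction: same thing on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_m2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_t2m + loss_m2t)
```

With well-aligned embeddings the loss approaches zero, and it grows when pairs are mismatched.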

faad3 May 30, 2022

Thanks! Do I understand correctly that the tokenized MEL is the output of the VQVAE? Do the output probabilities of the autoregressive decoder have the same format as the tokenized MEL?

Could you also tell us more about the training process of the autoregressive decoder? I did not understand what its targets are. And which loss functions are used (not only for the autoregressive model)?

neonbjb May 30, 2022
Maintainer

> Do I understand correctly that the tokenized MEL is the output of VQVAE?

Correct.

> Do the output probabilities of the autoregressive decoder have the same format as the tokenized MEL?

Yep.

> Could you also tell us more about the training process of the autoregressive decoder? I did not understand what its targets are. And which loss functions are used (not only for the autoregressive model)?

See the DALL-E paper. My training process was nearly identical to theirs, except my tokens come from speech rather than images. I even used the same hyperparameters.
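
To make the "targets" question concrete, here is a rough sketch of next-token training on a combined text + MEL token stream, in the general style of DALL-E. It is a sketch under assumptions, not the Tortoise implementation: the helper names, the single shared vocabulary, and the offsetting of MEL codes by `text_vocab_size` are hypothetical. The point is only that the target at each position is the next token of the same sequence, scored with cross-entropy.

```python
import math


def build_sequence(text_tokens, mel_tokens, text_vocab_size):
    """Concatenate text tokens and MEL codes into one stream.

    MEL codes are shifted past the text vocabulary so both token types
    can share one output distribution (an assumption of this sketch).
    """
    return list(text_tokens) + [text_vocab_size + code for code in mel_tokens]


def next_token_pairs(sequence):
    """Inputs are every token but the last; targets are the same stream shifted by one."""
    return sequence[:-1], sequence[1:]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


sequence = build_sequence([5, 2, 9], [0, 3, 1], text_vocab_size=256)
inputs, targets = next_token_pairs(sequence)
```

Each position's loss is `cross_entropy(model_logits_at_position, target_at_position)`, averaged over the sequence; the model that produces the logits is left out here.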

e0xextazy · 2022-05-30T11:54:51Z

e0xextazy
May 30, 2022
Author

Which models need to be retrained? Which models can be left unchanged for another language?

4 replies

neonbjb May 30, 2022
Maintainer

Both the autoregressive model and the diffusion decoder would need to be retrained, and those are the most expensive ones.

Unfortunately this wasn't designed as a multilingual model. I think something like this could be built with a very similar dataset.

e0xextazy May 30, 2022
Author

Thank u!

e0xextazy May 30, 2022
Author

Can you tell me about the loss function for each model?

neonbjb May 30, 2022
Maintainer

They're all standard loss functions for their respective model types. For the autoregressive model, read the DALL-E paper. For the diffusion model, read "improved diffusion" or the original diffusion papers.
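
As a pointer for the diffusion side, below is a minimal sketch of the simple noise-prediction objective from the original DDPM formulation. This is an assumption about what "standard" covers here, and it deliberately leaves out the learned-variance term that the improved diffusion work adds. The function names are made up for this example; the linear beta range (1e-4 to 0.02) is the one used in the DDPM paper.

```python
import math


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (num_steps >= 2)."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, noise, alpha_bar_t):
    """Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * n for x, n in zip(x0, noise)]


def simple_loss(true_noise, predicted_noise):
    """Mean squared error between the sampled noise and the network's prediction."""
    return sum((p - n) ** 2 for p, n in zip(predicted_noise, true_noise)) / len(true_noise)
```

In training, the network would receive the noised MEL from `add_noise`, the timestep, and the conditioning inputs, and `simple_loss` would compare its output against the noise that was actually added.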

neonbjb · 2022-05-30T14:17:56Z

neonbjb
May 30, 2022
Maintainer

> very similar dataset.

Sorry, meant "very similar architecture".

0 replies

faad3 · 2022-06-10T14:09:06Z

faad3
Jun 10, 2022

Hello! It's me again) Do I understand correctly that the VAE you use picks the codebook index by argmax, in the style of VQ-VAE, rather than using the Gumbel-softmax relaxation as DALL-E does?

3 replies

neonbjb Jun 10, 2022
Maintainer

Hey, no worries at all. :)

Yes, Tortoise uses a VQVAE trained with the "argmax" style, not Gumbel relaxation. That being said, I do not think this distinction matters as long as you get a good reconstruction loss and good codebook usage.

The only reason I did not use Gumbel relaxation was that, at the time, I was unable to train my VQVAE with that method without a substantial amount of the codebook going unused. I've been experimenting with this a lot lately and have figured out ways to make this work (example). I am getting better reconstruction losses out of this new architecture than I did out of my lucidrains_dvae.py model. If I did it again, I would probably go this route.
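
For anyone following along, the "argmax" style lookup discussed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the Tortoise code: the function names and the toy codebook are assumptions. Each encoder output is replaced by its nearest codebook entry (equivalently, an argmax over negative distances), and a small helper reports what fraction of the codebook is actually used.

```python
def quantize(vector, codebook):
    """Return (index, code) of the nearest codebook entry by squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]


def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
encodings = [[0.9, 1.2], [1.1, 0.8], [0.1, -0.2]]
indices = [quantize(vec, codebook)[0] for vec in encodings]
```

Here the third code is never picked, so `codebook_usage(indices, len(codebook))` reports that only two of three entries are in use, which is the kind of under-utilization mentioned above.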

faad3 Jun 10, 2022

I got you. Is the codebook updated using EMA?

neonbjb Jun 10, 2022
Maintainer

Yes. This is the Quantizer: https://github.com/neonbjb/DL-Art-School/blob/master/codes/models/vqvae/vqvae.py#L31
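
As background, here is a small sketch of how an EMA codebook update generally works, following the formulation in the VQ-VAE paper. It is not a transcription of the linked quantizer; the function name, argument names, and the `eps` guard are assumptions for this example. Each code keeps a running count of assigned vectors and a running sum of those vectors, and the code itself is their ratio.

```python
def ema_update(codebook, cluster_size, embed_sum, encodings, assignments,
               decay=0.99, eps=1e-5):
    """Update codebook entries in place from one batch of assignments.

    cluster_size[i]: running count of vectors assigned to code i
    embed_sum[i]:    running sum of the vectors assigned to code i
    """
    num_codes, dim = len(codebook), len(codebook[0])
    counts = [0] * num_codes
    sums = [[0.0] * dim for _ in range(num_codes)]
    # Accumulate this batch's statistics per code.
    for vec, idx in zip(encodings, assignments):
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += vec[d]
    for i in range(num_codes):
        # Exponential moving averages of the count and the vector sum.
        cluster_size[i] = decay * cluster_size[i] + (1 - decay) * counts[i]
        for d in range(dim):
            embed_sum[i][d] = decay * embed_sum[i][d] + (1 - decay) * sums[i][d]
        # The code is the running mean; eps guards against division by zero.
        denom = max(cluster_size[i], eps)
        codebook[i] = [embed_sum[i][d] / denom for d in range(dim)]
    return codebook
```

Codes that receive assignments drift toward the mean of their assigned vectors, while codes that receive none keep their current value because both running statistics decay by the same factor.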

faad3 · 2022-06-14T15:01:43Z

faad3
Jun 14, 2022

Hello! I wondered why "codes or latents" is written in several places in your architecture diagrams. As far as I understand, you trained the diffusion model first on the pure VAE outputs, then on latents produced by the GPT (judging by the config that you provided in DL-Art-School), so how does it work?

4 replies

neonbjb Jun 14, 2022
Maintainer

Your understanding is right. After the diffusion network converged on codes, I transitioned to AR latents. The reasoning here is that forward propagating the AR network is very costly compared to generating codes. It would be acceptable (and might even give a better end result) to train on AR latents from the get-go.
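
The staging described above could be sketched roughly as follows. Everything here is hypothetical: `switch_step` stands in for "once the network has converged on codes" (the thread does not give a fixed step), and `compute_ar_latents` is a placeholder for a forward pass through the autoregressive model, which in practice also takes the text and conditioning inputs. The sketch only shows that the expensive call is skipped entirely during the first stage.

```python
def pick_conditioning(step, switch_step, mel_codes, compute_ar_latents):
    """Choose what the diffusion decoder is conditioned on at a given training step.

    Stage one uses the cheap VQVAE codes directly; stage two runs the
    (expensive) autoregressive model to obtain its pre-logit latents.
    """
    if step < switch_step:
        return "codes", mel_codes
    return "latents", compute_ar_latents(mel_codes)


# Toy stand-in for the autoregressive forward pass; records how often it runs.
ar_calls = []


def fake_ar_latents(codes):
    ar_calls.append(len(codes))
    return [[0.0, 0.0] for _ in codes]


stage_one = pick_conditioning(0, 10, [3, 1, 4], fake_ar_latents)
calls_after_stage_one = len(ar_calls)
stage_two = pick_conditioning(10, 10, [3, 1, 4], fake_ar_latents)
```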

faad3 Jun 14, 2022

I just don't quite understand why latents, rather than codes, are passed to the diffusion model.

neonbjb Jun 14, 2022
Maintainer

Latents produce considerably better results. Train both, you'll see. :)

faad3 Jun 14, 2022

Alright, thanks)
