This repository provides a multi-mode, multi-speaker expressive speech synthesis framework, including Tacotron2 with multiple attention mechanisms, DurIAN, and Non-attentive Tacotron.
The framework also includes several architectures for building prosody encoders: Global Style Tokens (GST), Variational Autoencoder (VAE), Gaussian Mixture Variational Autoencoder (GMVAE), and x-vectors.
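As an illustration of how a GST-style prosody encoder works, here is a minimal NumPy sketch (not the repository's code; function and variable names are illustrative): a reference-encoder summary of the utterance attends over a bank of learnable style tokens, and the attention-weighted sum becomes the style embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gst_style_embedding(ref_embedding, style_tokens):
    """Single-head attention over learnable style tokens (GST sketch).

    ref_embedding: (d,) reference-encoder summary of the utterance
    style_tokens:  (n_tokens, d) learnable token bank
    Returns a (d,) style embedding: attention-weighted sum of the tokens.
    """
    d = style_tokens.shape[-1]
    scores = style_tokens @ ref_embedding / np.sqrt(d)  # (n_tokens,)
    weights = softmax(scores)                           # sums to 1
    return weights @ style_tokens                       # (d,)

# Illustrative usage with random stand-ins for learned parameters.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 256))   # 10 style tokens of width 256
ref = rng.standard_normal(256)            # reference-encoder output
style = gst_style_embedding(ref, tokens)  # conditioning vector for the decoder
```

The original GST paper uses multi-head attention; a single head is shown here only to keep the mechanism visible.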
- Only the core model files are provided; data preparation, training, and synthesis scripts are not included
- See ExpressiveTacotron for example training scripts
- Tacotron2
- ForwardAttention
- DurIAN
- Non-attentive Tacotron
- GMMv2 Attention
- Dynamic Convolution Attention (TODO)
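Among the attention variants above, GMMv2 attention is location-relative: each mixture component's mean can only move forward along the encoder timeline. A minimal NumPy sketch of one decoder step, loosely following Battenberg et al. (2020) (not the repository's implementation; all names are illustrative):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gmm_v2_attention_step(prev_means, w_logits, delta_logits, sigma_logits, enc_len):
    """One decoder step of GMMv2 attention (sketch).

    The logits would come from a projection of the decoder state; softplus
    keeps the mean offsets non-negative, so the means advance monotonically.
    """
    w = softmax(w_logits)            # mixture weights, (K,)
    delta = softplus(delta_logits)   # non-negative mean offsets, (K,)
    sigma = softplus(sigma_logits)   # positive std devs, (K,)
    means = prev_means + delta       # monotonic mean update
    j = np.arange(enc_len)[None, :]  # encoder positions, (1, T)
    pdf = np.exp(-0.5 * ((j - means[:, None]) / sigma[:, None]) ** 2)
    pdf /= sigma[:, None] * np.sqrt(2 * np.pi)     # normalized Gaussians
    alignment = (w[:, None] * pdf).sum(axis=0)     # (T,)
    return alignment, means

# Two steps with zero logits: the means still drift forward by softplus(0).
prev = np.zeros(5)
align1, m1 = gmm_v2_attention_step(prev, np.zeros(5), np.zeros(5), np.zeros(5), 40)
align2, m2 = gmm_v2_attention_step(m1, np.zeros(5), np.zeros(5), np.zeros(5), 40)
```

The monotonic mean update is what makes this family of mechanisms robust on long utterances compared with purely content-based attention.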
Non-attentive Tacotron: the outputs of the duration predictor's stacked convolution layers are concatenated with the encoder outputs
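Non-attentive Tacotron replaces attention entirely: the predicted durations drive Gaussian upsampling of the encoder outputs. A NumPy sketch of that upsampling step, under the assumptions of the original paper (this is not the repository's code; names are illustrative):

```python
import numpy as np

def gaussian_upsample(encoder_out, durations, sigmas):
    """Gaussian upsampling (Non-attentive Tacotron, sketch).

    encoder_out: (N, d) encoder frames for N input tokens
    durations:   (N,) predicted duration of each token, in frames
    sigmas:      (N,) predicted range (std dev) of each token
    Each output frame is a weighted average of encoder frames, with
    weights from Gaussians centered at each token's duration midpoint.
    """
    centers = np.cumsum(durations) - 0.5 * durations       # (N,) midpoints
    total = int(np.round(durations.sum()))                 # output length T
    t = np.arange(total)[:, None] + 0.5                    # (T, 1) frame times
    w = np.exp(-0.5 * ((t - centers[None, :]) / sigmas[None, :]) ** 2)
    w /= w.sum(axis=1, keepdims=True)                      # normalize over tokens
    return w @ encoder_out                                 # (T, d)

# Four tokens lasting 2, 3, 1, and 4 frames -> 10 upsampled frames.
enc = np.ones((4, 8))
frames = gaussian_upsample(enc, np.array([2.0, 3.0, 1.0, 4.0]), np.ones(4))
```

Because the weights are differentiable in both durations and ranges, the duration model can be trained jointly with the rest of the network.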
This implementation borrows code from the following repositories: NVIDIA, ESPnet, ERISHA, and ForwardAttention.