Eren Gölge edited this page Mar 5, 2021 · 6 revisions

TTS is a deep-learning-based text-to-speech solution. It favors simplicity over large, complex models while still aiming for state-of-the-art results.

Based on user studies, TTS achieves results on par with or better than other commercial and open-source text-to-speech solutions. It supports multiple languages and has already been applied to more than 13 of them.

The general architecture we use comprises two separate deep neural networks. The first network computes acoustic features from the given text input. The second network produces the speech waveform from the computed acoustic features. We call the first model "text2feat" and the second "vocoder".
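The two-stage flow can be sketched in a few lines of Python. This is only an illustration of how the stages connect, not the project's actual API: `toy_text2feat`, `toy_vocoder`, `synthesize`, and the `n_feats` and `hop_length` parameters are hypothetical stand-ins, and the "models" are trivial functions rather than neural networks.

```python
import math
from typing import Callable, List

# Hypothetical type aliases: a frame is one vector of acoustic features,
# a waveform is a flat list of audio samples.
Frame = List[float]
Waveform = List[float]


def toy_text2feat(text: str, n_feats: int = 4) -> List[Frame]:
    """Stand-in for the first network: one fake feature frame per character."""
    return [[(ord(ch) % 32) / 32.0] * n_feats for ch in text]


def toy_vocoder(frames: List[Frame], hop_length: int = 8) -> Waveform:
    """Stand-in for the second network: emits hop_length samples per frame."""
    wav: Waveform = []
    for i, frame in enumerate(frames):
        amplitude = sum(frame) / len(frame)
        for k in range(hop_length):
            t = i * hop_length + k
            wav.append(amplitude * math.sin(2 * math.pi * t / hop_length))
    return wav


def synthesize(
    text: str,
    text2feat: Callable[[str], List[Frame]] = toy_text2feat,
    vocoder: Callable[[List[Frame]], Waveform] = toy_vocoder,
) -> Waveform:
    """Chain the two stages: text -> acoustic features -> waveform."""
    frames = text2feat(text)
    return vocoder(frames)


wav = synthesize("hello")
print(len(wav))  # number of generated samples
```

Because the two stages only share the feature representation, either one can in principle be swapped independently, as long as both agree on the feature format.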

TTS also provides a Speaker Encoder model that computes speaker embedding vectors for purposes such as speaker verification, speaker identification, and multi-speaker text-to-speech models.
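As a rough illustration of how embedding vectors can be used for speaker verification, the sketch below compares two embeddings with cosine similarity. Cosine scoring is a common choice but is an assumption here, not something this page specifies; `cosine_similarity`, `same_speaker`, and the `threshold` value are hypothetical, and the vectors are made up rather than produced by the Speaker Encoder.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(
    emb_a: Sequence[float], emb_b: Sequence[float], threshold: float = 0.75
) -> bool:
    """Accept the pair if similarity reaches a (hypothetical) threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Made-up embeddings: the first two point in the same direction.
print(same_speaker([1.0, 2.0, 0.5], [2.0, 4.0, 1.0]))
print(same_speaker([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

In practice the threshold would be tuned on held-out pairs of utterances rather than fixed in advance.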

We have currently implemented the following methods and models.

Text-to-Feat Models

Tricks for more efficient Tacotron learning.

Attention methods for Tacotron Models

  • Guided Attention: paper
  • Forward Backward Decoding: paper
  • Graves Attention: paper
  • Double Decoder Consistency: blog
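To give a feel for one of these methods, here is a minimal sketch of a guided attention penalty. It follows the commonly cited formulation in which each attention weight is penalized by W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)), so that alignments far from the diagonal cost more. This is an assumption about the general technique, not a copy of this project's implementation; the function names and the default `g = 0.2` are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def guided_attention_weights(n_text: int, n_frames: int, g: float = 0.2) -> Matrix:
    """Penalty matrix: near zero on the diagonal, approaching one far from it."""
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2 * g * g))
            for t in range(n_frames)
        ]
        for n in range(n_text)
    ]


def guided_attention_loss(attention: Matrix, g: float = 0.2) -> float:
    """Mean of the element-wise product of attention and penalty weights."""
    n_text = len(attention)
    n_frames = len(attention[0])
    weights = guided_attention_weights(n_text, n_frames, g)
    total = sum(
        attention[n][t] * weights[n][t]
        for n in range(n_text)
        for t in range(n_frames)
    )
    return total / (n_text * n_frames)


# Toy alignments: rows are text positions, columns are output frames.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
reversed_alignment = [[1.0 if n + t == 3 else 0.0 for t in range(4)] for n in range(4)]

# The reversed alignment is penalized more than the diagonal one.
print(guided_attention_loss(diagonal) < guided_attention_loss(reversed_alignment))
```

The penalty would typically be added to the main training loss so that the model is nudged toward monotonic, roughly diagonal alignments early in training.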

Speaker Encoder

Vocoders
