
yui-mhcp/text_to_speech


😋 Text To Speech (TTS)

Check the CHANGELOG file for a global overview of the latest modifications ! 😋

Project structure

├── custom_architectures
│   ├── tacotron2_arch.py       : Tacotron-2 synthesizer architecture
│   └── waveglow_arch.py        : WaveGlow vocoder architecture
├── custom_layers
├── custom_train_objects
│   ├── losses
│   │   └── tacotron_loss.py    : custom Tacotron2 loss
├── example_outputs         : some pre-computed audios (cf the `text_to_speech` notebook)
├── loggers
├── models
│   ├── encoder             : the `AudioEncoder` is used as speaker encoder for the SV2TTS model*
│   ├── tts
│   │   ├── sv2tts_tacotron2.py : SV2TTS main class
│   │   ├── tacotron2.py        : Tacotron2 main class
│   │   ├── vocoder.py          : main functions for complete inference
│   │   └── waveglow.py         : WaveGlow main class (both pytorch and tensorflow)
├── pretrained_models
├── unitests
├── utils
├── example_fine_tuning.ipynb
├── example_sv2tts.ipynb
├── example_tacotron2.ipynb
├── example_waveglow.ipynb
└── text_to_speech.ipynb

Check the main project for more information about the unextended modules / structure / main classes.

* Check the encoders project for more information about the models/encoder module

Available features

  • Text-To-Speech (module models.tts) :
| Feature | Function / class | Description |
| --- | --- | --- |
| Text-To-Speech | tts | performs TTS on the text you want with the model you want |
| stream | tts_stream | performs TTS on text you enter |
| TTS logger | loggers.TTSLogger | converts logging logs to voice and plays them |

The text_to_speech notebook provides a concrete demonstration of the tts function
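
The TTS logger row describes turning log records into speech. A minimal sketch of that idea, using only the standard `logging` module, could look like the following. `SpeechHandler` and the `speak` callback are hypothetical names chosen for illustration, not the project's `loggers.TTSLogger` API; a real callback would synthesize and play the audio instead of storing the text.

```python
import logging


class SpeechHandler(logging.Handler):
    """Hypothetical handler: forwards each formatted record to a `speak` callback."""

    def __init__(self, speak, level=logging.INFO):
        super().__init__(level=level)
        self.speak = speak  # assumption: any callable taking a string

    def emit(self, record):
        try:
            self.speak(self.format(record))
        except Exception:
            self.handleError(record)


spoken = []  # stand-in for real audio playback
logger = logging.getLogger("tts_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(SpeechHandler(spoken.append, level=logging.WARNING))

logger.info("not spoken, below the handler level")
logger.warning("training finished")
```

Setting a level on the handler keeps routine messages silent, so only the records worth hearing reach the speech callback.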

Available models

Model architectures

Available architectures :

  • Synthesizer :
    • Tacotron2 with extensions for multi-speaker (by ID or SV2TTS)
    • SV2TTS extension of the Tacotron2 architecture for multi-speaker based on speaker's embeddings*
  • Vocoder :
    • WaveGlow

The SV2TTS models are fine-tuned from pretrained Tacotron2 models using the partial transfer learning procedure (see below for details), which considerably speeds up training.

Model weights

| Name | Language | Dataset | Synthesizer | Vocoder | Speaker Encoder | Trainer | Weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pretrained_tacotron2 | en | LJSpeech | Tacotron2 | WaveGlow | / | NVIDIA | Google Drive |
| tacotron2_siwis | fr | SIWIS | Tacotron2 | WaveGlow | / | me | Google Drive |
| sv2tts_tacotron2_256 | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| sv2tts_siwis | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| sv2tts_tacotron2_256_v2 | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| sv2tts_siwis_v2 | fr | SIWIS | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |

Models must be unzipped in the pretrained_models/ directory !
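
As a rough illustration of the unzipping step, the sketch below extracts a downloaded archive into `pretrained_models/` with the standard `zipfile` module. The `install_model` helper is a hypothetical name, and the assumption that each archive holds one top-level folder named after the model is not confirmed by this README.

```python
import zipfile
from pathlib import Path


def install_model(archive_path, models_dir="pretrained_models"):
    """Extract a model archive into `models_dir` and return its top-level entries."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(models_dir)
        # Assumption: entries look like "<model_name>/..." inside the archive
        top_level = {name.split("/")[0] for name in archive.namelist()}
    return sorted(top_level)
```

Calling `install_model("sv2tts_siwis.zip")` from the repository root would then place the extracted files under `pretrained_models/`, where the archive name here is only an example.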

Important Note : the NVIDIA models available on torch hub require a compatible GPU with a correctly configured pytorch installation. This is why both models are also provided in the expected keras checkpoint format 😄

The sv2tts_siwis models are fine-tuned versions of sv2tts_tacotron2_256 on the SIWIS (single-speaker) dataset. Fine-tuning a multi-speaker model on a single-speaker dataset tends to improve stability and to produce a voice with more intonation, compared to simply training a single-speaker model.

Usage and demonstration

Demonstration

A Google Colab demo is available at this link !

You can also find some audio generated in example_outputs/, or directly in the Colab notebook ;)

Installation and usage

  1. Clone this repository : git clone https://github.com/yui-mhcp/text_to_speech.git
  2. Go to the root of this repository : cd text_to_speech
  3. Install requirements : pip install -r requirements.txt
  4. Open the text_to_speech notebook and follow the instructions !

You may have to install ffmpeg for audio loading / saving.
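
A quick way to check whether ffmpeg is already installed is to look for the executable on the `PATH`. The helper below is a small sketch with a hypothetical name (`ffmpeg_available`); it only detects the binary and does not verify that the project can actually use it.

```python
import shutil


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found: install it with your system package manager")
```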

TO-DO list :

  • Make the TO-DO list
  • Comment the code
  • Add pretrained weights for French
  • Make a Google Colab demonstration
  • Implement WaveGlow in tensorflow 2.x
  • Add batch_size support for vocoder inference
  • Add pretrained SV2TTS weights
  • Add a similarity loss to test a new training procedure for single-speaker fine-tuning
  • Add document parsing to perform TTS on document (in progress)
  • Add new languages support
  • Add new TTS architectures / models
  • Train a SV2TTS model based on an encoder trained with the GE2E loss
  • Add experimental support for long text inference
  • Add support for streaming inference

Multi-speaker Text-To-Speech

There are multiple ways to enable multi-speaker speech synthesis :

  1. Use a speaker ID that is embedded by a learnable Embedding layer. The speaker embedding is then learned during training.
  2. Use a Speaker Encoder (SE) to embed audio from the reference speaker. This is often referred to as zero-shot voice cloning, as it only requires a sample from the speaker (without training).
  3. Recently, a new prompt-based strategy has been proposed to control the speech with prompts.
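
The first strategy can be pictured as a simple lookup table with one trainable vector per speaker ID. The toy class below is a sketch in plain Python (no deep learning framework); `SpeakerEmbedding` and its sizes are illustrative assumptions, and a real model would use a trainable embedding layer updated by backpropagation.

```python
import random


class SpeakerEmbedding:
    """Toy lookup table: one vector per known speaker ID (strategy 1)."""

    def __init__(self, n_speakers, dim, seed=0):
        rng = random.Random(seed)
        # Random initial values; training would adjust them
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(n_speakers)
        ]

    def __call__(self, speaker_id):
        return self.table[speaker_id]


embed = SpeakerEmbedding(n_speakers=4, dim=8)
vector = embed(2)  # embedding of the speaker with ID 2
```

An ID outside the table (for example `embed(5)` here) raises an `IndexError`, which illustrates why this strategy cannot handle speakers unseen during training, whereas a Speaker Encoder computes an embedding from any audio sample.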

Automatic voice cloning with the SV2TTS architecture

Note : in the next paragraphs, encoder refers to the Tacotron Encoder part, while SE refers to a speaker encoder model (detailed below).

The basic intuition

The Speaker Encoder-based Text-To-Speech is inspired by the "From Speaker Verification To Text-To-Speech (SV2TTS)" paper. The authors proposed an extension of the Tacotron-2 architecture that includes information about the speaker's voice.

Here is a short overview of the proposed procedure :

  1. Train a model to identify speakers based on short audio samples : the speaker verification model. This model takes as input an audio sample (5-10 sec) from a speaker, and encodes it into a d-dimensional vector, named the embedding. This embedding aims to capture relevant information about the speaker's voice (e.g., frequencies, rhythm, pitch, ...).
  2. This pre-trained Speaker Encoder (SE) is then used to encode the voice of the speaker to clone.
  3. The produced embedding is then concatenated with the output of the Tacotron-2 encoder part, such that the Decoder has access to both the encoded text and the speaker embedding.

The objective is that the Decoder will learn to use the speaker embedding to copy its prosody / intonation / ... to read the text with the voice of this speaker.
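
The concatenation step can be sketched as follows: the same speaker embedding is appended to the encoder output at every time step. This is a plain-Python illustration with toy sizes; `condition_on_speaker` is a hypothetical helper, not a function of this project, and real implementations operate on batched tensors.

```python
def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same speaker embedding to every encoder time step."""
    return [list(step) + list(speaker_embedding) for step in encoder_outputs]


# Toy sizes (assumptions): 3 time steps, 4 encoder features, 2 embedding dims
encoder_outputs = [[0.1 * t] * 4 for t in range(3)]
speaker_embedding = [0.5, -0.5]

decoder_inputs = condition_on_speaker(encoder_outputs, speaker_embedding)
print(len(decoder_inputs), len(decoder_inputs[0]))
```

Each time step grows from 4 to 6 features, so the decoder sees the encoded text and the speaker identity side by side.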

Limitations and solutions

There are some limitations with the above approach :

  • A perfect generalization to new speakers is really difficult, as it would require large datasets with many speakers.
  • The audio should not have any noise / artifacts to avoid noisy synthetic audios.
  • The Speaker Encoder has to correctly separate speakers, and encode their voice in a meaningful way for the synthesizer.

To tackle these limitations, the proposed solution is to perform a 2-step training :

  • First train a low-quality multi-speaker model on the CommonVoice database. This is one of the largest multilingual audio databases, at the cost of noisy / variable quality audios. It is therefore not suitable for training good quality models, although pre-processing still helps to obtain intelligible audios.
  • Once a multi-speaker model is trained, a single-speaker database with a small amount of good quality data can be used to fine-tune the model on a single speaker. This allows the model to learn faster, with only a limited amount of good quality data, and to produce really good quality audios !

The Speaker Encoder (SE)

The SE part should be able to differentiate speakers, and embed (encode in a 1-D vector) them in a meaningful way.

The model used in the paper is a 3-layer LSTM model with a normalization layer, trained with the GE2E loss. The major limitation is that training this model is really slow : it took 2 weeks on 4 GPUs in the CorentinJ master thesis (cf his github).

This project proposes a simpler architecture based on Convolutional Neural Networks (CNN), which is much faster to train than LSTM networks. Furthermore, the Euclidean distance is used rather than the cosine metric, as it has shown faster convergence. Additionally, a custom cache-based generator is proposed to speed up audio processing. These modifications made it possible to train a 99% accuracy model within 2-3 hours on a single RTX 3090 GPU !
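
To illustrate how embeddings and the Euclidean distance can separate speakers, here is a minimal sketch that assigns a new embedding to the closest enrolled speaker centroid. The 2-dimensional vectors, the speaker names, and the `identify` helper are made up for the example; this is not the project's training or evaluation code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def identify(embedding, enrolled):
    """Return the enrolled speaker whose centroid is closest to `embedding`."""
    return min(enrolled, key=lambda name: euclidean(embedding, enrolled[name]))


# Toy embeddings (assumptions): two samples per enrolled speaker
enrolled = {
    "alice": centroid([[0.9, 0.1], [1.1, -0.1]]),
    "bob": centroid([[-1.0, 0.8], [-0.8, 1.0]]),
}
print(identify([0.8, 0.2], enrolled))
```

A good Speaker Encoder keeps samples of the same speaker close to their centroid and far from the others, which is what makes this nearest-centroid decision reliable.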

The partial Transfer Learning procedure

In order to avoid training a SV2TTS model from scratch, which would be completely impossible on a single GPU, a new partial transfer learning procedure is proposed.

This procedure takes a pre-trained model with a slightly different architecture, and transfers all the common weights (like in regular transfer learning). For the layers whose weights have a different shape, only the common part is transferred, while the remaining weights are initialized to zeros. This results in a new model, with differently shaped weights, that mimics the behavior of the original model.

In the SV2TTS architecture, the speaker embedding is passed to the recurrent layer of the Tacotron2 decoder. This results in a different input shape, and therefore a different shape for the layer's weight matrix. The partial transfer learning makes it possible to initialize the model such that it replicates the behavior of the original single-speaker Tacotron2 model !
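
The effect of zero-initializing the extra weights can be checked on a toy linear layer. In the sketch below, `partial_transfer` and `matvec` are hypothetical helpers written in plain Python, and the layout (speaker embedding appended after the existing input features) is an assumption made for the illustration rather than a description of the project's code.

```python
def matvec(weights, inputs):
    """Multiply an (n_out x n_in) weight matrix by an input vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def partial_transfer(pretrained, n_out, n_in):
    """Copy the overlapping block of `pretrained` into a zero-filled (n_out x n_in) matrix."""
    new = [[0.0] * n_in for _ in range(n_out)]
    for i, row in enumerate(pretrained[:n_out]):
        for j, value in enumerate(row[:n_in]):
            new[i][j] = value
    return new


# Pretrained single-speaker layer: 2 outputs, 3 text features (toy values)
pretrained = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5]]
# The new layer also receives a 2-dim speaker embedding, hence 5 inputs
extended = partial_transfer(pretrained, n_out=2, n_in=5)

text_features = [1.0, 2.0, 3.0]
speaker_embedding = [0.7, -0.3]
original_out = matvec(pretrained, text_features)
extended_out = matvec(extended, text_features + speaker_embedding)
```

Because the columns that multiply the speaker embedding start at zero, `extended_out` matches `original_out` whatever the embedding is; fine-tuning then only has to learn how to use the new inputs.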

Contacts and licence

Contacts :

  • Mail : yui-mhcp@tutanota.com
  • Discord : yui0732

Terms of use

The goal of these projects is to support and advance education and research in Deep Learning technology. To facilitate this, all associated code is made available under the GNU Affero General Public License (AGPL) v3, supplemented by a clause that prohibits commercial use (cf the LICENCE file).

These projects are released as "free software", allowing you to freely use, modify, deploy, and share the software, provided you adhere to the terms of the license. While the software is freely available, it is not public domain and retains copyright protection. The license conditions are designed to ensure that every user can utilize and modify any version of the code for their own educational and research projects.

If you wish to use this project in a proprietary commercial endeavor, you must obtain a separate license. For further details on this process, please contact me directly.

For my protection, it is important to note that all projects are available on an "As Is" basis, without any warranties or conditions of any kind, either explicit or implied. However, do not hesitate to report issues on the repository, or to open a Pull Request to fix them 😄

Citation

If you find this project useful in your work, please add this citation to give it more visibility ! 😋

@misc{yui-mhcp,
    author  = {yui},
    title   = {A Deep Learning projects centralization},
    year    = {2021},
    publisher   = {GitHub},
    howpublished    = {\url{https://github.com/yui-mhcp}}
}

Notes and references

The code for this project is a mixture of multiple GitHub projects, combined to obtain a fully modular Tacotron-2 implementation.

Papers :
