
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Jaehyeon Kim, Jungil Kong, and Juhee Son

In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
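For reference, the core objective behind this description is the conditional ELBO from the paper, with the adversarial, feature-matching, and duration losses added on top. The line below is only a compact restatement of that bound (x: target audio, c: text condition, z: latent), not a new derivation:

\log p_\theta(x \mid c) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid c)\big)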

Visit our demo for audio samples.

We also provide the pretrained models.

Update note: Thanks to Rishikesh (ऋषिकेश), our interactive TTS demo is now available as a Colab Notebook.

Figures: VITS at training and VITS at inference.

Installation

Clone the repo

git clone git@github.com:daniilrobnikov/vits.git
cd vits

Setting up the conda env

This assumes you have navigated to the vits root directory after cloning it.

NOTE: This is tested under Python 3.11 with a conda env. For other Python versions, you might encounter version conflicts.

PyTorch 2.0: please refer to requirements.txt.

# install required packages (for pytorch 2.0)
conda create -n vits python=3.11
conda activate vits
pip install -r requirements.txt
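As a quick, optional sanity check (not part of the repo), you can confirm that the environment matches what this README assumes, i.e. Python 3.11 and PyTorch 2.x, with an optional CUDA GPU:

# optional environment check; verifies the versions this README assumes
import sys
import torch

print("python:", sys.version.split()[0])      # expected: 3.11.x
print("torch :", torch.__version__)           # expected: 2.x
print("cuda  :", torch.cuda.is_available())   # True if a GPU build of PyTorch is installed
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))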

Download datasets

There are three options you can choose from: LJ Speech, VCTK, and custom dataset.

  1. LJ Speech: LJ Speech dataset, used for single-speaker TTS.
  2. VCTK: VCTK dataset, used for multi-speaker TTS.
  3. Custom dataset: you can use your own dataset. See the Custom dataset section below.

LJ Speech dataset

  1. download and extract the LJ Speech dataset
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvf LJSpeech-1.1.tar.bz2
  2. rename or create a link to the dataset folder (a quick sanity check follows these steps)
ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
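As a sanity check (a hypothetical snippet, not part of the repo), you can confirm that the DUMMY1 link resolves and that the clips are 22050 Hz mono, which is what the LJ Speech configs expect; this assumes torchaudio is available in your environment:

# hypothetical sanity check for the DUMMY1 link (assumes torchaudio is installed)
from pathlib import Path
import torchaudio

wavs = sorted(Path("DUMMY1").glob("*.wav"))
print(f"found {len(wavs)} wav files")          # LJ Speech contains 13,100 clips
meta = torchaudio.info(str(wavs[0]))
print(meta.sample_rate, meta.num_channels)     # expected: 22050 1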

VCTK dataset

  1. download and extract the VCTK dataset
wget https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip
unzip VCTK-Corpus-0.92.zip
  2. (optional) downsample the audio files to 22050 Hz. See audio_resample.ipynb (a minimal resampling sketch also follows these steps)
  3. rename or create a link to the dataset folder
ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
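audio_resample.ipynb is the repo's own tool for the downsampling step. The snippet below is only a minimal stand-alone sketch of the same idea using torchaudio; the source directory name and the flac extension follow the VCTK 0.92 release and are assumptions, so adjust the paths to your layout:

# minimal resampling sketch (not audio_resample.ipynb): downsample VCTK to 22050 Hz
from pathlib import Path
import torchaudio

SRC = Path("/path/to/VCTK-Corpus/wav48_silence_trimmed")   # hypothetical source directory
DST = Path("/path/to/VCTK-Corpus/downsampled_wavs")
DST.mkdir(parents=True, exist_ok=True)

for src_path in SRC.rglob("*.flac"):                        # VCTK 0.92 ships flac files
    wav, sr = torchaudio.load(str(src_path))
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=22050)
    torchaudio.save(str(DST / (src_path.stem + ".wav")), wav, 22050)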

Custom dataset

  1. create a folder with wav files
  2. create a configuration file in configs. Change the following fields in custom_base.json (an example filelist layout is shown after these steps):
{
  "data": {
    "training_files": "filelists/custom_audio_text_train_filelist.txt.cleaned", // path to the cleaned training filelist
    "validation_files": "filelists/custom_audio_text_val_filelist.txt.cleaned", // path to the cleaned validation filelist
    "text_cleaners": ["english_cleaners2"], // text cleaner
    "bits_per_sample": 16, // bit depth of the wav files
    "sampling_rate": 22050, // sampling rate, in case you resampled your wav files
    ...
    "n_speakers": 0, // number of speakers in your dataset if you use the multi-speaker setting
    "cleaned_text": true // set to true if you have already cleaned your text (see text_phonemizer.ipynb)
  },
  ...
}
  3. install espeak-ng (optional)

NOTE: This is required for preprocess.py and the inference.ipynb notebook to work. If you don't need them, you can skip this step. Please refer to espeak-ng.

  4. preprocess text. You can do this step by step with preprocess.py or the text_phonemizer.ipynb notebook (a minimal phonemization sketch follows these steps).

  5. rename or create a link to the dataset folder

ln -s /path/to/custom_dataset DUMMY3
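The filelists referenced in the config are plain text files with one utterance per line. In upstream VITS they are pipe-separated (wav path and text for single-speaker data, with a speaker id in between for multi-speaker data); the two lines below are only a hypothetical illustration of that layout, so check the filelists/ folder for the exact format this fork expects:

DUMMY3/utt_0001.wav|ðɪs ɪz ɐ klˈiːnd ɛɡzˈæmpəl lˈaɪn.
DUMMY3/utt_0002.wav|3|ɐ mˈʌlti spˈiːkɚ lˈaɪn wɪð ɐ spˈiːkɚ ˈaɪdiː.

preprocess.py and text_phonemizer.ipynb are the repo's own tools for producing the *.cleaned text. Purely as an illustration of what a cleaned line looks like, here is a minimal sketch that calls the phonemizer package directly (the phonemize call and the espeak backend come from the phonemizer library, not from this repo's code, and they require espeak-ng as noted above):

# minimal phonemization sketch (illustration only; the repo's own tooling is preprocess.py)
from phonemizer import phonemize

text = "VITS is an end to end text to speech model."
phones = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phones)   # IPA-like phone string, roughly what the *.cleaned filelists contain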

Training Examples

# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base

# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base

# Custom dataset (multi-speaker)
python train_ms.py -c configs/custom_base.json -m custom_base

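Before launching a long run, it can help to double-check what the chosen config will actually apply. The snippet below is just a hedged convenience: the key names follow the upstream VITS config layout (e.g. train.batch_size, data.sampling_rate) and could differ in a customized config:

# hypothetical config inspection; key names follow upstream VITS configs and may differ here
import json

with open("configs/ljs_base.json") as f:
    hps = json.load(f)

print("epochs        :", hps["train"]["epochs"])
print("batch_size    :", hps["train"]["batch_size"])
print("learning_rate :", hps["train"]["learning_rate"])
print("sampling_rate :", hps["data"]["sampling_rate"])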

Inference Example

See inference.ipynb. For inference on multiple sentences, see inference_batch.ipynb.
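inference.ipynb is the reference for this fork. The outline below mirrors the upstream VITS single-sentence inference flow; SynthesizerTrn, utils.load_checkpoint, text_to_sequence, and the noise/length scales are taken from the original VITS code and may have been renamed or refactored here, so treat it as a sketch rather than a drop-in script:

# rough single-sentence inference outline, following upstream VITS (names may differ in this fork)
import torch
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/ljs_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model,
).eval()
utils.load_checkpoint("/path/to/pretrained_ljs.pth", net_g, None)

seq = torch.LongTensor(text_to_sequence("VITS is awesome!", hps.data.text_cleaners))
with torch.no_grad():
    x, x_lengths = seq.unsqueeze(0), torch.LongTensor([seq.size(0)])
    audio = net_g.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0)[0][0, 0]
# audio is a waveform tensor at hps.data.sampling_rate; write it to disk with your preferred wav writer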

Pretrained Models

We also provide the pretrained models.

Audio Samples

Todo

  • text preprocessing
    • update cleaners for multi-language support with 100+ languages
    • update vocabulary to support all symbols and features from IPA. See phonemes.md
    • handle unknown, out-of-vocabulary symbols. Please refer to vocab.py and vocab - TorchText
    • remove cleaners from text preprocessing. Most cleaners are already implemented in phonemizer. See cleaners.py
    • remove the necessity for speaker indexing. See vits/issues/58
  • audio preprocessing
  • filelists preprocessing
  • other
    • rewrite code for python 3.11
    • replace Cython Monotonic Alignment Search with numba.jit. See vits-finetuning
    • update inference to support batch processing
  • pretrained models
    • train the model for the Bengali language (for now: 55,000 iterations, ~26 epochs)
    • add pretrained models for multiple languages
  • future work
    • update the model to NaturalSpeech. Please refer to naturalspeech
    • add support for streaming. Please refer to vits_chinese
    • update naturalspeech to multi-speaker
    • replace speakers with multi-speaker embeddings
    • replace speakers with multilingual training. Each speaker is a language with the same IPA symbols
    • add support for in-context learning

Acknowledgements

  • This repo is based on VITS
  • The text-to-phones converter for multiple languages is based on phonemizer
  • We also thank ChatGPT for providing writing assistance.

References

vits
