In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
Visit our demo for audio samples.
We also provide the pretrained models.
**Update note:** Thanks to Rishikesh (ऋषिकेश), our interactive TTS demo is now available as a Colab Notebook.
Figure: VITS at training (left) and VITS at inference (right).
Clone the repo
git clone git@github.com:daniilrobnikov/vits.git
cd vits
This assumes you have navigated to the vits root after cloning it.
NOTE: This is tested under python3.11 with a conda env. For other Python versions, you might encounter version conflicts.
PyTorch 2.0 is required; please refer to requirements.txt.
# install required packages (for pytorch 2.0)
conda create -n vits python=3.11
conda activate vits
pip install -r requirements.txt
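Optionally, a quick sanity check confirms that the expected PyTorch version and a GPU are visible from the new environment (a small ad-hoc snippet, not part of the repo):

# optional sanity check for the freshly created environment
import torch

print(f"PyTorch version: {torch.__version__}")          # expect a 2.0.x build
print(f"CUDA available:  {torch.cuda.is_available()}")  # expect True for GPU training
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")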
There are three options you can choose from: LJ Speech, VCTK, and custom dataset.
- LJ Speech: LJ Speech dataset. Used for single-speaker TTS.
- VCTK: VCTK dataset. Used for multi-speaker TTS.
- Custom dataset: You can use your own dataset. Please refer to the custom dataset section below.
- download and extract the LJ Speech dataset
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvf LJSpeech-1.1.tar.bz2
- rename or create a link to the dataset folder
ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
- download and extract the VCTK dataset
wget https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip
unzip VCTK-Corpus-0.92.zip
- (optional): downsample the audio files to 22050 Hz. See audio_resample.ipynb (a minimal torchaudio sketch also follows this list)
- rename or create a link to the dataset folder
ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
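If you prefer a plain script over the notebook, the sketch below shows the idea behind the resampling step with torchaudio; the source/target paths are placeholders, and audio_resample.ipynb remains the reference implementation.

# minimal sketch: downsample VCTK audio to 22050 Hz with torchaudio
# (paths are placeholders; adjust the glob if your copy uses .wav instead of .flac)
from pathlib import Path

import torchaudio
from torchaudio.transforms import Resample

TARGET_SR = 22050
src_dir = Path("/path/to/VCTK-Corpus/wav48_silence_trimmed")  # placeholder source folder
dst_dir = Path("/path/to/VCTK-Corpus/downsampled_wavs")       # placeholder target folder
dst_dir.mkdir(parents=True, exist_ok=True)

for src_path in src_dir.rglob("*.flac"):
    waveform, sr = torchaudio.load(str(src_path))  # float32 waveform, normalized to [-1, 1]
    if sr != TARGET_SR:
        waveform = Resample(orig_freq=sr, new_freq=TARGET_SR)(waveform)
    torchaudio.save(str(dst_dir / (src_path.stem + ".wav")), waveform, TARGET_SR)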
- create a folder with wav files
- create a configuration file in configs. Change the following fields in custom_base.json:
{
  "data": {
    "training_files": "filelists/custom_audio_text_train_filelist.txt.cleaned", // path to the cleaned training filelist
    "validation_files": "filelists/custom_audio_text_val_filelist.txt.cleaned", // path to the cleaned validation filelist
    "text_cleaners": ["english_cleaners2"], // text cleaner
    "bits_per_sample": 16, // bit depth of the wav files
    "sampling_rate": 22050, // sampling rate (change if you resampled your wav files)
    ...
    "n_speakers": 0, // number of speakers in your dataset if you use the multi-speaker setting
    "cleaned_text": true // set to true if you have already cleaned your text (see text_phonemizer.ipynb)
  },
  ...
}
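In the upstream VITS repo, each filelist line pairs an audio path with its transcript ("wav_path|text", or "wav_path|speaker_id|text" for multi-speaker data); text_split.ipynb shows the exact layout used here. As an optional check before training, a small sketch like the one below (file names are illustrative) verifies that the audio paths in a cleaned filelist actually exist:

# minimal sketch: verify that every audio path in a cleaned filelist exists
# assumes the upstream VITS layout: "wav_path|text" or "wav_path|speaker_id|text"
from pathlib import Path

filelist = "filelists/custom_audio_text_train_filelist.txt.cleaned"

missing = []
with open(filelist, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        wav_path = line.rstrip("\n").split("|")[0]  # the audio path is the first field
        if not Path(wav_path).is_file():
            missing.append((line_no, wav_path))

print(f"{len(missing)} missing audio file(s)")
for line_no, wav_path in missing[:10]:
    print(f"  line {line_no}: {wav_path}")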
- install espeak-ng (optional)
NOTE: This is required for preprocess.py and the inference.ipynb notebook to work. If you don't need them, you can skip this step. Please refer to espeak-ng.
- preprocess text
You can do this step by step:
- create a dataset of text files. See text_dataset.ipynb
- phonemize or just clean up the text. Please refer to text_phonemizer.ipynb (a minimal phonemizer sketch also follows this list)
- create filelists and their cleaned versions with a train/test split. See text_split.ipynb
- rename or create a link to the dataset folder. Please refer to text_split.ipynb
ln -s /path/to/custom_dataset DUMMY3
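As a reference for the phonemization step above, here is a minimal sketch using the phonemizer package with the espeak-ng backend; text_phonemizer.ipynb remains the authoritative pipeline, and the example sentence is arbitrary.

# minimal sketch: convert raw text to IPA phonemes with phonemizer + espeak-ng
from phonemizer import phonemize

text = "Printing, in the only sense with which we are at present concerned."
phones = phonemize(
    text,
    language="en-us",           # espeak-ng language code
    backend="espeak",           # requires espeak-ng to be installed (see above)
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phones)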
# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base
# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base
# Custom dataset (multi-speaker)
python train_ms.py -c configs/custom_base.json -m custom_base
See inference.ipynb. For inference on multiple sentences, see inference_batch.ipynb.
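For a quick reference, the sketch below follows the upstream VITS inference notebook; the module and function names (utils.get_hparams_from_file, SynthesizerTrn, text_to_sequence, commons.intersperse) come from the original repo and may differ slightly in this fork, so treat inference.ipynb as the source of truth.

# minimal single-sentence inference sketch, adapted from the upstream VITS notebook
# (names follow the original repo; check inference.ipynb for this fork's exact API)
import torch

import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols


def get_text(text, hps):
    seq = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        seq = commons.intersperse(seq, 0)  # insert a blank token between symbols
    return torch.LongTensor(seq)


hps = utils.get_hparams_from_file("configs/ljs_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model,
).eval()
utils.load_checkpoint("/path/to/pretrained_ljs.pth", net_g, None)  # placeholder checkpoint path

x = get_text("VITS is awesome!", hps).unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)])
with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0)[0][0, 0]
# audio is a 1-D float tensor at hps.data.sampling_rate, ready to save with e.g. torchaudio.save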
We also provide the pretrained models.
- text preprocessing
- update cleaners for multi-language support with 100+ languages
- update vocabulary to support all symbols and features from IPA. See phonemes.md
- handle unknown, out-of-vocabulary symbols. Please refer to vocab.py and vocab - TorchText (a minimal sketch follows this group)
- remove cleaners from text preprocessing. Most cleaners are already implemented in phonemizer. See cleaners.py
- remove the need for speaker indexing. See vits/issues/58
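As a sketch of the out-of-vocabulary handling referenced above, torchtext's vocab can fall back to an <unk> index for unseen symbols; the symbol lists below are illustrative, and vocab.py is the reference.

# minimal sketch: map phoneme symbols to ids with an <unk> fallback via torchtext
from torchtext.vocab import build_vocab_from_iterator

phoneme_sequences = [["h", "ə", "l", "oʊ"], ["w", "ɜː", "l", "d"]]  # illustrative data

vocab = build_vocab_from_iterator(phoneme_sequences, specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])  # unknown symbols map to <unk> instead of raising

print(vocab(["h", "ə", "?"]))  # "?" is out of vocabulary -> resolves to the <unk> index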
- audio preprocessing
- batch audio resampling. Please refer to audio_resample.ipynb
- code snippets to find corrupted files in the dataset. Please refer to audio_find_corrupted.ipynb
- code snippets to delete files by extension in the dataset. Please refer to delete_by_ext.ipynb
- replace scipy and librosa dependencies with torchaudio. See load and MelScale docs (a minimal sketch follows this group)
- automatic audio range normalization. Please refer to Loading audio data - Torchaudio docs
- add support for stereo audio (multi-channel). See Loading audio data - Torchaudio docs
- add support for various audio bit depths (bits per sample). See load - Torchaudio docs
- add support for various sample rates. Please refer to load - Torchaudio docs
- test stereo audio (multi-channel) training
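A minimal sketch of the torchaudio-based loading and mel-feature path referenced in this group (MelSpectrogram combines a spectrogram with MelScale; the parameter values are illustrative, and the repo's configs/*.json are authoritative):

# minimal sketch: torchaudio-based loading and mel features (librosa/scipy replacement)
import torchaudio
from torchaudio.transforms import MelSpectrogram

wav_path = "DUMMY1/LJ001-0001.wav"  # placeholder path

# torchaudio.load normalizes integer PCM of any bit depth to float32 in [-1, 1] by default
waveform, sr = torchaudio.load(wav_path)  # shape: (channels, samples); stereo keeps 2 channels

mel = MelSpectrogram(
    sample_rate=sr,
    n_fft=1024,       # filter_length in the config
    hop_length=256,   # hop_length in the config
    win_length=1024,  # win_length in the config
    n_mels=80,        # n_mel_channels in the config
)(waveform)
print(mel.shape)  # (channels, n_mels, frames)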
- filelists preprocessing
- add filelists preprocessing for multi-speaker. Please refer to text_split.ipynb
- code snippets for train/test split. Please refer to text_split.ipynb
- notebook to link filelists with the actual wavs. Please refer to text_split.ipynb
- other
- rewrite code for python 3.11
- replace Cython Monotonic Alignment Search with numba.jit. See vits-finetuning
- updated inference to support batch processing
- pretrained models
- training the model for the Bengali language (currently 55,000 iterations, ~26 epochs)
- add pretrained models for multiple languages
- future work
- update the model to naturalspeech. Please refer to naturalspeech
- add support for streaming. Please refer to vits_chinese
- update naturalspeech to multi-speaker
- replace speakers with multi-speaker embeddings
- replace speakers with multilingual training. Each speaker is a language with the same IPA symbols
- add support for in-context learning
- This repo is based on VITS
- The text-to-phones converter for multiple languages is based on phonemizer
- We also thank ChatGPT for providing writing assistance.