In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
Visit our demo for audio samples.
We also provide the pretrained models.
**Update note:** Thanks to Rishikesh (ऋषिकेश), our interactive TTS demo is now available as a Colab Notebook.
Figure: VITS at training (left) and VITS at inference (right).
Clone the repo
git clone git@github.com:daniilrobnikov/vits.git
cd vits
This assumes you have navigated to the vits root after cloning it.
NOTE: This is tested under python3.11 with a conda env. For other Python versions, you might encounter version conflicts.
PyTorch 2.0 is required; please refer to requirements.txt.
# install required packages (for pytorch 2.0)
conda create -n vits python=3.11
conda activate vits
pip install -r requirements.txt
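Optionally, a quick sanity check confirms that the expected PyTorch version and a GPU are visible from the new environment (a small ad-hoc snippet, not part of the repo):

# optional sanity check for the freshly created environment
import torch

print(f"PyTorch version: {torch.__version__}")          # expect a 2.0.x build
print(f"CUDA available:  {torch.cuda.is_available()}")  # expect True for GPU training
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")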
There are three options you can choose from: LJ Speech, VCTK, and custom dataset.
- LJ Speech: LJ Speech dataset. Used for single-speaker TTS.
- VCTK: VCTK dataset. Used for multi-speaker TTS.
- Custom dataset: You can use your own dataset. Please refer to the custom dataset section below.
- download and extract the LJ Speech dataset
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvf LJSpeech-1.1.tar.bz2
- rename or create a link to the dataset folder
ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
- download and extract the VCTK dataset
wget https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip
unzip VCTK-Corpus-0.92.zip
- (optional): downsample the audio files to 22050 Hz. See audio_resample.ipynb (a minimal torchaudio sketch also follows this list)
- rename or create a link to the dataset folder
ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
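If you prefer a plain script over the notebook, the sketch below shows the idea behind the resampling step with torchaudio; the source/target paths are placeholders, and audio_resample.ipynb remains the reference implementation.

# minimal sketch: downsample VCTK audio to 22050 Hz with torchaudio
# (paths are placeholders; adjust the glob if your copy uses .wav instead of .flac)
from pathlib import Path

import torchaudio
from torchaudio.transforms import Resample

TARGET_SR = 22050
src_dir = Path("/path/to/VCTK-Corpus/wav48_silence_trimmed")  # placeholder source folder
dst_dir = Path("/path/to/VCTK-Corpus/downsampled_wavs")       # placeholder target folder
dst_dir.mkdir(parents=True, exist_ok=True)

for src_path in src_dir.rglob("*.flac"):
    waveform, sr = torchaudio.load(str(src_path))  # float32 waveform, normalized to [-1, 1]
    if sr != TARGET_SR:
        waveform = Resample(orig_freq=sr, new_freq=TARGET_SR)(waveform)
    torchaudio.save(str(dst_dir / (src_path.stem + ".wav")), waveform, TARGET_SR)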
- create a folder with wav files
- create a configuration file in configs. Change the following fields in custom_base.json:
{
  "data": {
    "training_files": "filelists/custom_audio_text_train_filelist.txt.cleaned", // path to the cleaned training filelist
    "validation_files": "filelists/custom_audio_text_val_filelist.txt.cleaned", // path to the cleaned validation filelist
    "text_cleaners": ["english_cleaners2"], // text cleaner
    "bits_per_sample": 16, // bit depth of the wav files
    "sampling_rate": 22050, // sampling rate (change if you resampled your wav files)
    ...
    "n_speakers": 0, // number of speakers in your dataset if you use the multi-speaker setting
    "cleaned_text": true // set to true if you have already cleaned your text (see text_phonemizer.ipynb)
  },
  ...
}
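In the upstream VITS repo, each filelist line pairs an audio path with its transcript ("wav_path|text", or "wav_path|speaker_id|text" for multi-speaker data); text_split.ipynb shows the exact layout used here. As an optional check before training, a small sketch like the one below (file names are illustrative) verifies that the audio paths in a cleaned filelist actually exist:

# minimal sketch: verify that every audio path in a cleaned filelist exists
# assumes the upstream VITS layout: "wav_path|text" or "wav_path|speaker_id|text"
from pathlib import Path

filelist = "filelists/custom_audio_text_train_filelist.txt.cleaned"

missing = []
with open(filelist, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        wav_path = line.rstrip("\n").split("|")[0]  # the audio path is the first field
        if not Path(wav_path).is_file():
            missing.append((line_no, wav_path))

print(f"{len(missing)} missing audio file(s)")
for line_no, wav_path in missing[:10]:
    print(f"  line {line_no}: {wav_path}")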
- install espeak-ng (optional)
NOTE: This is required for preprocess.py and the inference.ipynb notebook to work. If you don't need them, you can skip this step. Please refer to espeak-ng.
- preprocess text
You can do this step by step:
- create a dataset of text files. See text_dataset.ipynb
- phonemize or just clean up the text. Please refer to text_phonemizer.ipynb (a minimal phonemizer sketch also follows this list)
- create filelists and their cleaned versions with a train/test split. See text_split.ipynb
- rename or create a link to the dataset folder. Please refer to text_split.ipynb
ln -s /path/to/custom_dataset DUMMY3
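As a reference for the phonemization step above, here is a minimal sketch using the phonemizer package with the espeak-ng backend; text_phonemizer.ipynb remains the authoritative pipeline, and the example sentence is arbitrary.

# minimal sketch: convert raw text to IPA phonemes with phonemizer + espeak-ng
from phonemizer import phonemize

text = "Printing, in the only sense with which we are at present concerned."
phones = phonemize(
    text,
    language="en-us",           # espeak-ng language code
    backend="espeak",           # requires espeak-ng to be installed (see above)
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phones)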
# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base
# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base
# Custom dataset (multi-speaker)
python train_ms.py -c configs/custom_base.json -m custom_base
See inference.ipynb. For inference on multiple sentences, see inference_batch.ipynb.
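For a quick reference, the sketch below follows the upstream VITS inference notebook; the module and function names (utils.get_hparams_from_file, SynthesizerTrn, text_to_sequence, commons.intersperse) come from the original repo and may differ slightly in this fork, so treat inference.ipynb as the source of truth.

# minimal single-sentence inference sketch, adapted from the upstream VITS notebook
# (names follow the original repo; check inference.ipynb for this fork's exact API)
import torch

import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols


def get_text(text, hps):
    seq = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        seq = commons.intersperse(seq, 0)  # insert a blank token between symbols
    return torch.LongTensor(seq)


hps = utils.get_hparams_from_file("configs/ljs_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model,
).eval()
utils.load_checkpoint("/path/to/pretrained_ljs.pth", net_g, None)  # placeholder checkpoint path

x = get_text("VITS is awesome!", hps).unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)])
with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0)[0][0, 0]
# audio is a 1-D float tensor at hps.data.sampling_rate, ready to save with e.g. torchaudio.save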
We also provide the pretrained models.
- text preprocessing
- update cleaners for multi-language support with 100+ languages
- update vocabulary to support all symbols and features from IPA. See phonemes.md
- handle unknown, out-of-vocabulary symbols. Please refer to vocab.py and vocab - TorchText (a minimal sketch follows this group)
- remove cleaners from text preprocessing. Most cleaners are already implemented in phonemizer. See cleaners.py
- remove the need for speaker indexing. See vits/issues/58
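As a sketch of the out-of-vocabulary handling referenced above, torchtext's vocab can fall back to an <unk> index for unseen symbols; the symbol lists below are illustrative, and vocab.py is the reference.

# minimal sketch: map phoneme symbols to ids with an <unk> fallback via torchtext
from torchtext.vocab import build_vocab_from_iterator

phoneme_sequences = [["h", "ə", "l", "oʊ"], ["w", "ɜː", "l", "d"]]  # illustrative data

vocab = build_vocab_from_iterator(phoneme_sequences, specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])  # unknown symbols map to <unk> instead of raising

print(vocab(["h", "ə", "?"]))  # "?" is out of vocabulary -> resolves to the <unk> index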
- audio preprocessing
- batch audio resampling. Please refer to audio_resample.ipynb
- code snippets to find corrupted files in the dataset. Please refer to audio_find_corrupted.ipynb
- code snippets to delete files by extension in the dataset. Please refer to delete_by_ext.ipynb
- replace scipy and librosa dependencies with torchaudio. See load and MelScale docs (a minimal sketch follows this group)
- automatic audio range normalization. Please refer to Loading audio data - Torchaudio docs
- add support for stereo audio (multi-channel). See Loading audio data - Torchaudio docs
- add support for various audio bit depths (bits per sample). See load - Torchaudio docs
- add support for various sample rates. Please refer to load - Torchaudio docs
- test stereo audio (multi-channel) training
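A minimal sketch of the torchaudio-based loading and mel-feature path referenced in this group (MelSpectrogram combines a spectrogram with MelScale; the parameter values are illustrative, and the repo's configs/*.json are authoritative):

# minimal sketch: torchaudio-based loading and mel features (librosa/scipy replacement)
import torchaudio
from torchaudio.transforms import MelSpectrogram

wav_path = "DUMMY1/LJ001-0001.wav"  # placeholder path

# torchaudio.load normalizes integer PCM of any bit depth to float32 in [-1, 1] by default
waveform, sr = torchaudio.load(wav_path)  # shape: (channels, samples); stereo keeps 2 channels

mel = MelSpectrogram(
    sample_rate=sr,
    n_fft=1024,       # filter_length in the config
    hop_length=256,   # hop_length in the config
    win_length=1024,  # win_length in the config
    n_mels=80,        # n_mel_channels in the config
)(waveform)
print(mel.shape)  # (channels, n_mels, frames)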
- filelists preprocessing
- add filelists preprocessing for multi-speaker. Please refer to text_split.ipynb
- code snippets for train/test split. Please refer to text_split.ipynb
- notebook to link filelists with the actual wavs. Please refer to text_split.ipynb
- other
- rewrite code for python 3.11
- replace Cython Monotonic Alignment Search with numba.jit. See vits-finetuning
- updated inference to support batch processing
- pretrained models
- training the model for the Bengali language (currently 55,000 iterations, ~26 epochs)
- add pretrained models for multiple languages
- future work
- update the model to naturalspeech. Please refer to naturalspeech
- add support for streaming. Please refer to vits_chinese
- update naturalspeech to multi-speaker
- replace speakers with multi-speaker embeddings
- replace speakers with multilingual training. Each speaker is a language with the same IPA symbols
- add support for in-context learning
- This repo is based on VITS
- The text-to-phones converter for multiple languages is based on phonemizer
- We also thank ChatGPT for providing writing assistance.