Aasthaengg/Text2SpeechSynthesis-IndicLanguages

Text to Speech Synthesis for Indic Languages using Tacotron2

About

This repository contains the steps to train NVIDIA/tacotron2 on a multi-speaker Hindi language dataset.

Demo

You can play the demo here.

Index

  1. Pre-requisites
  2. Setup
  3. Dataset
  4. Data Preprocessing
  5. Training
  6. Inference
  7. Related repos
  8. Acknowledgements

1. Pre-requisites

  • NVIDIA GPU
  • NVIDIA CUDA installation. More on it here.

2. Setup

2.1 PyTorch from source

Clone the PyTorch repository

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive --jobs 0

Build and install

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop

2.2 Apex

Apex is used for mixed-precision and distributed training.

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

2.3 Other Python requirements

The remaining Python dependencies are listed in requirements.txt.

pip3 install -r requirements.txt

3. Dataset

This version of Tacotron 2 has been trained on two different datasets. You can use either one.

3.1 About OpenSLR Dataset

  • This dataset is a high-quality Hindi multi-speaker speech dataset from OpenSLR
  • The Hindi speech dataset is split into train and test sets with 95.05 hours and 5.55 hours of audio respectively
  • There are 4506 and 386 unique sentences taken from Hindi stories in the train and test sets, respectively, with no overlap of sentences. The train set contains utterances from a set of 59 speakers, and the test set contains speakers from a disjoint set of 19 speakers
  • The audio files are sampled at 8kHz, 16-bit encoding. The total vocabulary size of the train and test set is 6542

3.2 About IIIT-Hyderabad Dataset

  • This dataset consists of single-speaker samples from IIIT-Hyd
  • There are 9368 samples available for training
  • The data has a sampling rate of 48kHz

3.3 Download and Extract OpenSLR Dataset

# Download train dataset
wget https://www.openslr.org/resources/103/Hindi_train.tar.gz

# Extract train dataset
tar xvf Hindi_train.tar.gz

# Download test dataset
wget https://www.openslr.org/resources/103/Hindi_test.tar.gz

# Extract test dataset
tar xvf Hindi_test.tar.gz

# Copy the data to dataset folder
mkdir -p HindiDataset
mv train HindiDataset/
mv test HindiDataset/

3.4 Download and Extract IIIT-Hyd Dataset

Request the dataset here.

# Copy the data to dataset folder
mkdir -p HindiDataset
mv Dataset HindiDataset/train_raw

4. Data Preprocessing

4.1 OpenSLR

The OpenSLR data includes a transcription.txt file, which needs to be converted into a format compatible with Tacotron 2 training.

4.1.1 Upsample to 22050 Hz

python3 upsampler.py HindiDataset/train/audio/
python3 upsampler.py HindiDataset/test/audio/
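
The internals of upsampler.py are not reproduced here. As a rough illustration of what moving 8 kHz audio to 22050 Hz involves, the sketch below resamples a list of mono samples by linear interpolation. The function name `resample_linear` is hypothetical, and a production resampler would normally use a band-limited filter instead of plain interpolation.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if not samples:
        return []
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * src_rate / dst_rate
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append((1.0 - frac) * samples[left] + frac * samples[right])
    return out


# One second of 8 kHz audio becomes 22050 samples at the target rate.
one_second = [0.0] * 8000
upsampled = resample_linear(one_second, 8000, 22050)
```

Upsampling changes only the sample rate: the result still carries no content above the original 4 kHz Nyquist limit.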

4.1.2 Creating train and test text files

Run filelist_creator.py to create text files for training

python3 filelist_creator.py HindiDataset/
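
For orientation, Tacotron 2 filelists hold one `audio_path|transcript` entry per line. The sketch below shows the kind of conversion involved, assuming each transcription.txt line is an utterance ID followed by whitespace and the sentence, and that audio files are named `<id>.wav`. Both are assumptions about the data layout, and `build_filelist` is a hypothetical helper, not the code in filelist_creator.py.

```python
def build_filelist(transcription_lines, audio_dir):
    """Turn '<utt_id> <text>' lines into '<audio_dir>/<utt_id>.wav|<text>' entries."""
    entries = []
    for line in transcription_lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank lines and IDs that have no transcript
        utt_id, text = parts
        entries.append(f"{audio_dir.rstrip('/')}/{utt_id}.wav|{text}")
    return entries


# Hypothetical utterance ID, used only to show the output shape.
sample = ["utt_0001 नमस्ते दुनिया\n"]
print("\n".join(build_filelist(sample, "HindiDataset/train/audio")))
```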

4.1.3 Update hparams.py

Open hparams.py and set training_files and validation_files as follows:

training_files='filelists/openslr_hindi_train.txt',
validation_files='filelists/openslr_hindi_test.txt',

4.2 IIIT-Hyd

This dataset provides an annotations.csv file, which needs to be converted into a format compatible with Tacotron 2 training.

4.2.1 Downsample the data to 22050 Hz

mkdir -p HindiDataset/train
python3 format_changer.py HindiDataset/train_raw/ HindiDataset/train/
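
Going from 48 kHz to 22050 Hz is a non-integer rate change. As a small worked example (not taken from format_changer.py), the ratio reduces to integer up/down factors of the kind a polyphase resampler such as `scipy.signal.resample_poly(x, up, down)` expects. The helper name `resample_factors` is hypothetical.

```python
from fractions import Fraction


def resample_factors(src_rate, dst_rate):
    """Return the smallest integer (up, down) pair with dst_rate / src_rate == up / down."""
    ratio = Fraction(dst_rate, src_rate)
    return ratio.numerator, ratio.denominator


up, down = resample_factors(48000, 22050)
print(up, down)  # 147 320
```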

4.2.2 Create train and test text files

cp annotations.csv filelists/iiit-hyd_hindi_train.txt
cp annotations.csv filelists/iiit-hyd_hindi_test.txt

4.2.3 Update hparams.py

Open hparams.py and set training_files and validation_files as follows:

training_files='filelists/iiit-hyd_hindi_train.txt',
validation_files='filelists/iiit-hyd_hindi_test.txt',

5. Training

5.1 Training using a pre-trained model (recommended)

Training from a pre-trained model can lead to faster convergence. By default, the dataset-dependent text embedding layers are ignored.

Download NVIDIA pre-trained Tacotron 2 model

python3 train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
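
To make "ignored" concrete: a warm start typically drops the ignored layers from the checkpoint and keeps the freshly initialised weights for them, since a text embedding sized for one symbol set generally does not fit another. The sketch below models this with plain dictionaries standing in for state dicts. It assumes this repository keeps the upstream NVIDIA/tacotron2 default of ignoring `embedding.weight`, and `warm_start_state` is a hypothetical name.

```python
def warm_start_state(pretrained_state, model_state, ignore_layers):
    """Merge pre-trained weights into a model state, skipping ignored layers."""
    # Drop the layers that should not be carried over from the checkpoint.
    kept = {k: v for k, v in pretrained_state.items() if k not in ignore_layers}
    # Start from the model's own (freshly initialised) weights, then overwrite.
    merged = dict(model_state)
    merged.update(kept)
    return merged


# Placeholder strings stand in for tensors.
pretrained = {"embedding.weight": "english-embedding", "decoder.weight": "pretrained"}
fresh = {"embedding.weight": "hindi-embedding", "decoder.weight": "random-init"}
state = warm_start_state(pretrained, fresh, ignore_layers=["embedding.weight"])
```

Here `state` keeps the Hindi embedding from the new model while taking the decoder weights from the checkpoint.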

5.2 Training from scratch

You can train a model from scratch

python3 train.py --output_directory=outdir --log_directory=logdir

5.3 Multi-GPU (distributed) and Automatic Mixed Precision Training

python3 -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True

5.4 View Tensorboard

Training progress and attention alignment can be monitored with TensorBoard.

tensorboard --logdir=outdir/logdir

6. Inference demo

6.1 Download pre-trained models

  1. Download waveglow from here
  2. Download pre-trained openslr hindi from here
  3. Download pre-trained iiit-hyd hindi from here

N.b. When performing Mel-Spectrogram to Audio synthesis, make sure Tacotron 2 and the Mel decoder were trained on the same mel-spectrogram representation.
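
One way to check this before synthesis is to compare the mel-related settings of the two models. The sketch below is a minimal example under stated assumptions: the key names follow the upstream Tacotron 2 hyperparameters and WaveGlow data config, the values shown are the upstream defaults (not necessarily those of the checkpoints linked above), and `mel_mismatches` is a hypothetical helper.

```python
MEL_KEYS = (
    "sampling_rate", "filter_length", "hop_length", "win_length",
    "n_mel_channels", "mel_fmin", "mel_fmax",
)


def mel_mismatches(acoustic_cfg, vocoder_cfg, keys=MEL_KEYS):
    """Return {key: (acoustic_value, vocoder_value)} for every setting that differs."""
    return {
        k: (acoustic_cfg.get(k), vocoder_cfg.get(k))
        for k in keys
        if acoustic_cfg.get(k) != vocoder_cfg.get(k)
    }


tacotron_cfg = {
    "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256,
    "win_length": 1024, "n_mel_channels": 80, "mel_fmin": 0.0, "mel_fmax": 8000.0,
}
vocoder_cfg = dict(tacotron_cfg, hop_length=512)  # deliberately mismatched
print(mel_mismatches(tacotron_cfg, vocoder_cfg))  # {'hop_length': (256, 512)}
```

An empty result means the listed settings agree. It does not guarantee that both models applied the same normalisation to the spectrograms.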

6.2 Run jupyter notebook

jupyter notebook --ip=127.0.0.1 --port=31337

Run inference.ipynb.

6.3 Multiple Vocoders

The vocoders supported in this repository are WaveGlow, MelGAN and HiFiGAN.

First install NeMo

git clone https://github.com/NVIDIA/NeMo.git
cd NeMo
./reinstall.sh

Use the inference.ipynb notebook to try the different vocoders.


7. Related repos

NVIDIA/tacotron2: the original work that inspired this repository.

WaveGlow: a faster-than-real-time flow-based generative network for speech synthesis.

nv-wavenet: a faster-than-real-time WaveNet.


8. Acknowledgements

This implementation uses code from the repositories of Keith Ito and Prem Seetharaman, as noted in our code.

We are inspired by Ryuichi Yamamoto's Tacotron PyTorch implementation.

We are thankful to the Tacotron 2 paper authors, especially Jonathan Shen, Yuxuan Wang and Zongheng Yang.
