- A PyTorch implementation of end-to-end speech synthesis using a Transformer network.
- This model can be trained about 3 to 4 times faster than most autoregressive models, since the Transformer computes all time steps of a training sequence in parallel.
- We trained the post network using the CBHG (Convolutional Bank + Highway network + GRU) module from Tacotron and converted the spectrogram into a raw waveform using the Griffin-Lim algorithm. In the future, we want to use a pre-trained HiFi-GAN vocoder to generate raw audio.
Text-to-Speech/
│
├── README.md
├── Text-to-Speech-Audio-Generation.ipynb
├── Text-to-Speech-Training-Postnet.ipynb
├── Text-to-Speech-Training-Transformer.ipynb
├── hyperparams.py
├── module.py
├── network.py
├── prepare_data.ipynb
├── prepare_data.py
├── preprocess.py
├── requirements.txt
├── synthesis.py
├── train_postnet.py
├── train_transformer.py
├── utils.py
│
├── __pycache__/
│   ├── hyperparams.cpython-311.pyc
│   └── utils.cpython-311.pyc
│
├── png/
│   ├── alphas.png
│   ├── attention.gif
│   ├── attention_encoder.gif
│   ├── attention_decoder.gif
│   ├── model.png
│   ├── test_loss_per_epoch.png
│   ├── training_loss.png
│   └── training_loss_per_epoch.png
│
└── text/
    ├── __init__.py
    ├── cleaners.py
    ├── cmudict.py
    ├── numbers.py
    └── symbols.py
- Install python==3.11.10
- Install requirements:
pip install -r requirements.txt
- We used the LJSpeech dataset (LJSpeech-1.1), a speech dataset that consists of pairs of text transcripts and short audio clips (WAV files) from a single speaker. The complete dataset (13,100 pairs) can be downloaded from either Kaggle or Keithito.
- This is the raw data, which is prepared further before training.
- You can download the pretrained model checkpoints from Checkpoints (50k for Transformer model / 45k for Postnet)
- You can load the checkpoints for the respective models.
- The attention plots show the multi-head attention of all layers; num_heads=4 is used for each of the three attention layers.
- Only a few heads showed diagonal alignment. A diagonal pattern in an attention plot typically suggests that the model is learning to align the tokens in a sequence effectively.
- We used Noam-style warmup and decay, a learning rate schedule commonly used to train deep learning models, particularly Transformers (it was introduced in the "Attention Is All You Need" paper).
- The image below shows the alphas of the scaled positional encoding. The encoder alpha is constant for roughly the first 15k steps and then increases for the rest of training. The decoder alpha decreases slightly for the first 2k steps and is then almost constant for the rest of training.
- We did not use the stop token in this implementation, since the model failed to train when it was used.
- For the Transformer model, it is very important to concatenate the input and context vectors so that the attention mechanism is used correctly.
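The Noam-style warmup and decay mentioned in the notes above can be sketched as follows. This is a minimal illustration of the schedule as defined in the "Attention Is All You Need" paper, lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The function name `noam_lr` and the values `d_model=256` and `warmup_steps=4000` are assumptions chosen for illustration; the values actually used in `hyperparams.py` may differ.

```python
def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 4000) -> float:
    """Learning rate at a given training step under the Noam schedule.

    d_model and warmup_steps are illustrative defaults, not values
    taken from this project's hyperparams.py.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warmup: step * warmup_steps**-1.5 is the smaller term early on.
    # Inverse square-root decay: step**-0.5 is the smaller term afterwards.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate rises linearly until `warmup_steps`, peaks there, and then decays proportionally to the inverse square root of the step number. In a training loop, the returned value would typically be written into each of the optimizer's parameter groups before every update.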
Good Morning, Everyone!!
goodmorning.wav
She sells seashells on the seashore.
seashells.wav
Thank you so much Warren for all your support.
tywarren.wav
- `hyperparams.py` contains all the hyperparameters required in this project.
- `prepare_data.py` prepares the data by converting raw audio to mel and linear spectrograms for faster training. The scripts for preprocessing text data are in the `./text/` directory.
- `prepare_data.ipynb` is the notebook to run to prepare the data for training.
- `preprocess.py` contains all the methods for loading the dataset.
- `module.py` contains building blocks such as Encoder Prenet, Feed Forward Network (FFN), PostConvolutional Network, MultiHeadAttention, Attention, Prenet, and CBHG (Convolutional Bank + Highway network + GRU).
- `network.py` contains the Encoder, MelDecoder, Model, and Model Postnet networks.
- `train_transformer.py` contains the script for training the autoregressive attention network (text --> mel).
- `Text-to-Speech-Training-Transformer.ipynb` is the notebook to run to train the Transformer network.
- `train_postnet.py` contains the script for training the PostConvolutional network (mel --> linear).
- `Text-to-Speech-Training-Postnet.ipynb` is the notebook to run to train the PostConvolutional network.
- `synthesis.py` contains the script that generates audio samples with the trained Text-to-Speech model.
- `Text-to-Speech-Audio-Generation.ipynb` is the notebook to run to generate audio samples by loading the trained model checkpoints.
- `utils.py` contains the methods for detailed preprocessing, particularly for mel spectrograms and audio waveforms.
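As an illustration of the dataset-loading step described for `preprocess.py`, the sketch below reads the transcripts from the LJSpeech `metadata.csv`, which stores one clip per line as `id|transcription|normalized transcription`. The helper name `load_metadata` is hypothetical and this is not the project's actual loader; it only shows one way the text side of each (text, audio) pair might be read.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def load_metadata(data_path: str) -> List[Tuple[str, str]]:
    """Return (clip_id, normalized_text) pairs from LJSpeech metadata.csv.

    Hypothetical helper for illustration; not taken from preprocess.py.
    """
    pairs: List[Tuple[str, str]] = []
    metadata_file = Path(data_path) / "metadata.csv"
    with metadata_file.open(encoding="utf-8", newline="") as handle:
        # The file is pipe-delimited and unquoted, so quote characters
        # inside a transcript must be treated as ordinary text.
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 3:
                continue  # skip empty or malformed lines
            clip_id, normalized_text = row[0], row[2]
            pairs.append((clip_id, normalized_text))
    return pairs
```

Each returned `clip_id` corresponds to a file `wavs/<clip_id>.wav`, so the pairs can be joined with the audio (or its prepared spectrograms) when building training examples.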
- STEP 1. Download and extract the LJSpeech-1.1 data to any directory you want.
- STEP 2. Change these two paths in `hyperparams.py` according to your system paths for preparing data locally. Raw strings (the `r` prefix) keep Windows backslashes from being read as escape sequences such as `\t`.
    # For local use: (prepare_data.ipynb)
    data_path_used_for_prepare_data = r'your\path\to\LJSpeech-1.1'
    output_path_used_for_prepare_data = r'your\path\to\LJSpeech-1.1'
- STEP 3. Run `prepare_data.ipynb` after correctly assigning the paths.
- STEP 4. The prepared data will be stored in the form shown in the LJSpeech-1.1/ directory tree further below.
- The prepared data is also uploaded to Kaggle Datasets for direct use.
- STEP 1. To train the Transformer, adjust these paths in `hyperparams.py`.
    # General:
    data_path = r'your\path\to\LJSpeech-1.1'
    checkpoint_path = r'your\path\to\outputdir'
- STEP 2. Run `Text-to-Speech-Training-Transformer.ipynb` after correctly assigning the paths.
- STEP 1. To train the Postnet, adjust these paths in `hyperparams.py`.
    # General:
    data_path = r'your\path\to\LJSpeech-1.1'
    checkpoint_path = r'your\path\to\outputdir'
- STEP 2. Run `Text-to-Speech-Training-Postnet.ipynb` after correctly assigning the paths.
LJSpeech-1.1/
│
├── README.md
├── metadata.csv
├── wavs/
│   ├── LJ001-0001.wav
│   ├── LJ001-0001.mag.npy
│   ├── LJ001-0001.pt.npy
│   ├── LJ001-0002.wav
│   ├── LJ001-0002.mag.npy
│   ├── LJ001-0002.pt.npy
│   └── ...
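The layout above pairs every clip in `wavs/` with a `.mag.npy` and a `.pt.npy` file. Before training, it may be worth checking that the preparation step produced both companions for each clip. The sketch below is one way to do that with the standard library; the function name `find_unprepared_clips` is hypothetical and is not part of the project.

```python
from pathlib import Path
from typing import List


def find_unprepared_clips(data_path: str) -> List[str]:
    """List clip IDs in wavs/ that lack a .pt.npy or .mag.npy companion.

    Hypothetical helper for illustration; data_path is the LJSpeech-1.1
    directory configured in hyperparams.py.
    """
    wav_dir = Path(data_path) / "wavs"
    missing: List[str] = []
    for wav_file in sorted(wav_dir.glob("*.wav")):
        clip_id = wav_file.stem  # file name without the .wav suffix
        has_pt = (wav_dir / f"{clip_id}.pt.npy").exists()
        has_mag = (wav_dir / f"{clip_id}.mag.npy").exists()
        if not (has_pt and has_mag):
            missing.append(clip_id)
    return missing
```

An empty list means every clip has both prepared files; any IDs returned point at clips for which `prepare_data.ipynb` should be re-run.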
- STEP 1. Change the audio sample output path in `hyperparams.py`.
    sample_path = r'your\path\to\outputdir\of\samples'
- STEP 2. Run `Text-to-Speech-Audio-Generation.ipynb`, making sure to pass the correct arguments:
    --transformer_checkpoint your\path\to\checkpoint_transformer_50000.pth.tar --postnet_checkpoint your\path\to\checkpoint_postnet_45000.pth.tar --max_len 400 --text "Your Text Input"
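To make the role of each synthesis argument concrete, here is a minimal sketch of how the four flags listed above could be parsed with `argparse`. Only the flag names and the example value `400` come from this README; whether `synthesis.py` uses `argparse`, which flags are required, and the default for `--max_len` are all assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the synthesis flags (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="Generate speech from text.")
    parser.add_argument("--transformer_checkpoint", required=True,
                        help="Path to the trained Transformer checkpoint.")
    parser.add_argument("--postnet_checkpoint", required=True,
                        help="Path to the trained Postnet checkpoint.")
    # Assumed default; 400 is the value shown in the usage example.
    parser.add_argument("--max_len", type=int, default=400,
                        help="Maximum number of decoding steps.")
    parser.add_argument("--text", required=True,
                        help="Input text to synthesise.")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

Because the stop token is not used in this implementation, `--max_len` is what bounds the length of the generated audio, so longer input text may need a larger value.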
- We are grateful to CoC VJTI and the Project X programme.
- Special thanks to our mentor Warren Jacinto for mentoring and supporting us throughout.
- We are also thankful to all the Project X mentors for their input and advice on our project.