ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech
This is an unofficial PyTorch implementation of ResGrad as a high-quality denoising model for Text to Speech. In short, this model generates the spectrogram using FastSpeech2 and then removes the noise in the spectrogram using the Diffusion method to synthesize high-quality speeches. As mentioned in the paper the implementation is based on FastSpeech2 and Grad-TTS. Also, the HiFiGAN model is used to generate waveforms from synthesized spectrograms.
Data structures:
dataset/data_name/synthesizer_data/
test_data/
speaker1/
sample1.txt
sample1.wav
...
...
train_data/
...
test.txt (sample1|speaker1|*phoneme_sequence \n ...)
train.txt (sample1|speaker1|*phoneme_sequence \n ...)
Preprocessing:
python synthesizer/prepare_align.py config/data_name/config.yaml
python synthesizer/preprocess.py config/data_name/config.yaml
Train synthesizer:
python train_synthesizer.py --config config/data_name/config.yaml
Prepare data for ResGrade:
python resgrad_data.py --synthesizer_restore_step 1000000 --data_file_path dataset/data_name/synthesizer_data/train.txt \
--config config/data_name/config.yaml
Train ResGrade:
python train_resgrad.py --config config/data_name/config.yaml
Inference:
python inference.py --text "phonemes sequence example" \
--synthesizer_restore_step 1000000 --regrad_restore_step 1000000 --vocoder_restore_step 2500000 \
--config config/data_name/config.yaml --result_dir output/data_name/results
- ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech, Z. Chen, et al.
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, Y. Ren, et al.
- Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech, V. Popov, et al.