A PyTorch implementation of "Towards Achieving Robust Universal Neural Vocoding". Audio samples can be found here. A Colab demo can be found here. An accompanying Tacotron implementation is available at https://github.com/bshall/Tacotron.
Ensure you have Python 3.6 and PyTorch 1.7 or greater installed. Then install the package with:
pip install univoc
import torch
import soundfile as sf
from univoc import Vocoder
# download pretrained weights (and optionally move to GPU)
vocoder = Vocoder.from_pretrained(
    "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
).cuda()
# load log-Mel spectrogram from file or from tts (see https://github.com/bshall/Tacotron for example)
mel = ...
# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel)
# save output
sf.write("path/to/save.wav", wav, sr)
- Clone the repo:
git clone https://github.com/bshall/UniversalVocoding
cd ./UniversalVocoding
- Install requirements:
pip install -r requirements.txt
- Download and extract the LJ-Speech dataset:
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvjf LJSpeech-1.1.tar.bz2
- Download the train split here and extract it in the root directory of the repo.
- Extract Mel spectrograms and preprocess audio:
python preprocess.py in_dir=path/to/LJSpeech-1.1 out_dir=datasets/LJSpeech-1.1
- Train the model:
python train.py checkpoint_dir=ljspeech dataset_dir=datasets/LJSpeech-1.1
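Before running `preprocess.py`, it can help to confirm that the dataset extracted correctly. The following is a minimal sketch, not part of this repository: the helper name `check_ljspeech` is hypothetical, and it assumes the standard LJSpeech-1.1 layout (a `metadata.csv` file next to a `wavs/` directory).

```python
from pathlib import Path


def check_ljspeech(root):
    """Return (has_metadata, wav_count) for an extracted LJSpeech-1.1 directory.

    Hypothetical helper: assumes the standard layout with metadata.csv
    and a wavs/ subdirectory directly under `root`.
    """
    root = Path(root)
    has_metadata = (root / "metadata.csv").is_file()
    wav_dir = root / "wavs"
    # Count only .wav files directly inside wavs/; 0 if the directory is missing.
    wav_count = len(list(wav_dir.glob("*.wav"))) if wav_dir.is_dir() else 0
    return has_metadata, wav_count


# Example usage (the path is a placeholder):
# has_metadata, wav_count = check_ljspeech("path/to/LJSpeech-1.1")
# print(has_metadata, wav_count)
```

The full LJSpeech-1.1 release contains 13,100 clips, so a much smaller count usually points to an incomplete download or extraction.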
Pretrained weights for the 10-bit LJ-Speech model are available here.
- Trained on 16 kHz audio from a single speaker. For an older version trained on 102 different speakers from the ZeroSpeech 2019: TTS without T English dataset, click here.
- Uses an embedding layer instead of one-hot encoding.
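To illustrate the last point, the sketch below shows in plain Python why an embedding lookup can stand in for one-hot encoding followed by a linear projection: both select the same row of the weight table, but the lookup skips the multiplications by zero. This is an illustration only, not the repository's code. The table size assumes that "10-bit" means samples quantized to 2**10 = 1024 levels, and the embedding dimension is a hypothetical value chosen for readability. In PyTorch, the lookup corresponds to `torch.nn.Embedding`.

```python
import random

NUM_LEVELS = 2 ** 10  # assumption: 10-bit audio gives 1024 quantization levels
EMBED_DIM = 4         # hypothetical embedding size, kept small for readability

random.seed(0)
# Weight table with one row per quantization level.
table = [[random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
         for _ in range(NUM_LEVELS)]


def one_hot(index, size):
    """Build a one-hot vector of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def one_hot_project(index, weights):
    """Multiply a one-hot row vector by the weight table (the dense route)."""
    hot = one_hot(index, len(weights))
    dim = len(weights[0])
    return [sum(hot[i] * weights[i][j] for i in range(len(weights)))
            for j in range(dim)]


def embed(index, weights):
    """Return the row at `index` directly, as an embedding layer does."""
    return list(weights[index])


sample = 517  # an arbitrary quantized sample value in [0, NUM_LEVELS)
dense = one_hot_project(sample, table)
lookup = embed(sample, table)
```

Both routes yield the same vector; the dense route costs `NUM_LEVELS * EMBED_DIM` multiplications per sample, while the lookup is a single row read.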