Unofficial PyTorch implementation of Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech.
Audio samples are available on the project demo page.
I use Identity as the shortcut connection (instead of Linear) in the residual blocks and omit biases, so this implementation has slightly fewer parameters than reported in the paper (1.52M vs. 1.91M).
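For illustration, a minimal sketch of such a residual block, assuming a MelGAN-style stack of dilated 1-D convolutions (the channel count, kernel sizes, and activation below are illustrative assumptions, not necessarily this repo's exact layers):

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=dilation, padding=dilation, bias=False),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=1, bias=False),
        )
        # Identity shortcut: no learned projection, so no extra parameters
        self.shortcut = nn.Identity()

    def forward(self, x):
        return self.shortcut(x) + self.block(x)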
The cutoff ratio of the pseudo-quadrature mirror filter bank (PQMF) can be set to a specific value or to None. In the latter case, an optimal filter is synthesized automatically before training starts.
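One plausible way such a synthesis could work, sketched below under stated assumptions (a Kaiser-windowed firwin prototype; the tap count, beta, and band count are illustrative, not necessarily this repo's procedure): near-perfect reconstruction requires the prototype's autocorrelation to vanish at nonzero multiples of twice the number of bands, so the cutoff ratio can be chosen to minimize the worst such term.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.signal import firwin

TAPS, BETA, BANDS = 62, 9.0, 4  # illustrative PQMF settings

def reconstruction_error(cutoff_ratio):
    # Kaiser-windowed lowpass prototype (cutoff relative to Nyquist)
    h = firwin(TAPS + 1, cutoff_ratio, window=("kaiser", BETA))
    r = np.convolve(h, h[::-1])  # autocorrelation of the prototype
    center = len(r) // 2
    # near-perfect reconstruction: autocorrelation ~ 0 at nonzero
    # multiples of 2 * BANDS
    return np.abs(r[center + 2 * BANDS::2 * BANDS]).max()

# bracket the search around the theoretical cutoff 1 / (2 * BANDS)
res = minimize_scalar(reconstruction_error,
                      bounds=(0.6 / (2 * BANDS), 1.6 / (2 * BANDS)),
                      method="bounded")
print(f"optimal cutoff ratio: {res.x:.4f}")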
To start training for, say, 500K iterations, run the command:
train.py -l log -c config/mb_train.yaml -i 500000
To continue training from the last saved checkpoint for another 500K iterations, run the command:
train.py -l log -i 500000
Training results are written to the log folder and can be monitored with TensorBoard (tensorboard --logdir log).
A pretrained multi-band vocoder (config and weights) can be downloaded here. The model was trained for 500K iterations on the LJSpeech dataset.
import sounddevice as sd
import librosa
import torch
import yaml

# from_config is defined in this repository; adjust the import path to your checkout
from mb_melgan import from_config
config_path = "models/melgan.yaml"
model_path = "models/melgan.pt"
with open(config_path, "r") as f:
    cfg = yaml.load(f, Loader=yaml.FullLoader)
sr = cfg["data"]["sample_rate"]
vocoder = from_config(cfg)
vocoder.G.load_state_dict(torch.load(model_path, map_location="cpu"))
vocoder.G.eval()  # switch the generator to inference mode
# out-of-distribution sample (female)
x = torch.from_numpy(librosa.load(librosa.example("libri3"), sr=sr)[0])
# wav-to-mel
y = vocoder.encode(x)
with torch.no_grad():
    # mel-to-wav
    x_hat = vocoder.decode(y)
# play the restored wav (decode is assumed to return a CPU tensor)
sd.play(x_hat.squeeze().numpy(), sr, blocking=True)
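If no audio device is available, the restored waveform can be saved to disk instead; a small alternative ending using the soundfile package (the output file name is arbitrary):

import soundfile as sf
sf.write("restored.wav", x_hat.squeeze().numpy(), sr)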