All contributors #1 (Closed) · 8 commits
38 changes: 24 additions & 14 deletions README.md
@@ -22,11 +22,11 @@ If you are new, you can also find [here](http://www.erogol.com/text-speech-deep-
- Speaker Encoder to compute speaker embeddings efficiently.
- Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS)
- Support for multi-speaker TTS training.
- Support for multi-GPU training.
- Ability to convert Torch models to Tensorflow 2.0 for inference.
-- Released trained models.
-- Efficient training codes for PyTorch. (soon for Tensorflow 2.0)
-- Codes to convert Torch models to Tensorflow 2.0.
-- Detailed training analysis on console and Tensorboard.
+- Released pre-trained models.
+- Fast and efficient model training.
+- Detailed training logs on console and Tensorboard.
+- Tools to curate Text2Speech datasets under ```dataset_analysis```.
- Demo server for model testing.
- Notebooks for extensive model benchmarking.
@@ -50,6 +50,22 @@ Or you can use ```requirements.txt``` to install the requirements only.

```pip install -r requirements.txt```

### Directory Structure
```
|- TTS/
|  |- train.py (train your TTS model)
|  |- distribute.py (train your TTS model on multiple GPUs)
|  |- config.json (TTS model configuration file)
|  |- tf/ (Tensorflow 2 utilities and model implementations)
|  |- layers/ (model layer definitions)
|  |- models/ (model definitions)
|  |- notebooks/ (Jupyter notebooks for model evaluation and parameter selection)
|  |- data_analysis/ (TTS dataset analysis tools and notebooks)
|  |- utils/ (TTS utilities: IO, visualization, data processing, etc.)
|  |- speaker_encoder/ (speaker encoder implementation, with the same folder structure)
|  |- vocoder/ (vocoder implementations, with the same folder structure)
```
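For reference, single-GPU training is typically launched as ```python train.py --config_path config.json```, and multi-GPU training via ```distribute.py``` with the same flag; check each script's ```--help``` for the authoritative list of options.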

### Docker
A bare-bones `Dockerfile` exists at the root of the project and should let you set up the environment quickly. By default, it starts the server and lets you query it. Make sure to use `nvidia-docker` so that your GPUs are available, and follow the instructions in the [`server README`](server/README.md) before you build your image so that the server can find the model within the image.
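For example (the image name is illustrative, and the demo server conventionally listens on port 5002), something like `docker build -t mozilla-tts .` followed by `nvidia-docker run -it --rm -p 5002:5002 mozilla-tts` should give you a queryable server.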

@@ -87,7 +103,7 @@ Audio length is approximately 6 secs.


## Datasets and Data-Loading
-TTS provides a generic dataloder easy to use for new datasets. You need to write an preprocessor function to integrate your own dataset.Check ```datasets/preprocess.py``` to see some examples. After the function, you need to set ```dataset``` field in ```config.json```. Do not forget other data related fields too.
+TTS provides a generic dataloader that is easy to use for new datasets. You need to write a preprocessor function to integrate your own dataset; check ```datasets/preprocess.py``` for examples (a minimal sketch follows below). After writing the function, set the ```dataset``` field in ```config.json```, and do not forget the other data-related fields.
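For reference, here is a minimal preprocessor sketch; the ```my_dataset``` name and the ```wav_id|transcript``` metadata layout are hypothetical, while the returned ```[text, wav_file, speaker_name]``` items mirror the existing functions in ```datasets/preprocess.py```:

```
import os


def my_dataset(root_path, meta_file):
    # Hypothetical example: metadata file with one `wav_id|transcript` per line.
    items = []
    speaker_name = "my_speaker"
    with open(os.path.join(root_path, meta_file), 'r') as f:
        for line in f:
            cols = line.strip().split('|')
            wav_file = os.path.join(root_path, 'wavs', cols[0] + '.wav')
            text = cols[1]
            items.append([text, wav_file, speaker_name])
    return items
```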

Some of the open-source datasets to which we have successfully applied TTS are linked below.

@@ -150,15 +166,8 @@ If you like to use TTS to try a new idea and like to share your experiments with

## [Contact/Getting Help](https://github.com/mozilla/TTS/wiki/Contact-and-Getting-Help)

## Major TODOs
- [x] Implement the model.
- [x] Generate human-like speech on LJSpeech dataset.
- [x] Generate human-like speech on a different dataset (Nancy) (TWEB).
- [x] Train TTS with r=1 successfully.
- [x] Enable process-based distributed training. Similar to (https://github.com/fastai/imagenet-fast/).
- [x] Adapting Neural Vocoder. TTS works with WaveRNN and ParallelWaveGAN (https://github.com/erogol/WaveRNN and https://github.com/erogol/ParallelWaveGAN).
- [ ] Multi-speaker embedding.
- [ ] Model optimization (model export, model pruning, etc.)
## Contributors


<!--## References
- [Efficient Neural Audio Synthesis](https://arxiv.org/pdf/1802.08435.pdf)
@@ -169,6 +178,7 @@ If you like to use TTS to try a new idea and like to share your experiments with
- [WaveRNN](https://arxiv.org/pdf/1802.08435.pdf)
- [Faster WaveNet](https://arxiv.org/abs/1611.09482)
- [Parallel WaveNet](https://arxiv.org/abs/1711.10433)

-->

### References
2 changes: 1 addition & 1 deletion vocoder/layers/melgan.py
@@ -21,7 +21,7 @@ def __init__(self, channels, num_res_blocks, kernel_size):
nn.Conv1d(channels,
channels,
kernel_size=kernel_size,
-                          dilation=layer_padding,
+                          dilation=layer_dilation,
bias=True)),
nn.LeakyReLU(0.2),
weight_norm(
2 changes: 0 additions & 2 deletions vocoder/layers/pqmf.py
@@ -1,5 +1,3 @@
"""Pseudo QMF modules."""

import numpy as np
import torch
import torch.nn.functional as F
14 changes: 7 additions & 7 deletions vocoder/models/melgan_generator.py
@@ -77,16 +77,16 @@ def __init__(self,
]
self.layers = nn.Sequential(*layers)

-    def forward(self, cond_features):
-        return self.layers(cond_features)
+    def forward(self, c):
+        return self.layers(c)

-    def inference(self, cond_features):
-        cond_features = cond_features.to(self.layers[1].weight.device)
-        cond_features = torch.nn.functional.pad(
-            cond_features,
+    def inference(self, c):
+        c = c.to(self.layers[1].weight.device)
+        c = torch.nn.functional.pad(
+            c,
             (self.inference_padding, self.inference_padding),
             'replicate')
-        return self.layers(cond_features)
+        return self.layers(c)

def remove_weight_norm(self):
for _, layer in enumerate(self.layers):
2 changes: 1 addition & 1 deletion vocoder/tests/test_melgan_discriminator.py
@@ -20,7 +20,7 @@ def test_melgan_multi_scale_discriminator():
scores, feats = model(dummy_input)
assert len(scores) == 3
assert len(scores) == len(feats)
-    assert np.all(scores[0].shape == (4, 1, 16))
+    assert np.all(scores[0].shape == (4, 1, 64))
assert np.all(feats[0][0].shape == (4, 16, 4096))
assert np.all(feats[0][1].shape == (4, 64, 1024))
assert np.all(feats[0][2].shape == (4, 256, 256))
112 changes: 112 additions & 0 deletions vocoder/tf/convert_melgan_torch_to_tf.py
@@ -0,0 +1,112 @@
import argparse
import os

import numpy as np
import tensorflow as tf
import torch
from fuzzywuzzy import fuzz

from TTS.utils.io import load_config
from TTS.vocoder.tf.utils.convert_torch_to_tf_utils import (
compare_torch_tf, convert_tf_name, transfer_weights_torch_to_tf)
from TTS.vocoder.tf.utils.generic_utils import \
setup_generator as setup_tf_generator
from TTS.vocoder.tf.utils.io import save_checkpoint
from TTS.vocoder.utils.generic_utils import setup_generator

# prevent GPU use
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# define args
parser = argparse.ArgumentParser()
parser.add_argument('--torch_model_path',
type=str,
help='Path to target torch model to be converted to TF.')
parser.add_argument('--config_path',
type=str,
help='Path to config file of torch model.')
parser.add_argument(
'--output_path',
type=str,
help='path to output file including file name to save TF model.')
args = parser.parse_args()

# load model config
config_path = args.config_path
c = load_config(config_path)
num_speakers = 0

# init torch model
model = setup_generator(c)
checkpoint = torch.load(args.torch_model_path,
map_location=torch.device('cpu'))
state_dict = checkpoint['model']
model.load_state_dict(state_dict)
model.remove_weight_norm()
state_dict = model.state_dict()

# init tf model
model_tf = setup_tf_generator(c)

common_suffix = '/.ATTRIBUTES/VARIABLE_VALUE'
# get tf_model graph by passing an input
# B x D x T
dummy_input = tf.random.uniform((7, 80, 64), dtype=tf.float32)
mel_pred = model_tf(dummy_input, training=False)

# get tf variables
tf_vars = model_tf.weights

# match variable names with fuzzy logic
torch_var_names = list(state_dict.keys())
tf_var_names = [we.name for we in model_tf.weights]
var_map = []
for tf_name in tf_var_names:
# skip re-mapped layer names
if tf_name in [name[0] for name in var_map]:
continue
tf_name_edited = convert_tf_name(tf_name)
ratios = [
fuzz.ratio(torch_name, tf_name_edited)
for torch_name in torch_var_names
]
max_idx = np.argmax(ratios)
matching_name = torch_var_names[max_idx]
del torch_var_names[max_idx]
var_map.append((tf_name, matching_name))

# pass weights
tf_vars = transfer_weights_torch_to_tf(tf_vars, dict(var_map), state_dict)

# Compare TF and TORCH models
# check intermediate layer outputs
model.eval()
dummy_input_torch = torch.ones((1, 80, 10))
dummy_input_tf = tf.convert_to_tensor(dummy_input_torch.numpy())
dummy_input_tf = tf.transpose(dummy_input_tf, perm=[0, 2, 1])
dummy_input_tf = tf.expand_dims(dummy_input_tf, 2)

out_torch = model.layers[0](dummy_input_torch)
out_tf = model_tf.model_layers[0](dummy_input_tf)
out_tf_ = tf.transpose(out_tf, perm=[0, 3, 2, 1])[:, :, 0, :]

assert compare_torch_tf(out_torch, out_tf_) < 1e-5

for i in range(1, len(model.layers)):
print(f"{i} -> {model.layers[i]} vs {model_tf.model_layers[i]}")
out_torch = model.layers[i](out_torch)
out_tf = model_tf.model_layers[i](out_tf)
out_tf_ = tf.transpose(out_tf, perm=[0, 3, 2, 1])[:, :, 0, :]
diff = compare_torch_tf(out_torch, out_tf_)
assert diff < 1e-5, diff

dummy_input_torch = torch.ones((1, 80, 10))
dummy_input_tf = tf.convert_to_tensor(dummy_input_torch.numpy())
output_torch = model.inference(dummy_input_torch)
output_tf = model_tf(dummy_input_tf, training=False)
assert compare_torch_tf(output_torch, output_tf) < 1e-5, compare_torch_tf(
output_torch, output_tf)
# save tf model
save_checkpoint(model_tf, checkpoint['step'], checkpoint['epoch'],
args.output_path)
print(' > Model conversion completed successfully :).')
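For reference, a hypothetical invocation of the script above (all file names are illustrative): ```python vocoder/tf/convert_melgan_torch_to_tf.py --torch_model_path checkpoint.pth.tar --config_path config.json --output_path tf_model.pkl```.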
58 changes: 58 additions & 0 deletions vocoder/tf/layers/melgan.py
@@ -0,0 +1,58 @@
import tensorflow as tf


class ReflectionPad1d(tf.keras.layers.Layer):
def __init__(self, padding):
super(ReflectionPad1d, self).__init__()
self.padding = padding

    def call(self, x):
        # pad only the time axis (dim 1) of a B x T x 1 x C tensor
        return tf.pad(x, [[0, 0], [self.padding, self.padding], [0, 0], [0, 0]], "REFLECT")


class ResidualStack(tf.keras.layers.Layer):
def __init__(self, channels, num_res_blocks, kernel_size, name):
super(ResidualStack, self).__init__(name=name)

assert (kernel_size - 1) % 2 == 0, " [!] kernel_size has to be odd."
base_padding = (kernel_size - 1) // 2

self.blocks = []
num_layers = 2
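        # NOTE: 1D convolutions over time are emulated with Conv2D and
        # (kernel_size, 1) kernels applied to a B x T x 1 x C tensor.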
for idx in range(num_res_blocks):
layer_kernel_size = kernel_size
layer_dilation = layer_kernel_size**idx
layer_padding = base_padding * layer_dilation
block = [
tf.keras.layers.LeakyReLU(0.2),
ReflectionPad1d(layer_padding),
tf.keras.layers.Conv2D(filters=channels,
kernel_size=(kernel_size, 1),
dilation_rate=(layer_dilation, 1),
use_bias=True,
padding='valid',
name=f'blocks.{idx}.{num_layers}'),
tf.keras.layers.LeakyReLU(0.2),
tf.keras.layers.Conv2D(filters=channels,
kernel_size=(1, 1),
use_bias=True,
name=f'blocks.{idx}.{num_layers + 2}')
]
self.blocks.append(block)
self.shortcuts = [
tf.keras.layers.Conv2D(channels,
kernel_size=1,
use_bias=True,
name=f'shortcuts.{i}')
for i in range(num_res_blocks)
]

def call(self, x):
for block, shortcut in zip(self.blocks, self.shortcuts):
res = shortcut(x)
for layer in block:
x = layer(x)
x += res
return x
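
A minimal shape check for the stack above (the sizes are hypothetical; the B x T x 1 x C layout follows from the padding and kernel shapes):

```
import tensorflow as tf

# assumes ResidualStack from vocoder/tf/layers/melgan.py above
stack = ResidualStack(channels=64, num_res_blocks=3, kernel_size=3,
                      name='resstack')
x = tf.random.uniform((1, 100, 1, 64))  # B x T x 1 x C
y = stack(x)
print(y.shape)  # (1, 100, 1, 64): time and channel sizes are preserved
```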
66 changes: 66 additions & 0 deletions vocoder/tf/layers/pqmf.py
@@ -0,0 +1,66 @@
import numpy as np
import tensorflow as tf

from scipy import signal as sig


class PQMF(tf.keras.layers.Layer):
def __init__(self, N=4, taps=62, cutoff=0.15, beta=9.0):
super(PQMF, self).__init__()
# define filter coefficient
self.N = N
self.taps = taps
self.cutoff = cutoff
self.beta = beta

QMF = sig.firwin(taps + 1, cutoff, window=('kaiser', beta))
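        # cosine-modulate the prototype low-pass filter into N analysis (H)
        # and synthesis (G) filter banks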
H = np.zeros((N, len(QMF)))
G = np.zeros((N, len(QMF)))
for k in range(N):
constant_factor = (2 * k + 1) * (np.pi /
(2 * N)) * (np.arange(taps + 1) -
((taps - 1) / 2))
phase = (-1)**k * np.pi / 4
H[k] = 2 * QMF * np.cos(constant_factor + phase)

G[k] = 2 * QMF * np.cos(constant_factor - phase)

# [N, 1, taps + 1] == [filter_width, in_channels, out_channels]
self.H = np.transpose(H[:, None, :], (2, 1, 0)).astype('float32')
self.G = np.transpose(G[None, :, :], (2, 1, 0)).astype('float32')

# filter for downsampling & upsampling
updown_filter = np.zeros((N, N, N), dtype=np.float32)
for k in range(N):
updown_filter[0, k, k] = 1.0
self.updown_filter = updown_filter.astype(np.float32)

def analysis(self, x):
"""
x : B x 1 x T
"""
x = tf.transpose(x, perm=[0, 2, 1])
x = tf.pad(x, [[0, 0], [self.taps // 2, self.taps // 2], [0, 0]], constant_values=0.0)
x = tf.nn.conv1d(x, self.H, stride=1, padding='VALID')
x = tf.nn.conv1d(x,
self.updown_filter,
stride=self.N,
padding='VALID')
x = tf.transpose(x, perm=[0, 2, 1])
return x

def synthesis(self, x):
"""
x : B x 1 x T
"""
x = tf.transpose(x, perm=[0, 2, 1])
x = tf.nn.conv1d_transpose(
x,
self.updown_filter * self.N,
strides=self.N,
output_shape=(tf.shape(x)[0], tf.shape(x)[1] * self.N,
self.N))
x = tf.pad(x, [[0, 0], [self.taps // 2, self.taps // 2], [0, 0]], constant_values=0.0)
x = tf.nn.conv1d(x, self.G, stride=1, padding="VALID")
x = tf.transpose(x, perm=[0, 2, 1])
return x
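
A hypothetical smoke test for the filterbank above (shapes follow the B x 1 x T convention from the docstrings):

```
import tensorflow as tf

# assumes PQMF from vocoder/tf/layers/pqmf.py above
pqmf = PQMF(N=4, taps=62, cutoff=0.15, beta=9.0)
audio = tf.random.uniform((1, 1, 4096))  # B x 1 x T
subbands = pqmf.analysis(audio)          # -> (1, 4, 1024), i.e. B x N x T // N
recon = pqmf.synthesis(subbands)         # -> (1, 1, 4096), i.e. B x 1 x T
print(subbands.shape, recon.shape)
```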