All contributors #1 (Closed) · 8 commits
38 changes: 24 additions & 14 deletions README.md
@@ -22,11 +22,11 @@ If you are new, you can also find [here](http://www.erogol.com/text-speech-deep-
- Speaker Encoder to compute speaker embeddings efficiently.
- Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS)
- Support for multi-speaker TTS training.
- Support for multi-GPU training.
- Ability to convert Torch models to Tensorflow 2.0 for inference.
-- Released trained models.
-- Efficient training codes for PyTorch. (soon for Tensorflow 2.0)
-- Codes to convert Torch models to Tensorflow 2.0.
-- Detailed training analysis on console and Tensorboard.
+- Released pre-trained models.
+- Fast and efficient model training.
+- Detailed training logs on console and Tensorboard.
+- Tools to curate Text2Speech datasets under ```dataset_analysis```.
- Demo server for model testing.
- Notebooks for extensive model benchmarking.
@@ -50,6 +50,22 @@ Or you can use ```requirements.txt``` to install the requirements only.

```pip install -r requirements.txt```

### Directory Structure
```
|- TTS/
|  |- train.py (train your TTS model)
|  |- distribute.py (train your TTS model on multiple GPUs)
|  |- config.json (TTS model configuration file)
|  |- tf/ (Tensorflow 2 utilities and model implementations)
|  |- layers/ (model layer definitions)
|  |- models/ (model definitions)
|  |- notebooks/ (Jupyter notebooks for model evaluation and parameter selection)
|  |- data_analysis/ (TTS dataset analysis tools and notebooks)
|  |- utils/ (TTS utilities: IO, visualization, data processing, etc.)
|  |- speaker_encoder/ (speaker encoder implementation, with the same folder structure)
|  |- vocoder/ (vocoder implementations, with the same folder structure)
```
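For reference, single-GPU training is typically launched as ```python train.py --config_path config.json```, and multi-GPU training via ```distribute.py``` with the same flag; check each script's ```--help``` for the authoritative list of options.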

### Docker
A bare-bones `Dockerfile` exists at the root of the project and should let you set up the environment quickly. By default, it starts the server and lets you query it. Make sure to use `nvidia-docker` so that your GPUs are available, and follow the instructions in the [`server README`](server/README.md) before you build your image so that the server can find the model within the image.
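For example (the image name is illustrative, and the demo server conventionally listens on port 5002), something like `docker build -t mozilla-tts .` followed by `nvidia-docker run -it --rm -p 5002:5002 mozilla-tts` should give you a queryable server.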

@@ -87,7 +103,7 @@ Audio length is approximately 6 secs.


## Datasets and Data-Loading
-TTS provides a generic dataloder easy to use for new datasets. You need to write an preprocessor function to integrate your own dataset.Check ```datasets/preprocess.py``` to see some examples. After the function, you need to set ```dataset``` field in ```config.json```. Do not forget other data related fields too.
+TTS provides a generic dataloader that is easy to use for new datasets. You need to write a preprocessor function to integrate your own dataset; check ```datasets/preprocess.py``` for examples (a minimal sketch follows below). After writing the function, set the ```dataset``` field in ```config.json```, and do not forget the other data-related fields.
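For reference, here is a minimal preprocessor sketch; the ```my_dataset``` name and the ```wav_id|transcript``` metadata layout are hypothetical, while the returned ```[text, wav_file, speaker_name]``` items mirror the existing functions in ```datasets/preprocess.py```:

```
import os


def my_dataset(root_path, meta_file):
    # Hypothetical example: metadata file with one `wav_id|transcript` per line.
    items = []
    speaker_name = "my_speaker"
    with open(os.path.join(root_path, meta_file), 'r') as f:
        for line in f:
            cols = line.strip().split('|')
            wav_file = os.path.join(root_path, 'wavs', cols[0] + '.wav')
            text = cols[1]
            items.append([text, wav_file, speaker_name])
    return items
```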

Some of the open-source datasets to which we have successfully applied TTS are linked below.

@@ -150,15 +166,8 @@ If you like to use TTS to try a new idea and like to share your experiments with

## [Contact/Getting Help](https://github.com/mozilla/TTS/wiki/Contact-and-Getting-Help)

## Major TODOs
- [x] Implement the model.
- [x] Generate human-like speech on LJSpeech dataset.
- [x] Generate human-like speech on a different dataset (Nancy) (TWEB).
- [x] Train TTS with r=1 successfully.
- [x] Enable process-based distributed training. Similar to (https://github.com/fastai/imagenet-fast/).
- [x] Adapting Neural Vocoder. TTS works with WaveRNN and ParallelWaveGAN (https://github.com/erogol/WaveRNN and https://github.com/erogol/ParallelWaveGAN).
- [ ] Multi-speaker embedding.
- [ ] Model optimization (model export, model pruning, etc.)
## Contributors


<!--## References
- [Efficient Neural Audio Synthesis](https://arxiv.org/pdf/1802.08435.pdf)
@@ -169,6 +178,7 @@ If you like to use TTS to try a new idea and like to share your experiments with
- [WaveRNN](https://arxiv.org/pdf/1802.08435.pdf)
- [Faster WaveNet](https://arxiv.org/abs/1611.09482)
- [Parallel WaveNet](https://arxiv.org/abs/1711.10433)

-->

### References
2 changes: 1 addition & 1 deletion vocoder/layers/melgan.py
@@ -21,7 +21,7 @@ def __init__(self, channels, num_res_blocks, kernel_size):
nn.Conv1d(channels,
channels,
kernel_size=kernel_size,
-                          dilation=layer_padding,
+                          dilation=layer_dilation,
bias=True)),
nn.LeakyReLU(0.2),
weight_norm(
2 changes: 0 additions & 2 deletions vocoder/layers/pqmf.py
@@ -1,5 +1,3 @@
"""Pseudo QMF modules."""

import numpy as np
import torch
import torch.nn.functional as F
14 changes: 7 additions & 7 deletions vocoder/models/melgan_generator.py
@@ -77,16 +77,16 @@ def __init__(self,
]
self.layers = nn.Sequential(*layers)

-    def forward(self, cond_features):
-        return self.layers(cond_features)
+    def forward(self, c):
+        return self.layers(c)

-    def inference(self, cond_features):
-        cond_features = cond_features.to(self.layers[1].weight.device)
-        cond_features = torch.nn.functional.pad(
-            cond_features,
+    def inference(self, c):
+        c = c.to(self.layers[1].weight.device)
+        c = torch.nn.functional.pad(
+            c,
             (self.inference_padding, self.inference_padding),
             'replicate')
-        return self.layers(cond_features)
+        return self.layers(c)

def remove_weight_norm(self):
for _, layer in enumerate(self.layers):
2 changes: 1 addition & 1 deletion vocoder/tests/test_melgan_discriminator.py
@@ -20,7 +20,7 @@ def test_melgan_multi_scale_discriminator():
scores, feats = model(dummy_input)
assert len(scores) == 3
assert len(scores) == len(feats)
-    assert np.all(scores[0].shape == (4, 1, 16))
+    assert np.all(scores[0].shape == (4, 1, 64))
assert np.all(feats[0][0].shape == (4, 16, 4096))
assert np.all(feats[0][1].shape == (4, 64, 1024))
assert np.all(feats[0][2].shape == (4, 256, 256))
112 changes: 112 additions & 0 deletions vocoder/tf/convert_melgan_torch_to_tf.py
@@ -0,0 +1,112 @@
import argparse
import os

import numpy as np
import tensorflow as tf
import torch
from fuzzywuzzy import fuzz

from TTS.utils.io import load_config
from TTS.vocoder.tf.utils.convert_torch_to_tf_utils import (
compare_torch_tf, convert_tf_name, transfer_weights_torch_to_tf)
from TTS.vocoder.tf.utils.generic_utils import \
setup_generator as setup_tf_generator
from TTS.vocoder.tf.utils.io import save_checkpoint
from TTS.vocoder.utils.generic_utils import setup_generator

# prevent GPU use
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# define args
parser = argparse.ArgumentParser()
parser.add_argument('--torch_model_path',
type=str,
help='Path to target torch model to be converted to TF.')
parser.add_argument('--config_path',
type=str,
help='Path to config file of torch model.')
parser.add_argument(
'--output_path',
type=str,
help='path to output file including file name to save TF model.')
args = parser.parse_args()

# load model config
config_path = args.config_path
c = load_config(config_path)
num_speakers = 0

# init torch model
model = setup_generator(c)
checkpoint = torch.load(args.torch_model_path,
map_location=torch.device('cpu'))
state_dict = checkpoint['model']
model.load_state_dict(state_dict)
model.remove_weight_norm()
state_dict = model.state_dict()

# init tf model
model_tf = setup_tf_generator(c)

common_suffix = '/.ATTRIBUTES/VARIABLE_VALUE'
# get tf_model graph by passing an input
# B x D x T
dummy_input = tf.random.uniform((7, 80, 64), dtype=tf.float32)
mel_pred = model_tf(dummy_input, training=False)

# get tf variables
tf_vars = model_tf.weights

# match variable names with fuzzy logic
torch_var_names = list(state_dict.keys())
tf_var_names = [we.name for we in model_tf.weights]
var_map = []
for tf_name in tf_var_names:
# skip re-mapped layer names
if tf_name in [name[0] for name in var_map]:
continue
tf_name_edited = convert_tf_name(tf_name)
ratios = [
fuzz.ratio(torch_name, tf_name_edited)
for torch_name in torch_var_names
]
max_idx = np.argmax(ratios)
matching_name = torch_var_names[max_idx]
del torch_var_names[max_idx]
var_map.append((tf_name, matching_name))

# pass weights
tf_vars = transfer_weights_torch_to_tf(tf_vars, dict(var_map), state_dict)

# Compare TF and TORCH models
# check intermediate layer outputs
model.eval()
dummy_input_torch = torch.ones((1, 80, 10))
dummy_input_tf = tf.convert_to_tensor(dummy_input_torch.numpy())
dummy_input_tf = tf.transpose(dummy_input_tf, perm=[0, 2, 1])
dummy_input_tf = tf.expand_dims(dummy_input_tf, 2)

out_torch = model.layers[0](dummy_input_torch)
out_tf = model_tf.model_layers[0](dummy_input_tf)
out_tf_ = tf.transpose(out_tf, perm=[0, 3, 2, 1])[:, :, 0, :]

assert compare_torch_tf(out_torch, out_tf_) < 1e-5

for i in range(1, len(model.layers)):
print(f"{i} -> {model.layers[i]} vs {model_tf.model_layers[i]}")
out_torch = model.layers[i](out_torch)
out_tf = model_tf.model_layers[i](out_tf)
out_tf_ = tf.transpose(out_tf, perm=[0, 3, 2, 1])[:, :, 0, :]
diff = compare_torch_tf(out_torch, out_tf_)
assert diff < 1e-5, diff

dummy_input_torch = torch.ones((1, 80, 10))
dummy_input_tf = tf.convert_to_tensor(dummy_input_torch.numpy())
output_torch = model.inference(dummy_input_torch)
output_tf = model_tf(dummy_input_tf, training=False)
assert compare_torch_tf(output_torch, output_tf) < 1e-5, compare_torch_tf(
output_torch, output_tf)
# save tf model
save_checkpoint(model_tf, checkpoint['step'], checkpoint['epoch'],
args.output_path)
print(' > Model conversion completed successfully :).')
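For reference, a hypothetical invocation of the script above (all file names are illustrative): ```python vocoder/tf/convert_melgan_torch_to_tf.py --torch_model_path checkpoint.pth.tar --config_path config.json --output_path tf_model.pkl```.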
58 changes: 58 additions & 0 deletions vocoder/tf/layers/melgan.py
@@ -0,0 +1,58 @@
import tensorflow as tf


class ReflectionPad1d(tf.keras.layers.Layer):
def __init__(self, padding):
super(ReflectionPad1d, self).__init__()
self.padding = padding

    def call(self, x):
        # pad only the time axis (dim 1) of a B x T x 1 x C tensor
        return tf.pad(x, [[0, 0], [self.padding, self.padding], [0, 0], [0, 0]], "REFLECT")


class ResidualStack(tf.keras.layers.Layer):
def __init__(self, channels, num_res_blocks, kernel_size, name):
super(ResidualStack, self).__init__(name=name)

assert (kernel_size - 1) % 2 == 0, " [!] kernel_size has to be odd."
base_padding = (kernel_size - 1) // 2

self.blocks = []
num_layers = 2
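        # NOTE: 1D convolutions over time are emulated with Conv2D and
        # (kernel_size, 1) kernels applied to a B x T x 1 x C tensor.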
for idx in range(num_res_blocks):
layer_kernel_size = kernel_size
layer_dilation = layer_kernel_size**idx
layer_padding = base_padding * layer_dilation
block = [
tf.keras.layers.LeakyReLU(0.2),
ReflectionPad1d(layer_padding),
tf.keras.layers.Conv2D(filters=channels,
kernel_size=(kernel_size, 1),
dilation_rate=(layer_dilation, 1),
use_bias=True,
padding='valid',
name=f'blocks.{idx}.{num_layers}'),
tf.keras.layers.LeakyReLU(0.2),
tf.keras.layers.Conv2D(filters=channels,
kernel_size=(1, 1),
use_bias=True,
name=f'blocks.{idx}.{num_layers + 2}')
]
self.blocks.append(block)
self.shortcuts = [
tf.keras.layers.Conv2D(channels,
kernel_size=1,
use_bias=True,
name=f'shortcuts.{i}')
for i in range(num_res_blocks)
]

def call(self, x):
for block, shortcut in zip(self.blocks, self.shortcuts):
res = shortcut(x)
for layer in block:
x = layer(x)
x += res
return x
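
A minimal shape check for the stack above (the sizes are hypothetical; the B x T x 1 x C layout follows from the padding and kernel shapes):

```
import tensorflow as tf

# assumes ResidualStack from vocoder/tf/layers/melgan.py above
stack = ResidualStack(channels=64, num_res_blocks=3, kernel_size=3,
                      name='resstack')
x = tf.random.uniform((1, 100, 1, 64))  # B x T x 1 x C
y = stack(x)
print(y.shape)  # (1, 100, 1, 64): time and channel sizes are preserved
```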
66 changes: 66 additions & 0 deletions vocoder/tf/layers/pqmf.py
@@ -0,0 +1,66 @@
import numpy as np
import tensorflow as tf

from scipy import signal as sig


class PQMF(tf.keras.layers.Layer):
def __init__(self, N=4, taps=62, cutoff=0.15, beta=9.0):
super(PQMF, self).__init__()
# define filter coefficient
self.N = N
self.taps = taps
self.cutoff = cutoff
self.beta = beta

QMF = sig.firwin(taps + 1, cutoff, window=('kaiser', beta))
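        # cosine-modulate the prototype low-pass filter into N analysis (H)
        # and synthesis (G) filter banks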
H = np.zeros((N, len(QMF)))
G = np.zeros((N, len(QMF)))
for k in range(N):
constant_factor = (2 * k + 1) * (np.pi /
(2 * N)) * (np.arange(taps + 1) -
((taps - 1) / 2))
phase = (-1)**k * np.pi / 4
H[k] = 2 * QMF * np.cos(constant_factor + phase)

G[k] = 2 * QMF * np.cos(constant_factor - phase)

# [N, 1, taps + 1] == [filter_width, in_channels, out_channels]
self.H = np.transpose(H[:, None, :], (2, 1, 0)).astype('float32')
self.G = np.transpose(G[None, :, :], (2, 1, 0)).astype('float32')

# filter for downsampling & upsampling
updown_filter = np.zeros((N, N, N), dtype=np.float32)
for k in range(N):
updown_filter[0, k, k] = 1.0
self.updown_filter = updown_filter.astype(np.float32)

def analysis(self, x):
"""
x : B x 1 x T
"""
x = tf.transpose(x, perm=[0, 2, 1])
x = tf.pad(x, [[0, 0], [self.taps // 2, self.taps // 2], [0, 0]], constant_values=0.0)
x = tf.nn.conv1d(x, self.H, stride=1, padding='VALID')
x = tf.nn.conv1d(x,
self.updown_filter,
stride=self.N,
padding='VALID')
x = tf.transpose(x, perm=[0, 2, 1])
return x

def synthesis(self, x):
"""
x : B x 1 x T
"""
x = tf.transpose(x, perm=[0, 2, 1])
x = tf.nn.conv1d_transpose(
x,
self.updown_filter * self.N,
strides=self.N,
output_shape=(tf.shape(x)[0], tf.shape(x)[1] * self.N,
self.N))
x = tf.pad(x, [[0, 0], [self.taps // 2, self.taps // 2], [0, 0]], constant_values=0.0)
x = tf.nn.conv1d(x, self.G, stride=1, padding="VALID")
x = tf.transpose(x, perm=[0, 2, 1])
return x
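
A hypothetical smoke test for the filterbank above (shapes follow the B x 1 x T convention from the docstrings):

```
import tensorflow as tf

# assumes PQMF from vocoder/tf/layers/pqmf.py above
pqmf = PQMF(N=4, taps=62, cutoff=0.15, beta=9.0)
audio = tf.random.uniform((1, 1, 4096))  # B x 1 x T
subbands = pqmf.analysis(audio)          # -> (1, 4, 1024), i.e. B x N x T // N
recon = pqmf.synthesis(subbands)         # -> (1, 1, 4096), i.e. B x 1 x T
print(subbands.shape, recon.shape)
```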