Add multilingual support #11

Closed
wants to merge 95 commits
Changes from 38 commits
Commits
95 commits
9fddb8f
Added reversal classifier in tacotron2
WeberJulian Feb 21, 2021
154408e
Added loss and config for adverserial classifier
WeberJulian Feb 21, 2021
9cf7a10
Removing cosine similarity classifier
WeberJulian Feb 21, 2021
446c829
Removing unused GradientClippingFunction
WeberJulian Feb 21, 2021
3c162e6
fixes reversal classifier
WeberJulian Feb 22, 2021
616b18b
reversal classifier first training
WeberJulian Feb 23, 2021
c94eeb0
Add resample script
WeberJulian Mar 3, 2021
3686713
Using path.join instead of concat
WeberJulian Mar 3, 2021
e09743f
fix speaker_id default value for evaluation
WeberJulian Mar 4, 2021
cccc5d6
linter + test
WeberJulian Mar 5, 2021
b133cbf
test case
WeberJulian Mar 5, 2021
32968fa
fix french_cleaners
WeberJulian Mar 5, 2021
8fd272c
fix linter issues
WeberJulian Mar 6, 2021
fdf9dad
Merge branch 'dev' of https://github.com/coqui-ai/TTS into dev
WeberJulian Mar 11, 2021
358f86f
Merge branch 'dev' into multilingual
WeberJulian Mar 11, 2021
24ef107
Fix input dim with gst
WeberJulian Mar 12, 2021
807afd3
Adding multilingual encoder
WeberJulian Mar 13, 2021
82a8d26
linter fixes
WeberJulian Mar 13, 2021
f0baa74
fix linter issues again
WeberJulian Mar 13, 2021
c63a568
last linter fix
WeberJulian Mar 13, 2021
4bfa6ff
fix tests
WeberJulian Mar 14, 2021
36a1460
Trains without crash
WeberJulian Apr 11, 2021
e88ea1d
Actually using multilingual in Tacotron2
WeberJulian Apr 12, 2021
9f3f69e
working with test sentences
WeberJulian Apr 12, 2021
aaf43f1
retore path works
WeberJulian Apr 13, 2021
d9f3bed
Replace lang_ and langs_ contractions
WeberJulian Apr 13, 2021
199aefe
Correct synthesis bug and add GL sythesis notbook
WeberJulian Apr 13, 2021
fd95f39
Merge branch 'dev'
WeberJulian Apr 13, 2021
4072cdd
Enhancments
WeberJulian Apr 13, 2021
ca4dafd
remove unused language_embedding
WeberJulian Apr 13, 2021
904c6b1
fix synthesis
WeberJulian Apr 13, 2021
523e87b
fix resample after optimization
WeberJulian Apr 18, 2021
c8d1767
add weighted_sampler
WeberJulian Apr 18, 2021
2ef3868
notebook changes
WeberJulian Apr 18, 2021
6152bc0
fir odd number of languages
WeberJulian Apr 19, 2021
e12d01b
Add feature to specify speaker/language test file
WeberJulian Apr 19, 2021
698663d
Add language embedding after encoder
WeberJulian Apr 24, 2021
bc9bbb7
new preprocessors
WeberJulian Apr 24, 2021
6be6d1b
Merge remote-tracking branch 'coqui/dev' into multilingual
WeberJulian Apr 24, 2021
99bb49f
HifiGan Sythesis
WeberJulian Apr 29, 2021
d732ebe
Edresson's fix
WeberJulian Apr 30, 2021
de298e3
cleanup after first successfull trainning
WeberJulian May 20, 2021
7d525c0
quick fix
WeberJulian May 25, 2021
72df02f
Refacto and reversal loss fix
WeberJulian May 25, 2021
6ae4695
set speaker_embedding_dim back to 512
WeberJulian May 25, 2021
abfad8f
Temporary fix
WeberJulian May 27, 2021
4920676
added genereted
WeberJulian May 28, 2021
95d983e
support single language training
WeberJulian Jun 1, 2021
e971b9a
separate speaker and language sampler
WeberJulian Jun 1, 2021
44c8791
perfect sampler and generated fixes
WeberJulian Jun 4, 2021
5094923
Generated encoder now runs but slow
WeberJulian Jun 4, 2021
bf69366
first training generated encoder
WeberJulian Jun 4, 2021
ee9bfa7
Fixed inference
WeberJulian Jun 10, 2021
c5faf4e
Generated encoder working
WeberJulian Jun 14, 2021
357be71
add glowTTS multilingual support
Edresson Jun 14, 2021
2ec6232
switch to batch sampler
WeberJulian Jun 16, 2021
120a701
fix batch_n_iter
WeberJulian Jun 16, 2021
4bd6ff4
Merge pull request #1 from Edresson/multilingual
Edresson Jun 17, 2021
f09ec9b
Fixes
WeberJulian Jun 17, 2021
e986558
Merge branch 'multilingual' of https://github.com/WeberJulian/TTS-1 i…
WeberJulian Jun 17, 2021
2299c18
Bug fix on LibriTTS preprocess
Edresson Jun 17, 2021
a51a91f
bug fix
Edresson Jun 18, 2021
d32d3f8
add script for remove silence using VAD
Edresson Jun 18, 2021
bb3897e
fix split dataset
WeberJulian Jun 19, 2021
0353fa3
Merge branch 'multilingual' of https://github.com/WeberJulian/TTS-1 i…
WeberJulian Jun 19, 2021
8e99c13
add stochastic duration predictor
Edresson Jun 20, 2021
b124a5e
Merge branch 'multilingual' of https://github.com/WeberJulian/TTS-1 i…
Edresson Jun 20, 2021
ed4777e
fix documentation
Edresson Jun 20, 2021
e7ecac5
bug fix on Vad remove silence script
Edresson Jun 20, 2021
ea26c10
add extra slots for new languages
Edresson Jun 21, 2021
500fef3
Move Dataloaders closer to the train and eval call
WeberJulian Jun 24, 2021
2f35f76
set glowtts noise_scale value to 0
Edresson Jun 27, 2021
fb03de5
cond language embedding on duration predictor and add inference for v…
Edresson Jun 28, 2021
4905202
bug fix in decoderr inference
Edresson Jun 28, 2021
69230f0
add reversal classifier in GlowTTS
Edresson Jun 30, 2021
2542426
bugfix
Edresson Jul 1, 2021
eb5ede1
add extract spectrogram script
Edresson Jul 2, 2021
8b819f8
fixes
WeberJulian Jul 5, 2021
e85b047
fix eval bug
WeberJulian Jul 5, 2021
c4bf6e7
Merge remote-tracking branch 'origin/multilingual' into multilingual
WeberJulian Jul 5, 2021
8e5afec
Allow for reversal classifier when eval contains unseen speakers
WeberJulian Jul 19, 2021
1ccd691
Allow for differences in feat and wav paths for vocoder training
WeberJulian Jul 21, 2021
fe7eb0a
add pitch predictor support
Edresson Jul 21, 2021
e1f1476
add freeze model parts option in config
Edresson Jul 21, 2021
6e71856
add pitch predictor support
Edresson Jul 21, 2021
a3d523a
Merge remote-tracking branch 'origin/multilingual' into multilingual
WeberJulian Jul 21, 2021
7a1d186
add config datasets support to the gan dataloader
Edresson Jul 21, 2021
a30eadd
add pitch transform
Edresson Jul 23, 2021
3394378
bug fix
Edresson Jul 23, 2021
a19ab0b
pitch predictor bug fix
Edresson Jul 24, 2021
6c8eb30
glowtts singke speaker train bug fix
Edresson Jul 23, 2021
f7e1e37
update pitch predictor network
Edresson Jul 25, 2021
43e4415
bug fix
Edresson Aug 2, 2021
aa10b54
add VITS model support
Edresson Aug 5, 2021
a7963e0
bug fix
Edresson Aug 5, 2021
6 changes: 4 additions & 2 deletions TTS/bin/resample.py
@@ -11,8 +11,10 @@

 def resample_file(func_args):
     filename, output_sr = func_args
-    y, sr = librosa.load(filename, sr=output_sr)
-    librosa.output.write_wav(filename, y, sr)
+    y, sr = librosa.load(filename, sr=None)
+    if output_sr != sr:
+        y = librosa.resample(y, sr, output_sr)
+    librosa.output.write_wav(filename, y, output_sr)


 if __name__ == "__main__":
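The change above loads audio at its native rate and resamples only when it differs from the target, instead of forcing a decode-time resample on every file. A minimal numpy-only sketch of that decision (linear interpolation stands in for librosa's resampler here, so `maybe_resample` and its internals are illustrative, not the PR's code):

```python
import numpy as np

def maybe_resample(y, sr, output_sr):
    # Skip the costly resample when the file already has the target rate.
    if sr == output_sr:
        return y, sr
    # Linear interpolation as a stand-in for librosa.resample.
    n_out = int(round(len(y) * output_sr / sr))
    x_old = np.linspace(0.0, 1.0, num=len(y), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, y), output_sr
```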
97 changes: 86 additions & 11 deletions TTS/bin/train_tacotron.py
@@ -5,7 +5,7 @@
import sys
import time
import traceback
from random import randrange
import random

import numpy as np
import torch
@@ -17,7 +17,7 @@
from TTS.tts.utils.generic_utils import setup_model
from TTS.tts.utils.io import save_best_model, save_checkpoint
from TTS.tts.utils.measures import alignment_diagonal_score
from TTS.tts.utils.speakers import parse_speakers
from TTS.tts.utils.speakers import parse_speakers, parse_languages
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.symbols import make_symbols, phonemes, symbols
from TTS.tts.utils.visual import plot_alignment, plot_spectrogram
@@ -70,6 +70,31 @@ def setup_loader(ap, r, is_val=False, verbose=False, dataset=None):
dataset.sort_items()

sampler = DistributedSampler(dataset) if num_gpus > 1 else None
if getattr(c, "weighted_sampler", False) and sampler is None and not is_val:
print("Using weighted sampler")
# get speaker/language names
speaker_names = np.array([item[2] for item in dataset.items])
language_names = np.array([item[3] for item in dataset.items])

unique_speaker_names = np.unique(speaker_names).tolist()
unique_language_names = np.unique(language_names).tolist()

speaker_ids = [unique_speaker_names.index(s) for s in speaker_names]
language_ids = [unique_language_names.index(l) for l in language_names]

# count number samples by speaker/language
speaker_count = np.array([len(np.where(speaker_names == s)[0]) for s in unique_speaker_names])
language_count = np.array([len(np.where(language_names == l)[0]) for l in unique_language_names])

# create weight
weight_speaker = 1. / speaker_count
weight_language = 1. / language_count
samples_weight = np.array([weight_speaker[s] for s in speaker_ids]) + np.array([weight_language[l] for l in language_ids])
dataset_samples_weight = torch.from_numpy(samples_weight).double()

# create sampler
sampler = torch.utils.data.sampler.WeightedRandomSampler(dataset_samples_weight, len(dataset_samples_weight))

loader = DataLoader(
dataset,
batch_size=c.eval_batch_size if is_val else c.batch_size,
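The sampler above weights each training sample by the inverse frequency of its speaker plus the inverse frequency of its language, so rare speakers and rare languages are drawn more often. The same weights can be computed more compactly with `np.unique`; a sketch (the `WeightedRandomSampler` call is left as a comment since it needs torch — `balance_weights` is an illustrative name, not the PR's):

```python
import numpy as np

def balance_weights(speaker_names, language_names):
    """Per-sample weight = 1/count(speaker) + 1/count(language)."""
    _, spk_inv, spk_count = np.unique(speaker_names, return_inverse=True, return_counts=True)
    _, lang_inv, lang_count = np.unique(language_names, return_inverse=True, return_counts=True)
    weights = 1.0 / spk_count[spk_inv] + 1.0 / lang_count[lang_inv]
    # sampler = torch.utils.data.WeightedRandomSampler(
    #     torch.from_numpy(weights).double(), len(weights))
    return weights
```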
@@ -92,12 +117,13 @@ def format_data(data):
mel_input = data[4]
mel_lengths = data[5]
stop_targets = data[6]
language_names = data[7]
max_text_length = torch.max(text_lengths.float())
max_spec_length = torch.max(mel_lengths.float())

if c.use_speaker_embedding:
if c.use_external_speaker_embedding_file:
speaker_embeddings = data[8]
speaker_embeddings = data[9]
speaker_ids = None
else:
speaker_ids = [speaker_mapping[speaker_name] for speaker_name in speaker_names]
@@ -107,6 +133,12 @@
speaker_embeddings = None
speaker_ids = None

if c.use_language_embedding:
language_ids = [language_mapping[language_name] for language_name in language_names]
language_ids = torch.LongTensor(language_ids)
else:
language_ids = None

# set stop targets view, we predict a single stop token per iteration.
stop_targets = stop_targets.view(text_input.shape[0], stop_targets.size(1) // c.r, -1)
stop_targets = (stop_targets.sum(2) > 0.0).unsqueeze(2).float().squeeze(2)
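The reshape above groups stop targets into chunks of `r` frames so that a single stop token is predicted per decoder step (Tacotron2's reduction factor). A numpy sketch with hypothetical sizes, mirroring the view/sum/threshold sequence:

```python
import numpy as np

B, T, r = 2, 6, 3                               # hypothetical batch, frames, reduction factor
stop_targets = np.zeros((B, T), dtype=np.float32)
stop_targets[:, -1] = 1.0                       # stop flag on the final frame only
grouped = stop_targets.reshape(B, T // r, r)    # one group of r frames per decoder step
stop_per_step = (grouped.sum(-1) > 0.0).astype(np.float32)
```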
@@ -123,6 +155,8 @@
speaker_ids = speaker_ids.cuda(non_blocking=True)
if speaker_embeddings is not None:
speaker_embeddings = speaker_embeddings.cuda(non_blocking=True)
if language_ids is not None:
language_ids = language_ids.cuda(non_blocking=True)

return (
text_input,
@@ -132,12 +166,26 @@
linear_input,
stop_targets,
speaker_ids,
language_ids,
speaker_embeddings,
max_text_length,
max_spec_length,
)


def extract_parameters(test_sentence):
splited = test_sentence.split('|')
if len(splited) == 1: # No language or speaker info
return (splited[0], None, None)
if len(splited) == 2: # No language info
sentence, speaker = splited
return (sentence, speaker_mapping[speaker], None)
if len(splited) == 3:
sentence, speaker, language = splited
return (sentence, speaker_mapping[speaker], language_mapping[language])
raise RuntimeError("Invalid line was given in the test sentence file.")
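The parser above expects test sentences in a `text|speaker|language` format with both trailing fields optional. A self-contained, runnable version for illustration — the mapping tables here are hypothetical stand-ins for the globals the training script builds:

```python
speaker_mapping = {"p225": 0, "p226": 1}   # hypothetical speaker-name -> id table
language_mapping = {"en": 0, "fr": 1}      # hypothetical language-name -> id table

def extract_parameters(test_sentence):
    parts = test_sentence.split("|")
    if len(parts) == 1:                    # no speaker or language info
        return parts[0], None, None
    if len(parts) == 2:                    # no language info
        sentence, speaker = parts
        return sentence, speaker_mapping[speaker], None
    if len(parts) == 3:
        sentence, speaker, language = parts
        return sentence, speaker_mapping[speaker], language_mapping[language]
    raise RuntimeError("Invalid line was given in the test sentence file.")
```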


def train(data_loader, model, criterion, optimizer, optimizer_st, scheduler, ap, global_step, epoch, scaler, scaler_st):
model.train()
epoch_time = 0
@@ -160,6 +208,7 @@
linear_input,
stop_targets,
speaker_ids,
language_ids,
speaker_embeddings,
max_text_length,
max_spec_length,
@@ -186,22 +235,25 @@
stop_tokens,
decoder_backward_output,
alignments_backward,
speaker_prediction,
) = model(
text_input,
text_lengths,
mel_input,
mel_lengths,
speaker_ids=speaker_ids,
speaker_embeddings=speaker_embeddings,
language_ids=language_ids,
)
else:
decoder_output, postnet_output, alignments, stop_tokens = model(
decoder_output, postnet_output, alignments, stop_tokens, speaker_prediction = model(
text_input,
text_lengths,
mel_input,
mel_lengths,
speaker_ids=speaker_ids,
speaker_embeddings=speaker_embeddings,
language_ids=language_ids,
)
decoder_backward_output = None
alignments_backward = None
@@ -228,6 +280,8 @@
alignment_lengths,
alignments_backward,
text_lengths,
speaker_prediction,
speaker_ids,
)

# check nan loss
@@ -405,6 +459,7 @@ def evaluate(data_loader, model, criterion, ap, global_step, epoch):
linear_input,
stop_targets,
speaker_ids,
language_ids,
speaker_embeddings,
_,
_,
@@ -420,12 +475,23 @@
stop_tokens,
decoder_backward_output,
alignments_backward,
speaker_prediction,
) = model(
text_input, text_lengths, mel_input, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings
text_input,
text_lengths,
mel_input,
speaker_ids=speaker_ids,
language_ids=language_ids,
speaker_embeddings=speaker_embeddings,
)
else:
decoder_output, postnet_output, alignments, stop_tokens = model(
text_input, text_lengths, mel_input, speaker_ids=speaker_ids, speaker_embeddings=speaker_embeddings
decoder_output, postnet_output, alignments, stop_tokens, speaker_prediction = model(
text_input,
text_lengths,
mel_input,
speaker_ids=speaker_ids,
language_ids=language_ids,
speaker_embeddings=speaker_embeddings,
)
decoder_backward_output = None
alignments_backward = None
@@ -452,6 +518,8 @@
alignment_lengths,
alignments_backward,
text_lengths,
speaker_prediction,
speaker_ids,
)

# step time
@@ -536,12 +604,14 @@ def evaluate(data_loader, model, criterion, ap, global_step, epoch):
test_audios = {}
test_figures = {}
print(" | > Synthesizing test sentences")
speaker_id = 0 if c.use_speaker_embedding else None
# Those defaults are used if speaker and language are not defined in the test_sentences_file
speaker_id = 5 if c.use_speaker_embedding else None
speaker_embedding = (
speaker_mapping[list(speaker_mapping.keys())[randrange(len(speaker_mapping) - 1)]]["embedding"]
speaker_mapping[list(speaker_mapping.keys())[random.randrange(len(speaker_mapping) - 1)]]["embedding"]
if c.use_external_speaker_embedding_file and c.use_speaker_embedding
else None
)
language_id = 0 if c.use_language_embedding else None
style_wav = c.get("gst_style_input")
if style_wav is None and c.use_gst:
# inicialize GST with zero dict.
@@ -552,13 +622,16 @@
style_wav = c.get("gst_style_input")
for idx, test_sentence in enumerate(test_sentences):
try:
test_sentence, speaker_id, language_id = extract_parameters(test_sentence)
wav, alignment, decoder_output, postnet_output, stop_tokens, _ = synthesis(
model,
test_sentence,
c,
use_cuda,
ap,
speaker_id=speaker_id,
language_id=language_id,
language_mapping=language_mapping,
speaker_embedding=speaker_embedding,
style_wav=style_wav,
truncated=False,
@@ -584,7 +657,7 @@

def main(args): # pylint: disable=redefined-outer-name
# pylint: disable=global-variable-undefined
global meta_data_train, meta_data_eval, speaker_mapping, symbols, phonemes, model_characters
global meta_data_train, meta_data_eval, speaker_mapping, symbols, phonemes, model_characters, language_mapping
# Audio processor
ap = AudioProcessor(**c.audio)

@@ -600,6 +673,7 @@

# load data instances
meta_data_train, meta_data_eval = load_meta_data(c.datasets)
# meta_data_train = random.sample(meta_data_train, len(meta_data_train)//64) # to speedup train phase for dev purposes

# set the portion of the data used for training
if "train_portion" in c.keys():
@@ -609,8 +683,9 @@

# parse speakers
num_speakers, speaker_embedding_dim, speaker_mapping = parse_speakers(c, args, meta_data_train, OUT_PATH)
num_langs, language_embedding_dim, language_mapping = parse_languages(c, args, meta_data_train, OUT_PATH)

model = setup_model(num_chars, num_speakers, c, speaker_embedding_dim)
model = setup_model(num_chars, num_speakers, num_langs, c, speaker_embedding_dim, language_embedding_dim)

# scalers for mixed precision training
scaler = torch.cuda.amp.GradScaler() if c.mixed_precision else None
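The diff calls `parse_languages` alongside `parse_speakers` and uses its three return values. The PR's implementation is not shown in this hunk, so the following is a hypothetical sketch of the shape it plausibly has — a language count, an embedding dimension, and a name-to-id mapping built from the metadata items, where the language name is assumed to be the fourth field:

```python
def parse_languages(items, embedding_dim=64):
    """Hypothetical sketch: build a sorted language -> id mapping from metadata.
    `embedding_dim` is an illustrative default, not the PR's value."""
    names = sorted({item[3] for item in items})
    language_mapping = {name: idx for idx, name in enumerate(names)}
    return len(names), embedding_dim, language_mapping
```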
4 changes: 4 additions & 0 deletions TTS/tts/configs/config.json
@@ -159,6 +159,10 @@
         "gst_style_tokens": 10,
         "gst_use_speaker_embedding": false
     },
+    "reversal_classifier": false,
+    "reversal_classifier_dim": 256,
+    "reversal_classifier_w": 0.125,
+    "reversal_gradient_clipping": 0.25,

// DATASETS
"datasets": // List of datasets. They all merged and they get different speaker_ids.
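The four `reversal_classifier` keys configure an adversarial speaker classifier trained through a gradient reversal layer: the classifier learns to predict the speaker, while the reversed (and clipped) gradient pushes the encoder to discard speaker information. The backward-pass semantics those knobs control can be sketched without a framework — a numpy stand-in for what autograd does in the PR, with the constants taken from the config values above:

```python
import numpy as np

REVERSAL_W = 0.125   # reversal_classifier_w: weight of the adversarial loss
GRAD_CLIP = 0.25     # reversal_gradient_clipping: bound on the reversed gradient

def reversed_gradient(grad):
    """Forward pass of a reversal layer is the identity; on the way back the
    gradient is scaled by -w and clipped so the adversarial signal stays bounded."""
    g = -REVERSAL_W * np.asarray(grad, dtype=np.float64)
    return np.clip(g, -GRAD_CLIP, GRAD_CLIP)
```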
15 changes: 10 additions & 5 deletions TTS/tts/datasets/TTSDataset.py
@@ -80,7 +80,7 @@ def __init__(
print("\n > DataLoader initialization")
print(" | > Use phonemes: {}".format(self.use_phonemes))
if use_phonemes:
print(" | > phoneme language: {}".format(phoneme_language))
print(" | > Default phoneme language: {}".format(phoneme_language))
print(" | > Number of instances : {}".format(len(self.items)))

def load_wav(self, filename):
@@ -133,10 +133,10 @@ def _load_or_generate_phoneme_sequence(
def load_data(self, idx):
item = self.items[idx]

if len(item) == 4:
text, wav_file, speaker_name, attn_file = item
if len(item) == 5:
text, wav_file, speaker_name, language_name, attn_file = item
else:
text, wav_file, speaker_name = item
text, wav_file, speaker_name, language_name = item
attn = None

wav = np.asarray(self.load_wav(wav_file), dtype=np.float32)
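Dataset items gained a language field in this PR, so the tuple handling above now distinguishes 5-tuples (which also carry a precomputed attention file) from plain 4-tuples. A standalone sketch of that branch — `unpack_item` is an illustrative helper name, not part of the PR:

```python
def unpack_item(item):
    """Normalize a dataset item to a 5-tuple, filling attn_file with None."""
    if len(item) == 5:
        text, wav_file, speaker_name, language_name, attn_file = item
    else:  # no alignment file precomputed for this item
        text, wav_file, speaker_name, language_name = item
        attn_file = None
    return text, wav_file, speaker_name, language_name, attn_file
```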
@@ -153,7 +153,7 @@ def load_data(self, idx):
self.phoneme_cache_path,
self.enable_eos_bos,
self.cleaners,
self.phoneme_language,
language_name if language_name else self.phoneme_language,
self.tp,
self.add_blank,
)
@@ -181,6 +181,7 @@ def load_data(self, idx):
"attn": attn,
"item_idx": self.items[idx][1],
"speaker_name": speaker_name,
"language_name": language_name,
"wav_file_name": os.path.basename(wav_file),
}
return sample
@@ -294,6 +295,9 @@ def collate_fn(self, batch):
text = [batch[idx]["text"] for idx in ids_sorted_decreasing]

speaker_name = [batch[idx]["speaker_name"] for idx in ids_sorted_decreasing]

language_name = [batch[idx]["language_name"] for idx in ids_sorted_decreasing]

# get speaker embeddings
if self.speaker_mapping is not None:
wav_files_names = [batch[idx]["wav_file_name"] for idx in ids_sorted_decreasing]
@@ -360,6 +364,7 @@ def collate_fn(self, batch):
mel,
mel_lengths,
stop_targets,
language_name,
item_idxs,
speaker_embedding,
attns,