Custom Dataset support + Gentle-based custom dataset preprocessing support #78

Merged · 28 commits · Apr 30, 2018
Commits (28)
3ee20fb
Fixed typeerror (torch.index_select received an invalid combination o…
engiecat Mar 3, 2018
720cf1c
Fixed Nonetype error in collect_features
engiecat Mar 3, 2018
6d4d594
requirements.txt fix
engiecat Mar 3, 2018
e84b923
Memory Leakage bugfix + hparams change
engiecat Mar 3, 2018
030de15
Pre-PR modifications
engiecat Mar 8, 2018
3075486
Pre-PR modifications 2
engiecat Mar 8, 2018
b8252ae
Pre-PR modifications 3
engiecat Mar 8, 2018
052d030
Post-PR modification
engiecat Mar 10, 2018
92a84d9
remove requirements.txt
engiecat Mar 10, 2018
a155fb9
num_workers to 1 in train.py
engiecat Mar 10, 2018
747f2e0
Merge branch 'master' into master
engiecat Mar 10, 2018
5214c24
Windows log filename bugfix
engiecat Mar 10, 2018
e22388a
Revert "Windows log filename bugfix"
engiecat Mar 10, 2018
d7908d0
Merge remote-tracking branch 'upstream/master'
engiecat Mar 10, 2018
a6969ac
merge 2
engiecat Mar 10, 2018
15eb591
Windows Filename bugfix
engiecat Mar 10, 2018
d1258e7
Cleanup before PR
engiecat Mar 10, 2018
89760d2
Merge pull request #3 from r9y9/master
engiecat Mar 10, 2018
ba182f9
JSON format Metadata support
engiecat Mar 18, 2018
5d104e6
Web based Gentle aligner support
engiecat Apr 21, 2018
32cab90
Merge pull request #4 from r9y9/master
engiecat Apr 21, 2018
6d8973a
Merge pull request #5 from r9y9/master
engiecat Apr 27, 2018
9bae706
README change + gentle patch
engiecat Apr 27, 2018
3c61d46
Merge branch 'master' of https://github.com/engiecat/deepvoice3_pytorch
engiecat Apr 27, 2018
d9e8cc7
.gitignore change
engiecat Apr 28, 2018
543a418
Flake8 Fix
engiecat Apr 28, 2018
132cd14
Post PR commit - Also fixed #5
engiecat Apr 30, 2018
8fc35ad
Post-PR 2 - .gitignore
engiecat Apr 30, 2018
7 changes: 7 additions & 0 deletions .gitignore
@@ -10,6 +10,7 @@ log
generated
data
text
datasets

# Created by https://www.gitignore.io

@@ -199,3 +200,9 @@ Temporary Items

# Linux trash folder which might appear on any partition or disk
.Trash-*
vctk_preprocess/WorkingHowToUseThis.txt
GoTBook1.01.txt
presets/deepvoice3_got.json
presets/deepvoice3_gotOnly.json
presets/deepvoice3_stest.json
presets/deepvoice3_test.json

> **Owner:** Can this be safely removed? Assuming this is for your local environment only.

> **Contributor Author (engiecat):** Yes, I will remove it! :) Thanks for telling me.

53 changes: 48 additions & 5 deletions README.md
@@ -21,8 +21,8 @@ A notebook supposed to be executed on https://colab.research.google.com is avail
- Convolutional sequence-to-sequence model with attention for text-to-speech synthesis
- Multi-speaker and single speaker versions of DeepVoice3
- Audio samples and pre-trained models
- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets, as well as [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow) compatible custom dataset (in JSON format)
- Language-dependent frontend text processor for English and Japanese

### Samples

@@ -102,7 +102,7 @@ python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljs
- LJSpeech (en): https://keithito.com/LJ-Speech-Dataset/
- VCTK (en): http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
- JSUT (jp): https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- NIKL (ko) (**requires a Korean cellphone number to access**): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464

### 1. Preprocessing

@@ -128,6 +128,47 @@ python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/L

When this is done, you will see extracted features (mel-spectrograms and linear spectrograms) in `./data/ljspeech`.

#### 1-1. Building a custom dataset (using json_meta)
Building your own dataset with metadata in JSON format (compatible with [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow)) is currently supported.
Usage:

```
python preprocess.py json_meta ${list-of-JSON-metadata-paths} ${out_dir} --preset=<json>
```
You may need to modify a pre-existing preset JSON file, especially `n_speakers`; a sketch of the relevant field is shown below. For English multi-speaker training, start from `presets/deepvoice3_vctk.json`.
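
For instance, the relevant part of the modified preset might look like this (a sketch, not a full preset — `2` is a placeholder for the number of speakers in your combined metadata, and every other field is inherited from the base preset):

```
{
    "n_speakers": 2
}
```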

Assuming you have dataset A (speaker A) and dataset B (speaker B), each described by the JSON metadata files `./datasets/datasetA/alignment.json` and `./datasets/datasetB/alignment.json`, you can preprocess the data by:

```
python preprocess.py json_meta "./datasets/datasetA/alignment.json,./datasets/datasetB/alignment.json" "./datasets/processed_A+B" --preset=(path to preset json file)
```
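
For reference, a metadata file in this format maps each audio path to its transcript, roughly as sketched below (the schema follows carpedm20/multi-speaker-tacotron-tensorflow; the paths and sentences here are hypothetical, so check your own `alignment.json` for the exact shape):

```
{
    "./datasets/datasetA/audio/segment-0001.wav": "First transcript sentence.",
    "./datasets/datasetA/audio/segment-0002.wav": "Second transcript sentence."
}
```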

#### 1-2. Preprocessing custom English datasets with long silences (based on [vctk_preprocess](vctk_preprocess/))

Some datasets, especially automatically generated ones, may include long silences and undesirable leading/trailing noise, which undermine the char-level seq2seq model.
(e.g. VCTK, although this is covered by vctk_preprocess)

To deal with the problem, `gentle_web_align.py` will:
- **Prepare phoneme alignments for all utterances**
- Cut silences during preprocessing

`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a Kaldi-based speech-text alignment tool. It accesses a web-served Gentle application, aligns the given sound segments with their transcripts, and converts the result to HTK-style label files to be processed by `preprocess.py`. Gentle can be run on Linux/Mac/Windows (via Docker).
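
If you serve Gentle via Docker, a port mapping along these lines should expose it on this script's default port (a sketch — the `lowerquality/gentle` image name and its internal port 8765 are assumptions based on the upstream Gentle project, so check its documentation):

```
docker run -p 8567:8765 lowerquality/gentle
```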

Preliminary results show that while the HTK/festival/merlin-based method in `vctk_preprocess/prepare_vctk_labels.py` works better on VCTK, Gentle is more stable on audio clips with ambient noise (e.g. movie excerpts).

Usage:
(Assuming Gentle is running at `localhost:8567`, the default when not specified)
1. When sound files and transcript files are saved in separate folders (e.g. sound files in `datasetA/wavs` and transcripts in `datasetA/txts`):
```
python gentle_web_align.py -w "datasetA/wavs/*.wav" -t "datasetA/txts/*.txt" --server_addr=localhost --port=8567
```

2. When sound files and transcript files are saved in a nested structure (e.g. `datasetB/speakerN/blahblah.wav` and `datasetB/speakerN/blahblah.txt`):
```
python gentle_web_align.py --nested-directories="datasetB" --server_addr=localhost --port=8567
```
**Once you have a phoneme alignment for each utterance, you can extract features by running `preprocess.py`.**
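
For reference, the resulting `.lab` files contain one segment per line as `start end label`, with times in HTK units of 100 ns (seconds × 10^7) and `silB`/`silE` marking leading/trailing silence — see `write_hts_label` and `json2hts` in `gentle_web_align.py` below. The phonemes and times in this sketch are made up:

```
0 4500000 silB
4500000 5100000 hh
5100000 6300000 ah
6300000 7200000 l
7200000 9000000 ow
9000000 10000000 silE
```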

### 2. Training

Usage:
@@ -139,7 +180,7 @@ python train.py --data-root=${data-root} --preset=<json> --hparams="parameters y
Suppose you build a DeepVoice3-style model using the LJSpeech dataset; then you can train your model by:

```
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
```

Model checkpoints (.pth) and alignments (.png) are saved in the `./checkpoints` directory every 10000 steps by default.
@@ -247,7 +288,9 @@ From my experience, it can get reasonable speech quality very quickly rather tha
There are two important options used above:

- `--restore-parts=<N>`: Specifies where to load model parameters from. It differs from `--checkpoint=<N>` in two ways: 1) `--restore-parts=<N>` ignores all invalid parameters, while `--checkpoint=<N>` doesn't; 2) `--restore-parts=<N>` tells the trainer to start from step 0, while `--checkpoint=<N>` tells it to continue from the last step. `--checkpoint=<N>` is fine if you are continuing to train exactly the same model, but `--restore-parts=<N>` is useful if you want to customize your model architecture and still take advantage of a pre-trained model.
- `--speaker-id=<N>`: Specifies which speaker's data is used for training. This should only be specified if you are using a multi-speaker dataset. For VCTK, speaker ids are assigned incrementally (0, 1, ..., 107) according to `speaker_info.txt` in the dataset.

If you are training a multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical** between the pre-trained model and the new one.
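
Putting these together, a speaker-adaptation run might look like the following (a sketch — the data root, preset, and checkpoint filename are placeholders):

```
python train.py --data-root=./data/mydataset --preset=presets/deepvoice3_vctk.json \
    --restore-parts=checkpoints/checkpoint_step000300000.pth --speaker-id=0
```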

## Acknowledgements

Expand Down
153 changes: 153 additions & 0 deletions gentle_web_align.py
@@ -0,0 +1,153 @@
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 21 09:06:37 2018
Phoneme alignment and conversion to HTK-style label files using web-served Gentle.
This works on any type of English dataset.
This allows its usage on Windows (via Docker) and on an external server.

> **Owner:** Just to be sure, the reason for using server-based Gentle rather than the Python API is that it allows use on Windows, right? Any other reasons?
>
> **Contributor Author (engiecat):** Yep, and also because Gentle is Python 2-compatible only, while this repo is Python 3-compatible. In addition, if we use server-based Gentle, we can also use an external server.

Preliminary results show that Gentle has better performance with noisy datasets
(e.g. audio clips extracted from movies).
This work was derived from vctk_preprocess/prepare_htk_alignments_vctk.py.
@author: engiecat(github)

usage:
    gentle_web_align.py (-w wav_pattern) (-t text_pattern) [options]
    gentle_web_align.py (--nested-directories=<main_directory>) [options]

options:
    -w <wav_pattern> --wav_pattern=<wav_pattern>   Pattern of wav files to be aligned
    -t <txt_pattern> --txt_pattern=<txt_pattern>   Pattern of txt transcript files to be aligned (same names required)
    --nested-directories=<main_directory>          Process every wav/txt file in the subfolders of the given folder
    --server_addr=<server_addr>                    Server address that serves Gentle. [default: localhost]
    --port=<port>                                  Server port that serves Gentle. [default: 8567]
    --max_unalign=<max_unalign>                    Maximum threshold for unalignment occurrence (0.0 ~ 1.0) [default: 0.3]
    --skip-already-done                            Skip files that already have a .lab file
    -h --help                                      Show this help message and exit
"""

from docopt import docopt
from glob import glob
from tqdm import tqdm
import os.path
import requests
import numpy as np

def write_hts_label(labels, lab_path):
    # HTK-style labels: one "start end label" line per segment,
    # with times in units of 100 ns (seconds * 1e7).
    lab = ""
    for s, e, l in labels:
        s, e = float(s) * 1e7, float(e) * 1e7
        s, e = int(s), int(e)
        lab += "{} {} {}\n".format(s, e, l)
    print(lab)
    with open(lab_path, "w", encoding='utf-8') as f:
        f.write(lab)


def json2hts(data):
    # Convert Gentle's JSON alignment into (start, end, phone) labels,
    # inserting silB/silE for leading/trailing silence and counting
    # words that failed to align.
    emit_bos = False
    emit_eos = False

    phone_start = 0
    phone_end = None
    labels = []
    failure_count = 0

    for word in data["words"]:
        case = word["case"]
        if case != "success":
            failure_count += 1  # instead of failing everything, skip this word
            continue
        start = float(word["start"])
        word_end = float(word["end"])

        if not emit_bos:
            labels.append((phone_start, start, "silB"))
            emit_bos = True

        phone_start = start
        phone_end = None
        for phone in word["phones"]:
            ph = str(phone["phone"][:-2])  # strip the positional suffix (e.g. "_B")
            duration = float(phone["duration"])
            phone_end = phone_start + duration
            labels.append((phone_start, phone_end, ph))
            phone_start += duration
        assert np.allclose(phone_end, word_end)
    if not emit_eos:
        labels.append((phone_start, phone_end, "silE"))
        emit_eos = True
    unalign_ratio = float(failure_count) / len(data['words'])
    return unalign_ratio, labels


def gentle_request(wav_path, txt_path, server_addr, port, debug=False):
    print('\n')
    response = None
    wav_name = os.path.basename(wav_path)
    txt_name = os.path.basename(txt_path)
    if os.path.splitext(wav_name)[0] != os.path.splitext(txt_name)[0]:
        print(' [!] wav name and transcript name do not match - exiting...')
        return response
    with open(txt_path, 'r', encoding='utf-8-sig') as txt_file:

> **Owner:** I'm guessing `encoding='utf-8-sig'` is (almost) Windows-specific..? Did you see a UnicodeError with `encoding='utf-8'`?
>
> **Contributor Author (engiecat):** Well, it was in my case (probably because I am currently mixing Windows (for running PyTorch) and Linux (for data preparation/alignment)), and I think that setting `encoding='utf-8-sig'` when opening files is better for ensuring compatibility.

        print('Transcript - ' + ''.join(txt_file.readlines()))
    # POST the audio and transcript to the Gentle server for alignment
    with open(wav_path, 'rb') as wav_file, open(txt_path, 'rb') as txt_file:
        params = (('async', 'false'),)
        files = {'audio': (wav_name, wav_file),
                 'transcript': (txt_name, txt_file),
                 }
        server_path = 'http://' + server_addr + ':' + str(port) + '/transcriptions'
        response = requests.post(server_path, params=params, files=files)
        if response.status_code != 200:
            print(' [!] External server({}) returned bad response({})'.format(server_path, response.status_code))
    if debug:
        print('Response')
        print(response.json())
    return response

if __name__ == '__main__':
    arguments = docopt(__doc__)
    server_addr = arguments['--server_addr']
    port = int(arguments['--port'])
    max_unalign = float(arguments['--max_unalign'])
    if arguments['--nested-directories'] is None:

> **Owner:** nits: I'd slightly prefer `is None` to `== None`.
>
> **Contributor Author (engiecat):** Great! I will change this too.

        wav_paths = sorted(glob(arguments['--wav_pattern']))
        txt_paths = sorted(glob(arguments['--txt_pattern']))
    else:
        # if this is a multi-foldered environment
        # (e.g. DATASET/speaker1/blahblah.wav)
        wav_paths = []
        txt_paths = []
        topdir = arguments['--nested-directories']
        subdirs = [f for f in os.listdir(topdir) if os.path.isdir(os.path.join(topdir, f))]
        for subdir in subdirs:
            wav_pattern_subdir = os.path.join(topdir, subdir, '*.wav')
            txt_pattern_subdir = os.path.join(topdir, subdir, '*.txt')
            wav_paths.extend(sorted(glob(wav_pattern_subdir)))
            txt_paths.extend(sorted(glob(txt_pattern_subdir)))

    t = tqdm(range(len(wav_paths)))
    for idx in t:
        try:
            t.set_description("Align via Gentle")
            wav_path = wav_paths[idx]
            txt_path = txt_paths[idx]
            lab_path = os.path.splitext(wav_path)[0] + '.lab'
            if os.path.exists(lab_path) and arguments['--skip-already-done']:
                print('[!] skipping because of pre-existing .lab file - {}'.format(lab_path))
                continue
            res = gentle_request(wav_path, txt_path, server_addr, port)
            unalign_ratio, lab = json2hts(res.json())
            print('[*] Unaligned Ratio - {}'.format(unalign_ratio))
            if unalign_ratio > max_unalign:
                print('[!] skipping this due to bad alignment')
                continue
            write_hts_label(lab, lab_path)
        except Exception:
            # if anything goes wrong, log the traceback and skip this file
            import traceback
            tb = traceback.format_exc()
            print('[!] ERROR while processing {}'.format(wav_paths[idx]))
            print('[!] StackTrace - ')
            print(tb)


8 changes: 8 additions & 0 deletions hparams.py
@@ -125,6 +125,14 @@
    # Forced garbage collection probability
    # Use only when MemoryError continues in Windows (Disabled by default)
    # gc_probability = 0.001,

    # json_meta mode only
    # 0: "use all",

> **Owner:** Please consider spaces rather than tabs.
>
> **Contributor Author (engiecat):** Oops O.o, will change it. Another vestigial element.


# 1: "ignore only unmatched_alignment",
# 2: "fully ignore recognition",
ignore_recognition_level = 2,
min_text=20,

> **Owner:** I was also thinking about this, and something like `min_frames` to remove short audio clips from the training data. Just out of curiosity, did you get improvements from this? I believe the parameter highly depends on the dataset, and I'd be happy if you could leave a comment, for example: `min_text=20` works well for dataset A but can be adjusted depending on the dataset.
>
> **Contributor Author (engiecat), Apr 28, 2018:** Actually it was implemented for a few reasons.
>
> 1. My automatic alignment tool (which I am going to release soon) cannot handle short speeches well.
> 2. From my experience, short speeches in a non-dedicated dataset (especially ones extracted from movie clips) were prone to noise and a different cadence of speech. (e.g. the word "help" in "The help that is needed is not there." vs. "help" in "HELP!!!")
> 3. (From my experience with other deep-learning-based TTS) Even if the dataset is nearly noise-free and has a uniform cadence, short speeches tend to interfere with the result. (probably because my test set is usually at least 3 words long)
>
> But it was implemented as a quick fix, and I do know that `min_frames` is a much better solution.
>
> Will leave the comments :)


    process_only_htk_aligned=False,
)
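
Since both thresholds are dataset-dependent, they can also be overridden per run instead of editing `hparams.py`. A sketch (the values are illustrative, and this assumes `preprocess.py` accepts the same `--hparams` override string as `train.py`):

```
python preprocess.py json_meta "./datasets/datasetA/alignment.json" ./data/custom \
    --preset=presets/deepvoice3_vctk.json --hparams="min_text=10,ignore_recognition_level=1"
```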

