Custom Dataset support + Gentle-based custom dataset preprocessing support #78
Changes from 26 commits
3ee20fb
720cf1c
6d4d594
e84b923
030de15
3075486
b8252ae
052d030
92a84d9
a155fb9
747f2e0
5214c24
e22388a
d7908d0
a6969ac
15eb591
d1258e7
89760d2
ba182f9
5d104e6
32cab90
6d8973a
9bae706
3c61d46
d9e8cc7
543a418
132cd14
8fc35ad
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,153 @@ | ||
# -*- coding: utf-8 -*- | ||
""" | ||
Created on Sat Apr 21 09:06:37 2018 | ||
Phoneme alignment and conversion to HTK-style label files using web-served Gentle | ||
This works on any type of English dataset. | ||
This allows use on Windows (via Docker) and with an external server. | ||
Review comment: Just to be sure, the reason for using server-based Gentle rather than the Python API is that it allows use on Windows, right? Any other reasons?
Reply: Yep, and also because Gentle is compatible with Python 2 only, while this repo is Python 3 compatible. In addition, if we use server-based Gentle, we can also use an external server. |
||
Preliminary results show that Gentle performs better on noisy datasets | ||
(e.g. audio clips extracted from movies) | ||
*This work was derived from vctk_preprocess/prepare_htk_alignments_vctk.py | ||
@author: engiecat(github) | ||
|
||
usage: | ||
gentle_web_align.py (-w wav_pattern) (-t text_pattern) [options] | ||
gentle_web_align.py (--nested-directories=<main_directory>) [options] | ||
|
||
options: | ||
-w <wav_pattern> --wav_pattern=<wav_pattern> Pattern of wav files to be aligned | ||
-t <txt_pattern> --txt_pattern=<txt_pattern> Pattern of txt transcript files to be aligned (same name required) | ||
--nested-directories=<main_directory> Process every wav/txt file in the subfolders of the given folder | ||
--server_addr=<server_addr> Server address that serves gentle. [default: localhost] | ||
--port=<port> Server port that serves gentle. [default: 8567] | ||
--max_unalign=<max_unalign> Maximum allowed ratio of unaligned words (0.0 ~ 1.0) [default: 0.3] | ||
--skip-already-done Skip files that already have a .lab file | ||
-h --help show this help message and exit | ||
""" | ||
|
||
from docopt import docopt | ||
from glob import glob | ||
from tqdm import tqdm | ||
import os.path | ||
import requests | ||
import numpy as np | ||
|
||
def write_hts_label(labels, lab_path): | ||
lab = "" | ||
for s, e, l in labels: | ||
s, e = float(s) * 1e7, float(e) * 1e7 | ||
s, e = int(s), int(e) | ||
lab += "{} {} {}\n".format(s, e, l) | ||
print(lab) | ||
with open(lab_path, "w", encoding='utf-8') as f: | ||
f.write(lab) | ||
|
||
|
||
def json2hts(data): | ||
emit_bos = False | ||
emit_eos = False | ||
|
||
phone_start = 0 | ||
phone_end = None | ||
labels = [] | ||
failure_count = 0 | ||
|
||
for word in data["words"]: | ||
case = word["case"] | ||
if case != "success": | ||
failure_count += 1 # instead of failing everything, | ||
#raise RuntimeError("Alignment failed") | ||
continue | ||
start = float(word["start"]) | ||
word_end = float(word["end"]) | ||
|
||
if not emit_bos: | ||
labels.append((phone_start, start, "silB")) | ||
emit_bos = True | ||
|
||
phone_start = start | ||
phone_end = None | ||
for phone in word["phones"]: | ||
ph = str(phone["phone"][:-2]) | ||
duration = float(phone["duration"]) | ||
phone_end = phone_start + duration | ||
labels.append((phone_start, phone_end, ph)) | ||
phone_start += duration | ||
assert np.allclose(phone_end, word_end) | ||
if not emit_eos: | ||
labels.append((phone_start, phone_end, "silE")) | ||
emit_eos = True | ||
unalign_ratio = float(failure_count) / len(data['words']) | ||
return unalign_ratio, labels | ||
|
||
|
||
def gentle_request(wav_path,txt_path, server_addr, port, debug=False): | ||
print('\n') | ||
response = None | ||
wav_name = os.path.basename(wav_path) | ||
txt_name = os.path.basename(txt_path) | ||
if os.path.splitext(wav_name)[0] != os.path.splitext(txt_name)[0]: | ||
print(' [!] wav name and transcript name do not match - exiting...') | ||
return response | ||
with open(txt_path, 'r', encoding='utf-8-sig') as txt_file: | ||
Review comment: I'm guessing
Reply: Well, it was in my case (probably because I am currently mixing Windows (for running PyTorch) and Linux (for data preparation/alignment)), and I think that setting encoding='utf-8-sig' when opening the file is better for ensuring compatibility. |
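The effect of `encoding='utf-8-sig'` mentioned in the reply can be checked in isolation. The following is a minimal sketch, assuming a transcript saved with a leading UTF-8 byte order mark (as some Windows editors do); the file name `sample.txt` and its contents are hypothetical.

```python
import os
import tempfile

# Hypothetical transcript written with a UTF-8 BOM (EF BB BF) at the start.
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, "sample.txt")
with open(txt_path, "wb") as f:
    f.write(b"\xef\xbb\xbfHello world")

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
with open(txt_path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' strips the BOM if present and otherwise behaves like 'utf-8'.
with open(txt_path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom))
print(repr(without_bom))
```

Note that in the diff the transcript is uploaded from a separate handle opened in binary mode, so this encoding choice affects only the transcript printed to the console, not the bytes sent to the server.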
||
print('Transcript - '+''.join(txt_file.readlines())) | ||
with open(wav_path,'rb') as wav_file, open(txt_path, 'rb') as txt_file: | ||
params = (('async','false'),) | ||
files={'audio':(wav_name,wav_file), | ||
'transcript':(txt_name,txt_file), | ||
} | ||
server_path = 'http://'+server_addr+':'+str(port)+'/transcriptions' | ||
response = requests.post(server_path, params=params,files=files) | ||
if response.status_code != 200: | ||
print(' [!] External server({}) returned bad response({})'.format(server_path, response.status_code)) | ||
if debug: | ||
print('Response') | ||
print(response.json()) | ||
return response | ||
|
||
if __name__ == '__main__': | ||
arguments = docopt(__doc__) | ||
server_addr = arguments['--server_addr'] | ||
port = int(arguments['--port']) | ||
max_unalign = float(arguments['--max_unalign']) | ||
if arguments['--nested-directories'] == None: | ||
Review comment: nits: I'd slightly prefer
Reply: Great! I will change this too |
||
wav_paths = sorted(glob(arguments['--wav_pattern'])) | ||
txt_paths = sorted(glob(arguments['--txt_pattern'])) | ||
else: | ||
# if this is multi-foldered environment | ||
# (e.g. DATASET/speaker1/blahblah.wav) | ||
wav_paths=[] | ||
txt_paths=[] | ||
topdir = arguments['--nested-directories'] | ||
subdirs = [f for f in os.listdir(topdir) if os.path.isdir(os.path.join(topdir, f))] | ||
for subdir in subdirs: | ||
wav_pattern_subdir = os.path.join(topdir, subdir, '*.wav') | ||
txt_pattern_subdir = os.path.join(topdir, subdir, '*.txt') | ||
wav_paths.extend(sorted(glob(wav_pattern_subdir))) | ||
txt_paths.extend(sorted(glob(txt_pattern_subdir))) | ||
|
||
t = tqdm(range(len(wav_paths))) | ||
for idx in t: | ||
try: | ||
t.set_description("Align via Gentle") | ||
wav_path = wav_paths[idx] | ||
txt_path = txt_paths[idx] | ||
lab_path = os.path.splitext(wav_path)[0]+'.lab' | ||
if os.path.exists(lab_path) and arguments['--skip-already-done']: | ||
print('[!] skipping because of pre-existing .lab file - {}'.format(lab_path)) | ||
continue | ||
res=gentle_request(wav_path,txt_path, server_addr, port) | ||
unalign_ratio, lab = json2hts(res.json()) | ||
print('[*] Unaligned Ratio - {}'.format(unalign_ratio)) | ||
if unalign_ratio > max_unalign: | ||
print('[!] skipping this due to bad alignment') | ||
continue | ||
write_hts_label(lab, lab_path) | ||
except Exception: | ||
# if something goes wrong, skip this file (Ctrl-C still interrupts the run) | ||
import traceback | ||
tb = traceback.format_exc() | ||
print('[!] ERROR while processing {}'.format(wav_paths[idx])) | ||
print('[!] StackTrace - ') | ||
print(tb) | ||
|
||
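Because the diff view drops indentation, the conversion logic of `json2hts` and `write_hts_label` is hard to follow. The following is a self-contained sketch of the same idea, not the PR's code: the names `json2hts_sketch` and `format_hts` are hypothetical, times are rounded rather than truncated, and an input with no words is reported as fully unaligned (ratio 1.0) instead of dividing by zero. Both of those choices are assumptions of this sketch. The input shape (`words`, `case`, `start`, `end`, `phones`, `duration`) is taken from the fields the diff reads.

```python
import math

HTK_UNITS_PER_SECOND = 1e7  # HTK label times are expressed in units of 100 ns


def json2hts_sketch(data):
    """Return (unalign_ratio, labels) from a Gentle-style alignment dict."""
    words = data.get("words", [])
    labels = []
    failures = 0
    cursor = 0.0  # end time (seconds) of the last emitted phone

    for word in words:
        if word.get("case") != "success":
            failures += 1  # count the failure instead of aborting
            continue
        start = float(word["start"])
        if not labels:
            # Leading silence up to the first aligned word.
            labels.append((0.0, start, "silB"))
        cursor = start
        for phone in word["phones"]:
            # Drop the two-character suffix, as the diff does with [:-2].
            name = str(phone["phone"][:-2])
            end = cursor + float(phone["duration"])
            labels.append((cursor, end, name))
            cursor = end
        # Phone durations should add up to the word's end time.
        assert math.isclose(cursor, float(word["end"]), abs_tol=1e-6)

    if labels:
        # Zero-length end marker, mirroring the behaviour in the diff.
        labels.append((cursor, cursor, "silE"))
    # Assumption: no words at all counts as fully unaligned.
    ratio = failures / len(words) if words else 1.0
    return ratio, labels


def format_hts(labels):
    """Format (start, end, name) tuples in seconds as HTK-style label lines."""
    return "".join(
        "{} {} {}\n".format(
            int(round(s * HTK_UNITS_PER_SECOND)),
            int(round(e * HTK_UNITS_PER_SECOND)),
            name,
        )
        for s, e, name in labels
    )


sample = {
    "words": [
        {
            "case": "success",
            "start": 0.5,
            "end": 0.8,
            "phones": [
                {"phone": "hh_B", "duration": 0.1},
                {"phone": "ay_E", "duration": 0.2},
            ],
        },
        {"case": "not-found-in-audio"},
    ]
}
ratio, labels = json2hts_sketch(sample)
print(ratio)
print(format_hts(labels))
```

With the default `--max_unalign` of 0.3, a file like this sample (half of its words unaligned) would be skipped by the main loop.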
|
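The main loop above pairs `wav_paths[idx]` with `txt_paths[idx]` from two independently sorted lists, so a single missing transcript shifts every later pair and each of them then fails the name check in `gentle_request`. One possible alternative is to pair files by their shared path stem. This is a sketch under the assumption that each transcript sits next to its wav file; `pair_by_stem` is a hypothetical helper and not part of the PR.

```python
import os


def pair_by_stem(wav_paths, txt_paths):
    """Pair wav and txt files that share the same path without extension.

    Returns (pairs, unmatched_wavs). Assumes each transcript is stored in
    the same directory as its wav file.
    """
    def stem(path):
        return os.path.splitext(os.path.normpath(path))[0]

    txt_by_stem = {stem(p): p for p in txt_paths}
    pairs, unmatched = [], []
    for wav in sorted(wav_paths):
        txt = txt_by_stem.get(stem(wav))
        if txt is None:
            unmatched.append(wav)  # no transcript with the same stem
        else:
            pairs.append((wav, txt))
    return pairs, unmatched


pairs, unmatched = pair_by_stem(
    ["d/a.wav", "d/b.wav", "d/c.wav"],
    ["d/a.txt", "d/c.txt"],
)
print(pairs)
print(unmatched)
```

With index-based pairing, the same input would try to align `d/b.wav` against `d/c.txt`; here `d/b.wav` is reported as unmatched and `d/c.wav` keeps its own transcript.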
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -125,6 +125,14 @@ | |
# Forced garbage collection probability | ||
# Use only when MemoryError continues in Windows (Disabled by default) | ||
#gc_probability = 0.001, | ||
|
||
# json_meta mode only | ||
# 0: "use all", | ||
Review comment: Please consider spaces rather than tabs.
Reply: Oops O.o will change it. |
||
# 1: "ignore only unmatched_alignment", | ||
# 2: "fully ignore recognition", | ||
ignore_recognition_level = 2, | ||
min_text=20, | ||
Review comment: I was also thinking about this and something like
Reply: Actually it was implemented for some reasons.
But it was implemented as a quick fix, and I do know that min_frame is a much, much better solution. Will leave the comments :) |
||
process_only_htk_aligned = False, | ||
) | ||
|
||
|
||
|
Review comment: Can this be safely removed? Assuming this is for your local setup only.
Reply: Yes, I will remove it! :) Thanks for telling me.