Merge branch 'develop' of https://github.com/PaddlePaddle/DeepSpeech into ctcdecoders
Showing 63 changed files with 3,509 additions and 634 deletions.
@@ -0,0 +1,75 @@
# TTS Datasets
<!--
see https://openslr.org/
-->
## Mandarin
- [CSMSC](https://www.data-baker.com/open_source.html): Chinese Standard Mandarin Speech Corpus
  - Duration/h: 12
  - Number of Sentences: 10,000
  - Size: 2.14 GB
  - Speaker: 1 female, ages 20–30
  - Sample Rate: 48 kHz, 16 bit
  - Mean Words per Clip: 16
- [AISHELL-3](http://www.aishelltech.com/aishell_3)
  - Duration/h: 85
  - Number of Sentences: 88,035
  - Size: 17.75 GB
  - Speaker: 218
  - Sample Rate: 44.1 kHz, 16 bit

## English
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) (a metadata-loading sketch follows this list)
  - Duration/h: 24
  - Number of Sentences: 13,100
  - Size: 2.56 GB
  - Speaker: 1, ages 20–30
  - Sample Rate: 22,050 Hz, 16 bit
  - Mean Words per Clip: 17.23
- [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
  - Number of Sentences: 44,583
  - Size: 10.94 GB
  - Speaker: 110
  - Sample Rate: 48 kHz, 16 bit
  - Mean Words per Clip: 17.23
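
Per-clip statistics like the ones above can be recomputed from a dataset's metadata. As a hedged illustration for LJSpeech only: its `metadata.csv` is pipe-delimited (`id|raw transcript|normalized transcript`), so a minimal Python sketch might look like this (the local path is an assumption):

```python
from pathlib import Path

# Assumed local path to an extracted LJSpeech-1.1 archive (hypothetical).
LJSPEECH_ROOT = Path("LJSpeech-1.1")

def load_ljspeech_metadata(root: Path):
    """Yield (clip_id, normalized_text) pairs from LJSpeech's metadata.csv."""
    with open(root / "metadata.csv", encoding="utf-8") as f:
        for line in f:
            # Each row is: id|raw transcript|normalized transcript
            clip_id, _raw, normalized = line.rstrip("\n").split("|", 2)
            yield clip_id, normalized

entries = list(load_ljspeech_metadata(LJSPEECH_ROOT))
words = [len(text.split()) for _, text in entries]
print(f"clips: {len(entries)}")                        # expect 13,100
print(f"mean words/clip: {sum(words) / len(words):.2f}")
```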
## Japanese
<!--
see https://sites.google.com/site/shinnosuketakamichi/publication/corpus
-->

- [tri-jek](https://sites.google.com/site/shinnosuketakamichi/research-topics/tri-jek_corpus): Japanese-English-Korean tri-lingual corpus
- [JSSS-misc](https://sites.google.com/site/shinnosuketakamichi/research-topics/jsss-misc_corpus): misc tasks of the JSSS corpus
- [JTubeSpeech](https://github.com/sarulab-speech/jtubespeech): corpus of Japanese speech collected from YouTube
- [J-MAC](https://sites.google.com/site/shinnosuketakamichi/research-topics/j-mac_corpus): Japanese multi-speaker audiobook corpus
- [J-KAC](https://sites.google.com/site/shinnosuketakamichi/research-topics/j-kac_corpus): Japanese Kamishibai and audiobook corpus
- [JMD](https://sites.google.com/site/shinnosuketakamichi/research-topics/jmd_corpus): Japanese multi-dialect corpus
- [JSSS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jsss_corpus): Japanese multi-style (summarization and simplification) corpus
- [RWCP-SSD-Onomatopoeia](https://www.ksuke.net/dataset/rwcp-ssd-onomatopoeia): onomatopoeic word dataset for environmental sounds
- [Life-m](https://sites.google.com/site/shinnosuketakamichi/research-topics/life-m_corpus): landmark image-themed music corpus
- [PJS](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus): phoneme-balanced Japanese singing voice corpus
- [JVS-MuSiC](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music): Japanese multi-speaker singing-voice corpus
- [JVS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus): Japanese multi-speaker voice corpus
- [JSUT-book](https://sites.google.com/site/shinnosuketakamichi/publication/jsut-book): audiobook corpus by a single Japanese speaker
- [JSUT-vi](https://sites.google.com/site/shinnosuketakamichi/publication/jsut-vi): vocal imitation corpus by a single Japanese speaker
- [JSUT-song](https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song): singing voice corpus by a single Japanese singer
- [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut): a large-scale corpus of reading-style Japanese speech by a single speaker

## Emotions
### English
- [CREMA-D](https://github.com/CheyneyComputerScience/CREMA-D)
- [Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset](https://kunzhou9646.github.io/controllable-evc/)
  - paper: [Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset](https://arxiv.org/abs/2010.14794)
### Mandarin
- [EMOVIE Dataset](https://viem-ccy.github.io/EMOVIE/dataset_release)
  - paper: [EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model](https://arxiv.org/abs/2106.09317)
- MASC
  - paper: [MASC: A Speech Corpus in Mandarin for Emotion Analysis and Affective Speaker Recognition](https://ieeexplore.ieee.org/document/4013501)
### English and Mandarin
- [Emotional Voice Conversion: Theory, Databases and ESD](https://github.com/HLTSingapore/Emotional-Speech-Data)
  - paper: [Emotional Voice Conversion: Theory, Databases and ESD](https://arxiv.org/abs/2105.14762)

## Music
- [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano)
- [MAESTRO Dataset](https://magenta.tensorflow.org/datasets/maestro)
  - [TensorFlow code](https://www.tensorflow.org/tutorials/audio/music_generation)
- [Opencpop](https://wenet.org.cn/opencpop/)
@@ -0,0 +1,19 @@
```bash
#!/bin/bash

train_output_path=$1

stage=0
stop_stage=0

# Stage 0: synthesize with the exported static (inference) model.
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=fastspeech2_aishell3 \
        --voc=pwgan_aishell3 \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt \
        --speaker_dict=dump/speaker_id_map.txt \
        --spk_id=0
fi
```
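
This script follows the Kaldi-style `stage`/`stop_stage` convention: each numbered block runs only if it falls inside the requested range, so individual steps can be re-run in isolation. It expects `BIN_DIR` to be exported beforehand (in PaddleSpeech examples this is typically done by the example's `path.sh`) and takes the training output directory as its single positional argument, so an invocation would look roughly like `./local/inference.sh exp/default` (the exact script path is an assumption).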
@@ -0,0 +1,91 @@
```yaml
# This configuration is for Paddle to train Tacotron 2. Compared to the
# original paper, this configuration additionally uses the guided attention
# loss to accelerate the learning of the diagonal attention. It requires
# only a single GPU with 12 GB memory, and it takes ~1 day to finish the
# training on a Titan V.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################

fs: 24000          # Sampling rate.
n_fft: 2048        # FFT size (samples).
n_shift: 300       # Hop size (samples), 12.5 ms.
win_length: 1200   # Window length (samples), 50 ms.
                   # If set to null, it will be the same as fft_size.
window: "hann"     # Window function.

# Only used for feats_type != raw

fmin: 80           # Minimum frequency of Mel basis.
fmax: 7600         # Maximum frequency of Mel basis.
n_mels: 80         # The number of mel basis.

###########################################################
#                       DATA SETTING                      #
###########################################################
batch_size: 64
num_workers: 2

###########################################################
#                      MODEL SETTING                      #
###########################################################
model:                       # keyword arguments for the selected model
    embed_dim: 512           # char or phn embedding dimension
    elayers: 1               # number of blstm layers in encoder
    eunits: 512              # number of blstm units
    econv_layers: 3          # number of convolutional layers in encoder
    econv_chans: 512         # number of channels in convolutional layer
    econv_filts: 5           # filter size of convolutional layer
    atype: location          # attention function type
    adim: 512                # attention dimension
    aconv_chans: 32          # number of channels in convolutional layer of attention
    aconv_filts: 15          # filter size of convolutional layer of attention
    cumulate_att_w: True     # whether to cumulate attention weight
    dlayers: 2               # number of lstm layers in decoder
    dunits: 1024             # number of lstm units in decoder
    prenet_layers: 2         # number of layers in prenet
    prenet_units: 256        # number of units in prenet
    postnet_layers: 5        # number of layers in postnet
    postnet_chans: 512       # number of channels in postnet
    postnet_filts: 5         # filter size of postnet layer
    output_activation: null  # activation function for the final output
    use_batch_norm: True     # whether to use batch normalization in encoder
    use_concate: True        # whether to concatenate encoder embedding with decoder outputs
    use_residual: False      # whether to use residual connection in encoder
    dropout_rate: 0.5        # dropout rate
    zoneout_rate: 0.1        # zoneout rate
    reduction_factor: 1      # reduction factor
    spk_embed_dim: null      # speaker embedding dimension

###########################################################
#                      UPDATER SETTING                    #
###########################################################
updater:
    use_masking: True             # whether to apply masking for padded part in loss calculation
    bce_pos_weight: 5.0           # weight of positive sample in binary cross entropy calculation
    use_guided_attn_loss: True    # whether to use guided attention loss
    guided_attn_loss_sigma: 0.4   # sigma of guided attention loss
    guided_attn_loss_lambda: 1.0  # strength of guided attention loss

###########################################################
#                     OPTIMIZER SETTING                   #
###########################################################
optimizer:
    optim: adam              # optimizer type
    learning_rate: 1.0e-03   # learning rate
    epsilon: 1.0e-06         # epsilon
    weight_decay: 0.0        # weight decay coefficient

###########################################################
#                     TRAINING SETTING                    #
###########################################################
max_epoch: 200
num_snapshots: 5

###########################################################
#                       OTHER SETTING                     #
###########################################################
seed: 42
```
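
The FEATURE EXTRACTION SETTING block above fully determines the mel front end: 24 kHz audio, a 2048-point FFT, a 12.5 ms hop, a 50 ms Hann window, and 80 mel bands between 80 and 7600 Hz. As a rough illustration of those parameters (not the repository's own extraction code), here is a minimal sketch using librosa, which is an assumed dependency:

```python
import librosa
import numpy as np

# Parameters mirroring the FEATURE EXTRACTION SETTING block above.
FS = 24000         # fs
N_FFT = 2048       # n_fft
N_SHIFT = 300      # n_shift (12.5 ms at 24 kHz)
WIN_LENGTH = 1200  # win_length (50 ms at 24 kHz)
N_MELS = 80        # n_mels
FMIN, FMAX = 80, 7600

def logmel(wav_path: str) -> np.ndarray:
    """Compute a log-mel spectrogram with shape (n_mels, n_frames)."""
    y, _ = librosa.load(wav_path, sr=FS)
    mel = librosa.feature.melspectrogram(
        y=y, sr=FS, n_fft=N_FFT, hop_length=N_SHIFT,
        win_length=WIN_LENGTH, window="hann",
        n_mels=N_MELS, fmin=FMIN, fmax=FMAX,
    )
    # Floor before the log to avoid -inf on silent frames.
    return np.log(np.maximum(mel, 1e-10))
```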