
Commit

Merge branch 'develop' of https://github.com/PaddlePaddle/DeepSpeech into ctcdecoders
Jackwaterveg committed Jan 19, 2022
2 parents 4756c7d + 97db74c commit 54f9711
Showing 63 changed files with 3,509 additions and 634 deletions.
1 change: 0 additions & 1 deletion README.md
@@ -463,7 +463,6 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
 - [Automatic Speech Recognition](./docs/source/asr/quick_start.md)
 - [Introduction](./docs/source/asr/models_introduction.md)
 - [Data Preparation](./docs/source/asr/data_preparation.md)
-- [Data Augmentation](./docs/source/asr/augmentation.md)
 - [Ngram LM](./docs/source/asr/ngram_lm.md)
 - [Text-to-Speech](./docs/source/tts/quick_start.md)
 - [Introduction](./docs/source/tts/models_introduction.md)
1 change: 0 additions & 1 deletion README_cn.md
@@ -468,7 +468,6 @@ PaddleSpeech's **speech synthesis** mainly consists of three modules: text frontend, …
 - [Customized Training for Speech Recognition](./docs/source/asr/quick_start.md)
 - [Introduction](./docs/source/asr/models_introduction.md)
 - [Data Preparation](./docs/source/asr/data_preparation.md)
-- [Data Augmentation](./docs/source/asr/augmentation.md)
 - [Ngram LM](./docs/source/asr/ngram_lm.md)
 - [Customized Training for Text-to-Speech](./docs/source/tts/quick_start.md)
 - [Introduction](./docs/source/tts/models_introduction.md)
40 changes: 0 additions & 40 deletions docs/source/asr/augmentation.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/source/index.rst
@@ -27,7 +27,6 @@ Contents

 asr/models_introduction
 asr/data_preparation
-asr/augmentation
 asr/feature_list
 asr/ngram_lm

75 changes: 75 additions & 0 deletions docs/source/tts/tts_datasets.md
@@ -0,0 +1,75 @@
# TTS Datasets
<!--
see https://openslr.org/
-->
## Mandarin
- [CSMSC](https://www.data-baker.com/open_source.html): Chinese Standard Mandarin Speech Corpus
  - Duration/h: 12
  - Number of Sentences: 10,000
  - Size: 2.14 GB
  - Speaker: 1 female, age 20~30
  - Sample Rate: 48 kHz, 16 bit
  - Mean Words per Clip: 16
- [AISHELL-3](http://www.aishelltech.com/aishell_3)
  - Duration/h: 85
  - Number of Sentences: 88,035
  - Size: 17.75 GB
  - Speaker: 218
  - Sample Rate: 44.1 kHz, 16 bit

## English
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
  - Duration/h: 24
  - Number of Sentences: 13,100
  - Size: 2.56 GB
  - Speaker: 1, age 20~30
  - Sample Rate: 22050 Hz, 16 bit
  - Mean Words per Clip: 17.23
- [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
  - Number of Sentences: 44,583
  - Size: 10.94 GB
  - Speaker: 110
  - Sample Rate: 48 kHz, 16 bit
  - Mean Words per Clip: 17.23
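
These statistics can be spot-checked directly from a corpus's metadata. A minimal sketch for LJSpeech, assuming a local extracted copy (the path is illustrative) and its pipe-delimited `metadata.csv` (`id|raw transcript|normalized transcript`):

```python
import csv

# Hypothetical path to an extracted LJSpeech-1.1 archive.
META = "LJSpeech-1.1/metadata.csv"

with open(META, encoding="utf-8") as f:
    # Pipe-delimited, no header; QUOTE_NONE because transcripts contain quotes.
    rows = list(csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE))

n_clips = len(rows)                          # expected: 13,100
words = [len(r[-1].split()) for r in rows]   # last field: normalized transcript
print(n_clips, sum(words) / n_clips)         # mean words per clip, ~17
```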

## Japanese
<!--
see https://sites.google.com/site/shinnosuketakamichi/publication/corpus
-->

- [tri-jek](https://sites.google.com/site/shinnosuketakamichi/research-topics/tri-jek_corpus): Japanese-English-Korean tri-lingual corpus
- [JSSS-misc](https://sites.google.com/site/shinnosuketakamichi/research-topics/jsss-misc_corpus): misc tasks of JSSS corpus
- [JTubeSpeech](https://github.com/sarulab-speech/jtubespeech): Corpus of Japanese speech collected from YouTube
- [J-MAC](https://sites.google.com/site/shinnosuketakamichi/research-topics/j-mac_corpus): Japanese multi-speaker audiobook corpus
- [J-KAC](https://sites.google.com/site/shinnosuketakamichi/research-topics/j-kac_corpus): Japanese Kamishibai and audiobook corpus
- [JMD](https://sites.google.com/site/shinnosuketakamichi/research-topics/jmd_corpus): Japanese multi-dialect corpus
- [JSSS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jsss_corpus): Japanese multi-style (summarization and simplification) corpus
- [RWCP-SSD-Onomatopoeia](https://www.ksuke.net/dataset/rwcp-ssd-onomatopoeia): onomatopoeic word dataset for environmental sounds
- [Life-m](https://sites.google.com/site/shinnosuketakamichi/research-topics/life-m_corpus): landmark image-themed music corpus
- [PJS](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus): Phoneme-balanced Japanese singing voice corpus
- [JVS-MuSiC](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music): Japanese multi-speaker singing-voice corpus
- [JVS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus): Japanese multi-speaker voice corpus
- [JSUT-book](https://sites.google.com/site/shinnosuketakamichi/publication/jsut-book): audiobook corpus by a single Japanese speaker
- [JSUT-vi](https://sites.google.com/site/shinnosuketakamichi/publication/jsut-vi): vocal imitation corpus by a single Japanese speaker
- [JSUT-song](https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song): singing voice corpus by a single Japanese singer
- [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut): a large-scale corpus of reading-style Japanese speech by a single speaker

## Emotions
### English
- [CREMA-D](https://github.com/CheyneyComputerScience/CREMA-D)
- [Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset](https://kunzhou9646.github.io/controllable-evc/)
  - paper: [Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset](https://arxiv.org/abs/2010.14794)
### Mandarin
- [EMOVIE Dataset](https://viem-ccy.github.io/EMOVIE/dataset_release)
  - paper: [EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model](https://arxiv.org/abs/2106.09317)
- MASC
  - paper: [MASC: A Speech Corpus in Mandarin for Emotion Analysis and Affective Speaker Recognition](https://ieeexplore.ieee.org/document/4013501)
### English & Mandarin
- [Emotional Voice Conversion: Theory, Databases and ESD](https://github.com/HLTSingapore/Emotional-Speech-Data)
  - paper: [Emotional Voice Conversion: Theory, Databases and ESD](https://arxiv.org/abs/2105.14762)

## Music
- [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano)
- [MAESTRO Dataset](https://magenta.tensorflow.org/datasets/maestro)
  - [tf code](https://www.tensorflow.org/tutorials/audio/music_generation)
- [Opencpop](https://wenet.org.cn/opencpop/)
3 changes: 2 additions & 1 deletion examples/aishell3/tts3/README.md
@@ -257,6 +257,7 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
   --output_dir=exp/default/test_e2e \
   --phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
   --speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \
-  --spk_id=0
+  --spk_id=0 \
+  --inference_dir=exp/default/inference
 
 ```
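
Passing `--inference_dir` additionally exports a static-graph model for Paddle Inference. A minimal loading sketch; the exported file names below are an assumption, so check the actual contents of `exp/default/inference`:

```python
from paddle.inference import Config, create_predictor

# Hypothetical file names produced by the export step.
config = Config("exp/default/inference/fastspeech2_aishell3.pdmodel",
                "exp/default/inference/fastspeech2_aishell3.pdiparams")
config.disable_gpu()          # CPU inference; use enable_use_gpu(...) for GPU
predictor = create_predictor(config)
print(predictor.get_input_names())  # e.g. phone ids and a speaker id
```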
9 changes: 4 additions & 5 deletions examples/aishell3/tts3/conf/default.yaml
@@ -16,8 +16,8 @@ fmax: 7600 # Maximum frequency of Mel basis.
 n_mels: 80 # The number of mel basis.
 
 # Only used for the model using pitch features (e.g. FastSpeech2)
-f0min: 80 # Maximum f0 for pitch extraction.
-f0max: 400 # Minimum f0 for pitch extraction.
+f0min: 80 # Minimum f0 for pitch extraction.
+f0max: 400 # Maximum f0 for pitch extraction.
 
 
 ###########################################################
@@ -64,14 +64,14 @@ model:
     pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
     pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
     pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
-    stop_gradient_from_pitch_predictor: true # whether to stop the gradient from pitch predictor to encoder
+    stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
     energy_predictor_layers: 2 # number of conv layers in energy predictor
     energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
     energy_predictor_kernel_size: 3 # kernel size of conv layers in energy predictor
     energy_predictor_dropout: 0.5 # dropout rate in energy predictor
     energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
     energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
-    stop_gradient_from_energy_predictor: false # whether to stop the gradient from energy predictor to encoder
+    stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
     spk_embed_dim: 256 # speaker embedding dimension
     spk_embed_integration_type: concat # speaker embedding integration type
 
@@ -84,7 +84,6 @@ updater:
     use_masking: True # whether to apply masking for padded part in loss calculation
 
-
 
 ###########################################################
 # OPTIMIZER SETTING #
 ###########################################################
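
The `f0min`/`f0max` bounds above feed straight into pitch extraction. A minimal sketch of bounded F0 extraction using `pyworld` (a common choice for this; whether this recipe uses it is an assumption, and the wav path is illustrative):

```python
import numpy as np
import pyworld
import soundfile as sf

# Illustrative input: any mono wav at the recipe's 24 kHz sample rate.
wav, fs = sf.read("exp/default/test.wav")
x = wav.astype(np.float64)                    # pyworld expects float64

# Coarse F0 search bounded by the config's f0min/f0max, then refinement.
f0, t = pyworld.dio(x, fs, f0_floor=80.0, f0_ceil=400.0,
                    frame_period=1000.0 * 300 / fs)   # hop = n_shift samples
f0 = pyworld.stonemask(x, f0, t, fs)          # refine the coarse estimate
print(f0[f0 > 0].min(), f0.max())             # voiced F0 stays near [80, 400] Hz
```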
19 changes: 19 additions & 0 deletions examples/aishell3/tts3/local/inference.sh
@@ -0,0 +1,19 @@
#!/bin/bash

train_output_path=$1

stage=0
stop_stage=0

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_aishell3 \
--voc=pwgan_aishell3 \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0
fi

3 changes: 2 additions & 1 deletion examples/aishell3/tts3/local/synthesize_e2e.sh
@@ -20,4 +20,5 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
   --output_dir=${train_output_path}/test_e2e \
   --phones_dict=dump/phone_id_map.txt \
   --speaker_dict=dump/speaker_id_map.txt \
-  --spk_id=0
+  --spk_id=0 \
+  --inference_dir=${train_output_path}/inference
8 changes: 4 additions & 4 deletions examples/aishell3/vc1/conf/default.yaml
@@ -16,8 +16,8 @@ fmax: 7600 # Maximum frequency of Mel basis.
 n_mels: 80 # The number of mel basis.
 
 # Only used for the model using pitch features (e.g. FastSpeech2)
-f0min: 80 # Maximum f0 for pitch extraction.
-f0max: 400 # Minimum f0 for pitch extraction.
+f0min: 80 # Minimum f0 for pitch extraction.
+f0max: 400 # Maximum f0 for pitch extraction.
 
 
 ###########################################################
@@ -64,14 +64,14 @@ model:
     pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
     pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
     pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
-    stop_gradient_from_pitch_predictor: true # whether to stop the gradient from pitch predictor to encoder
+    stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
     energy_predictor_layers: 2 # number of conv layers in energy predictor
     energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
     energy_predictor_kernel_size: 3 # kernel size of conv layers in energy predictor
     energy_predictor_dropout: 0.5 # dropout rate in energy predictor
     energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
     energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
-    stop_gradient_from_energy_predictor: false # whether to stop the gradient from energy predictor to encoder
+    stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
     spk_embed_dim: 256 # speaker embedding dimension
     spk_embed_integration_type: concat # speaker embedding integration type
 
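
What the `stop_gradient_from_*_predictor` flags control, as a minimal sketch (assuming an encoder-state tensor `hs`; names are illustrative, not PaddleSpeech's actual code):

```python
import paddle

def predictor_input(hs: paddle.Tensor, stop_gradient: bool) -> paddle.Tensor:
    # With the flag on, encoder states are detached before the variance
    # predictor, so the pitch/energy loss cannot back-propagate into the
    # encoder and destabilize it early in training.
    return hs.detach() if stop_gradient else hs
```

In this recipe the flag is True for the pitch predictor and False for the energy predictor, so only the pitch loss is blocked from reaching the encoder.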
6 changes: 3 additions & 3 deletions examples/aishell3/voc1/conf/default.yaml
@@ -33,7 +33,7 @@ generator_params:
     aux_context_window: 2 # Context window size for auxiliary feature.
     # If set to 2, previous 2 and future 2 frames will be considered.
     dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
-    use_weight_norm: true # Whether to use weight norm.
+    use_weight_norm: True # Whether to use weight norm.
     # If set to true, it will be applied to all of the conv layers.
     upsample_scales: [4, 5, 3, 5] # Upsampling scales. prod(upsample_scales) == n_shift
 
@@ -46,8 +46,8 @@ discriminator_params:
     kernel_size: 3 # Kernel size of conv layers.
     layers: 10 # Number of conv layers.
     conv_channels: 64 # Number of channels in conv layers.
-    bias: true # Whether to use bias parameter in conv.
-    use_weight_norm: true # Whether to use weight norm.
+    bias: True # Whether to use bias parameter in conv.
+    use_weight_norm: True # Whether to use weight norm.
     # If set to true, it will be applied to all of the conv layers.
     nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
     nonlinear_activation_params: # Nonlinear function parameters
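
For reference, the constraint in the generator comment holds in this recipe: prod([4, 5, 3, 5]) = 300, which should match the hop size used by the companion 24 kHz feature configs (`n_shift: 300`), so the vocoder upsamples one mel frame to exactly one hop of waveform samples.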
91 changes: 91 additions & 0 deletions examples/csmsc/tts0/conf/default.yaml
@@ -0,0 +1,91 @@
# This configuration is for Paddle to train Tacotron 2. Compared to the
# original paper, this configuration additionally uses the guided attention
# loss to accelerate the learning of the diagonal attention. It requires
# only a single GPU with 12 GB memory, and it takes ~1 day to finish the
# training on a Titan V.

###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################

fs: 24000 # Sample rate (Hz).
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as n_fft.
window: "hann" # Window function.

# Only used for feats_type != raw

fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.

###########################################################
# DATA SETTING #
###########################################################
batch_size: 64
num_workers: 2

###########################################################
# MODEL SETTING #
###########################################################
model: # keyword arguments for the selected model
    embed_dim: 512 # char or phn embedding dimension
    elayers: 1 # number of blstm layers in encoder
    eunits: 512 # number of blstm units
    econv_layers: 3 # number of convolutional layers in encoder
    econv_chans: 512 # number of channels in convolutional layer
    econv_filts: 5 # filter size of convolutional layer
    atype: location # attention function type
    adim: 512 # attention dimension
    aconv_chans: 32 # number of channels in convolutional layer of attention
    aconv_filts: 15 # filter size of convolutional layer of attention
    cumulate_att_w: True # whether to cumulate attention weight
    dlayers: 2 # number of lstm layers in decoder
    dunits: 1024 # number of lstm units in decoder
    prenet_layers: 2 # number of layers in prenet
    prenet_units: 256 # number of units in prenet
    postnet_layers: 5 # number of layers in postnet
    postnet_chans: 512 # number of channels in postnet
    postnet_filts: 5 # filter size of postnet layer
    output_activation: null # activation function for the final output
    use_batch_norm: True # whether to use batch normalization in encoder
    use_concate: True # whether to concatenate encoder embedding with decoder outputs
    use_residual: False # whether to use residual connection in encoder
    dropout_rate: 0.5 # dropout rate
    zoneout_rate: 0.1 # zoneout rate
    reduction_factor: 1 # reduction factor
    spk_embed_dim: null # speaker embedding dimension


###########################################################
# UPDATER SETTING #
###########################################################
updater:
    use_masking: True # whether to apply masking for padded part in loss calculation
    bce_pos_weight: 5.0 # weight of positive sample in binary cross entropy calculation
    use_guided_attn_loss: True # whether to use guided attention loss
    guided_attn_loss_sigma: 0.4 # sigma of guided attention loss
    guided_attn_loss_lambda: 1.0 # strength of guided attention loss


##########################################################
# OPTIMIZER SETTING #
##########################################################
optimizer:
    optim: adam # optimizer type
    learning_rate: 1.0e-03 # learning rate
    epsilon: 1.0e-06 # epsilon
    weight_decay: 0.0 # weight decay coefficient

###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 200
num_snapshots: 5

###########################################################
# OTHER SETTING #
###########################################################
seed: 42
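
For reference, the guided attention loss configured above (sigma = 0.4) penalizes attention mass far from the diagonal, following Tachibana et al. (2017). A minimal NumPy sketch of the weight matrix and loss; variable names and the toy alignment are illustrative:

```python
import numpy as np

def guided_attention_weight(n_text: int, n_frames: int, sigma: float = 0.4) -> np.ndarray:
    """W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2 sigma^2)); near zero on the diagonal."""
    n = np.arange(n_text)[:, None] / n_text
    t = np.arange(n_frames)[None, :] / n_frames
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * sigma ** 2))

# Toy alignment: 20 text tokens, 30 decoder frames; each frame's attention
# over the text sums to 1.
att = np.random.dirichlet(np.ones(20), size=30).T   # shape (20, 30)
W = guided_attention_weight(20, 30, sigma=0.4)      # guided_attn_loss_sigma
loss = 1.0 * np.mean(att * W)                       # guided_attn_loss_lambda
print(loss)  # diagonal-ish alignments give smaller values
```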
