
Commit

Merge branch 'develop' of https://github.com/PaddlePaddle/DeepSpeech into ctcdecoders
Jackwaterveg committed Jan 19, 2022
2 parents 4756c7d + 97db74c commit 54f9711
Showing 63 changed files with 3,509 additions and 634 deletions.
1 change: 0 additions & 1 deletion README.md
@@ -463,7 +463,6 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
 - [Automatic Speech Recognition](./docs/source/asr/quick_start.md)
 - [Introduction](./docs/source/asr/models_introduction.md)
 - [Data Preparation](./docs/source/asr/data_preparation.md)
-- [Data Augmentation](./docs/source/asr/augmentation.md)
 - [Ngram LM](./docs/source/asr/ngram_lm.md)
 - [Text-to-Speech](./docs/source/tts/quick_start.md)
 - [Introduction](./docs/source/tts/models_introduction.md)
1 change: 0 additions & 1 deletion README_cn.md
@@ -468,7 +468,6 @@ PaddleSpeech's **speech synthesis** mainly consists of three modules: text frontend, …
 - [Customized Training for Speech Recognition](./docs/source/asr/quick_start.md)
 - [Introduction](./docs/source/asr/models_introduction.md)
 - [Data Preparation](./docs/source/asr/data_preparation.md)
-- [Data Augmentation](./docs/source/asr/augmentation.md)
 - [Ngram LM](./docs/source/asr/ngram_lm.md)
 - [Customized Training for Text-to-Speech](./docs/source/tts/quick_start.md)
 - [Introduction](./docs/source/tts/models_introduction.md)
40 changes: 0 additions & 40 deletions docs/source/asr/augmentation.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/source/index.rst
@@ -27,7 +27,6 @@ Contents

 asr/models_introduction
 asr/data_preparation
-asr/augmentation
 asr/feature_list
 asr/ngram_lm

75 changes: 75 additions & 0 deletions docs/source/tts/tts_datasets.md
@@ -0,0 +1,75 @@
# TTS Datasets
<!--
see https://openslr.org/
-->
## Mandarin
- [CSMSC](https://www.data-baker.com/open_source.html): Chinese Standard Mandarin Speech Corpus
  - Duration/h: 12
  - Number of Sentences: 10,000
  - Size: 2.14 GB
  - Speaker: 1 female, age 20~30
  - Sample Rate: 48 kHz, 16 bit
  - Mean Words per Clip: 16
- [AISHELL-3](http://www.aishelltech.com/aishell_3)
  - Duration/h: 85
  - Number of Sentences: 88,035
  - Size: 17.75 GB
  - Speaker: 218
  - Sample Rate: 44.1 kHz, 16 bit

## English
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
  - Duration/h: 24
  - Number of Sentences: 13,100
  - Size: 2.56 GB
  - Speaker: 1, age 20~30
  - Sample Rate: 22050 Hz, 16 bit
  - Mean Words per Clip: 17.23
- [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)
  - Number of Sentences: 44,583
  - Size: 10.94 GB
  - Speaker: 110
  - Sample Rate: 48 kHz, 16 bit
  - Mean Words per Clip: 17.23
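
These statistics can be spot-checked directly from a corpus's metadata. A minimal sketch for LJSpeech, assuming a local extracted copy (the path is illustrative) and its pipe-delimited `metadata.csv` (`id|raw transcript|normalized transcript`):

```python
import csv

# Hypothetical path to an extracted LJSpeech-1.1 archive.
META = "LJSpeech-1.1/metadata.csv"

with open(META, encoding="utf-8") as f:
    # Pipe-delimited, no header; QUOTE_NONE because transcripts contain quotes.
    rows = list(csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE))

n_clips = len(rows)                          # expected: 13,100
words = [len(r[-1].split()) for r in rows]   # last field: normalized transcript
print(n_clips, sum(words) / n_clips)         # mean words per clip, ~17
```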

## Japanese
<!--
see https://sites.google.com/site/shinnosuketakamichi/publication/corpus
-->

- [tri-jek](https://sites.google.com/site/shinnosuketakamichi/research-topics/tri-jek_corpus): Japanese-English-Korean tri-lingual corpus
- [JSSS-misc](https://sites.google.com/site/shinnosuketakamichi/research-topics/jsss-misc_corpus): misc tasks of JSSS corpus
- [JTubeSpeech](https://github.com/sarulab-speech/jtubespeech): Corpus of Japanese speech collected from YouTube
- [J-MAC](https://sites.google.com/site/shinnosuketakamichi/research-topics/j-mac_corpus): Japanese multi-speaker audiobook corpus
- [J-KAC](https://sites.google.com/site/shinnosuketakamichi/research-topics/j-kac_corpus): Japanese Kamishibai and audiobook corpus
- [JMD](https://sites.google.com/site/shinnosuketakamichi/research-topics/jmd_corpus): Japanese multi-dialect corpus
- [JSSS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jsss_corpus): Japanese multi-style (summarization and simplification) corpus
- [RWCP-SSD-Onomatopoeia](https://www.ksuke.net/dataset/rwcp-ssd-onomatopoeia): onomatopoeic word dataset for environmental sounds
- [Life-m](https://sites.google.com/site/shinnosuketakamichi/research-topics/life-m_corpus): landmark image-themed music corpus
- [PJS](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus): Phoneme-balanced Japanese singing voice corpus
- [JVS-MuSiC](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music): Japanese multi-speaker singing-voice corpus
- [JVS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus): Japanese multi-speaker voice corpus
- [JSUT-book](https://sites.google.com/site/shinnosuketakamichi/publication/jsut-book): audiobook corpus by a single Japanese speaker
- [JSUT-vi](https://sites.google.com/site/shinnosuketakamichi/publication/jsut-vi): vocal imitation corpus by a single Japanese speaker
- [JSUT-song](https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song): singing voice corpus by a single Japanese singer
- [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut): a large-scale corpus of reading-style Japanese speech by a single speaker

## Emotions
### English
- [CREMA-D](https://github.com/CheyneyComputerScience/CREMA-D)
- [Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset](https://kunzhou9646.github.io/controllable-evc/)
  - paper: [Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset](https://arxiv.org/abs/2010.14794)
### Mandarin
- [EMOVIE Dataset](https://viem-ccy.github.io/EMOVIE/dataset_release)
  - paper: [EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model](https://arxiv.org/abs/2106.09317)
- MASC
  - paper: [MASC: A Speech Corpus in Mandarin for Emotion Analysis and Affective Speaker Recognition](https://ieeexplore.ieee.org/document/4013501)
### English & Mandarin
- [Emotional Voice Conversion: Theory, Databases and ESD](https://github.com/HLTSingapore/Emotional-Speech-Data)
  - paper: [Emotional Voice Conversion: Theory, Databases and ESD](https://arxiv.org/abs/2105.14762)

## Music
- [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano)
- [MAESTRO Dataset](https://magenta.tensorflow.org/datasets/maestro)
  - [tf code](https://www.tensorflow.org/tutorials/audio/music_generation)
- [Opencpop](https://wenet.org.cn/opencpop/)
3 changes: 2 additions & 1 deletion examples/aishell3/tts3/README.md
@@ -257,6 +257,7 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
   --output_dir=exp/default/test_e2e \
   --phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
   --speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \
-  --spk_id=0
+  --spk_id=0 \
+  --inference_dir=exp/default/inference
 
 ```
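
Passing `--inference_dir` additionally exports a static-graph model for Paddle Inference. A minimal loading sketch; the exported file names below are an assumption, so check the actual contents of `exp/default/inference`:

```python
from paddle.inference import Config, create_predictor

# Hypothetical file names produced by the export step.
config = Config("exp/default/inference/fastspeech2_aishell3.pdmodel",
                "exp/default/inference/fastspeech2_aishell3.pdiparams")
config.disable_gpu()          # CPU inference; use enable_use_gpu(...) for GPU
predictor = create_predictor(config)
print(predictor.get_input_names())  # e.g. phone ids and a speaker id
```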
9 changes: 4 additions & 5 deletions examples/aishell3/tts3/conf/default.yaml
@@ -16,8 +16,8 @@ fmax: 7600 # Maximum frequency of Mel basis.
 n_mels: 80 # The number of mel basis.
 
 # Only used for the model using pitch features (e.g. FastSpeech2)
-f0min: 80 # Maximum f0 for pitch extraction.
-f0max: 400 # Minimum f0 for pitch extraction.
+f0min: 80 # Minimum f0 for pitch extraction.
+f0max: 400 # Maximum f0 for pitch extraction.
 
 
 ###########################################################
@@ -64,14 +64,14 @@ model:
     pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
     pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
     pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
-    stop_gradient_from_pitch_predictor: true # whether to stop the gradient from pitch predictor to encoder
+    stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
     energy_predictor_layers: 2 # number of conv layers in energy predictor
     energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
     energy_predictor_kernel_size: 3 # kernel size of conv layers in energy predictor
     energy_predictor_dropout: 0.5 # dropout rate in energy predictor
     energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
     energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
-    stop_gradient_from_energy_predictor: false # whether to stop the gradient from energy predictor to encoder
+    stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
     spk_embed_dim: 256 # speaker embedding dimension
     spk_embed_integration_type: concat # speaker embedding integration type
 
@@ -84,7 +84,6 @@ updater:
     use_masking: True # whether to apply masking for padded part in loss calculation
 
-
 
 ###########################################################
 # OPTIMIZER SETTING #
 ###########################################################
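
The `f0min`/`f0max` bounds above feed straight into pitch extraction. A minimal sketch of bounded F0 extraction using `pyworld` (a common choice for this; whether this recipe uses it is an assumption, and the wav path is illustrative):

```python
import numpy as np
import pyworld
import soundfile as sf

# Illustrative input: any mono wav at the recipe's 24 kHz sample rate.
wav, fs = sf.read("exp/default/test.wav")
x = wav.astype(np.float64)                    # pyworld expects float64

# Coarse F0 search bounded by the config's f0min/f0max, then refinement.
f0, t = pyworld.dio(x, fs, f0_floor=80.0, f0_ceil=400.0,
                    frame_period=1000.0 * 300 / fs)   # hop = n_shift samples
f0 = pyworld.stonemask(x, f0, t, fs)          # refine the coarse estimate
print(f0[f0 > 0].min(), f0.max())             # voiced F0 stays near [80, 400] Hz
```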
19 changes: 19 additions & 0 deletions examples/aishell3/tts3/local/inference.sh
@@ -0,0 +1,19 @@
#!/bin/bash

train_output_path=$1

stage=0
stop_stage=0

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_aishell3 \
--voc=pwgan_aishell3 \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0
fi

3 changes: 2 additions & 1 deletion examples/aishell3/tts3/local/synthesize_e2e.sh
@@ -20,4 +20,5 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
   --output_dir=${train_output_path}/test_e2e \
   --phones_dict=dump/phone_id_map.txt \
   --speaker_dict=dump/speaker_id_map.txt \
-  --spk_id=0
+  --spk_id=0 \
+  --inference_dir=${train_output_path}/inference
8 changes: 4 additions & 4 deletions examples/aishell3/vc1/conf/default.yaml
@@ -16,8 +16,8 @@ fmax: 7600 # Maximum frequency of Mel basis.
 n_mels: 80 # The number of mel basis.
 
 # Only used for the model using pitch features (e.g. FastSpeech2)
-f0min: 80 # Maximum f0 for pitch extraction.
-f0max: 400 # Minimum f0 for pitch extraction.
+f0min: 80 # Minimum f0 for pitch extraction.
+f0max: 400 # Maximum f0 for pitch extraction.
 
 
 ###########################################################
@@ -64,14 +64,14 @@ model:
     pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
     pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
     pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
-    stop_gradient_from_pitch_predictor: true # whether to stop the gradient from pitch predictor to encoder
+    stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
     energy_predictor_layers: 2 # number of conv layers in energy predictor
     energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
     energy_predictor_kernel_size: 3 # kernel size of conv layers in energy predictor
     energy_predictor_dropout: 0.5 # dropout rate in energy predictor
     energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
     energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
-    stop_gradient_from_energy_predictor: false # whether to stop the gradient from energy predictor to encoder
+    stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
     spk_embed_dim: 256 # speaker embedding dimension
     spk_embed_integration_type: concat # speaker embedding integration type
 
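
What the `stop_gradient_from_*_predictor` flags control, as a minimal sketch (assuming an encoder-state tensor `hs`; names are illustrative, not PaddleSpeech's actual code):

```python
import paddle

def predictor_input(hs: paddle.Tensor, stop_gradient: bool) -> paddle.Tensor:
    # With the flag on, encoder states are detached before the variance
    # predictor, so the pitch/energy loss cannot back-propagate into the
    # encoder and destabilize it early in training.
    return hs.detach() if stop_gradient else hs
```

In this recipe the flag is True for the pitch predictor and False for the energy predictor, so only the pitch loss is blocked from reaching the encoder.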
6 changes: 3 additions & 3 deletions examples/aishell3/voc1/conf/default.yaml
@@ -33,7 +33,7 @@ generator_params:
     aux_context_window: 2 # Context window size for auxiliary feature.
     # If set to 2, previous 2 and future 2 frames will be considered.
     dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
-    use_weight_norm: true # Whether to use weight norm.
+    use_weight_norm: True # Whether to use weight norm.
     # If set to true, it will be applied to all of the conv layers.
     upsample_scales: [4, 5, 3, 5] # Upsampling scales. prod(upsample_scales) == n_shift
 
@@ -46,8 +46,8 @@ discriminator_params:
     kernel_size: 3 # Kernel size of conv layers.
     layers: 10 # Number of conv layers.
     conv_channels: 64 # Number of channels in conv layers.
-    bias: true # Whether to use bias parameter in conv.
-    use_weight_norm: true # Whether to use weight norm.
+    bias: True # Whether to use bias parameter in conv.
+    use_weight_norm: True # Whether to use weight norm.
     # If set to true, it will be applied to all of the conv layers.
     nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
     nonlinear_activation_params: # Nonlinear function parameters
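
For reference, the constraint in the generator comment holds in this recipe: prod([4, 5, 3, 5]) = 300, which should match the hop size used by the companion 24 kHz feature configs (`n_shift: 300`), so the vocoder upsamples one mel frame to exactly one hop of waveform samples.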
91 changes: 91 additions & 0 deletions examples/csmsc/tts0/conf/default.yaml
@@ -0,0 +1,91 @@
# This configuration is for Paddle to train Tacotron 2. Compared to the
# original paper, this configuration additionally uses the guided attention
# loss to accelerate the learning of the diagonal attention. It requires
# only a single GPU with 12 GB memory, and it takes ~1 day to finish the
# training on a Titan V.

###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################

fs: 24000 # Sample rate (Hz).
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as n_fft.
window: "hann" # Window function.

# Only used for feats_type != raw

fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.

###########################################################
# DATA SETTING #
###########################################################
batch_size: 64
num_workers: 2

###########################################################
# MODEL SETTING #
###########################################################
model: # keyword arguments for the selected model
    embed_dim: 512 # char or phn embedding dimension
    elayers: 1 # number of blstm layers in encoder
    eunits: 512 # number of blstm units
    econv_layers: 3 # number of convolutional layers in encoder
    econv_chans: 512 # number of channels in convolutional layer
    econv_filts: 5 # filter size of convolutional layer
    atype: location # attention function type
    adim: 512 # attention dimension
    aconv_chans: 32 # number of channels in convolutional layer of attention
    aconv_filts: 15 # filter size of convolutional layer of attention
    cumulate_att_w: True # whether to cumulate attention weight
    dlayers: 2 # number of lstm layers in decoder
    dunits: 1024 # number of lstm units in decoder
    prenet_layers: 2 # number of layers in prenet
    prenet_units: 256 # number of units in prenet
    postnet_layers: 5 # number of layers in postnet
    postnet_chans: 512 # number of channels in postnet
    postnet_filts: 5 # filter size of postnet layer
    output_activation: null # activation function for the final output
    use_batch_norm: True # whether to use batch normalization in encoder
    use_concate: True # whether to concatenate encoder embedding with decoder outputs
    use_residual: False # whether to use residual connection in encoder
    dropout_rate: 0.5 # dropout rate
    zoneout_rate: 0.1 # zoneout rate
    reduction_factor: 1 # reduction factor
    spk_embed_dim: null # speaker embedding dimension


###########################################################
# UPDATER SETTING #
###########################################################
updater:
    use_masking: True # whether to apply masking for padded part in loss calculation
    bce_pos_weight: 5.0 # weight of positive sample in binary cross entropy calculation
    use_guided_attn_loss: True # whether to use guided attention loss
    guided_attn_loss_sigma: 0.4 # sigma of guided attention loss
    guided_attn_loss_lambda: 1.0 # strength of guided attention loss


##########################################################
# OPTIMIZER SETTING #
##########################################################
optimizer:
    optim: adam # optimizer type
    learning_rate: 1.0e-03 # learning rate
    epsilon: 1.0e-06 # epsilon
    weight_decay: 0.0 # weight decay coefficient

###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 200
num_snapshots: 5

###########################################################
# OTHER SETTING #
###########################################################
seed: 42
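
For reference, the guided attention loss configured above (sigma = 0.4) penalizes attention mass far from the diagonal, following Tachibana et al. (2017). A minimal NumPy sketch of the weight matrix and loss; variable names and the toy alignment are illustrative:

```python
import numpy as np

def guided_attention_weight(n_text: int, n_frames: int, sigma: float = 0.4) -> np.ndarray:
    """W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2 sigma^2)); near zero on the diagonal."""
    n = np.arange(n_text)[:, None] / n_text
    t = np.arange(n_frames)[None, :] / n_frames
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * sigma ** 2))

# Toy alignment: 20 text tokens, 30 decoder frames; each frame's attention
# over the text sums to 1.
att = np.random.dirichlet(np.ones(20), size=30).T   # shape (20, 30)
W = guided_attention_weight(20, 30, sigma=0.4)      # guided_attn_loss_sigma
loss = 1.0 * np.mean(att * W)                       # guided_attn_loss_lambda
print(loss)  # diagonal-ish alignments give smaller values
```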
