[TTS] fix broken tutorial for MixerTTS. (#4949)
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>

XuesongYang authored Sep 21, 2022
1 parent f49a47d commit 6616b04
Showing 1 changed file with 62 additions and 0 deletions.
62 changes: 62 additions & 0 deletions tutorials/tts/FastPitch_MixerTTS_Training.ipynb
@@ -38,6 +38,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "261df0a0",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
@@ -60,6 +62,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "9e0c0d38",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
@@ -74,12 +78,16 @@
},
{
"cell_type": "markdown",
"id": "efa2c292",
"metadata": {},
"source": [
"# Introduction"
]
},
{
"cell_type": "markdown",
"id": "95884fcd",
"metadata": {},
"source": [
"### FastPitch\n",
"\n",
@@ -97,6 +105,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "9be422ee",
"metadata": {},
"outputs": [],
"source": [
"from nemo.collections.tts.models.base import SpectrogramGenerator\n",
@@ -110,6 +120,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "cdf4aee7",
"metadata": {},
"outputs": [],
"source": [
"# Let's see what pretrained models are available for FastPitch and Mixer-TTS\n",
@@ -123,6 +135,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "298704c4",
"metadata": {},
"outputs": [],
"source": [
"# We can load the pre-trained FastPitch model as follows\n",
@@ -134,6 +148,10 @@
{
"cell_type": "code",
"execution_count": null,
"id": "c18181ff",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# In the same way, we can load the pre-trained Mixer-TTS model as follows\n",
@@ -145,6 +163,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "fb41b646",
"metadata": {},
"outputs": [],
"source": [
"assert isinstance(spec_gen, SpectrogramGenerator)\n",
@@ -165,13 +185,17 @@
},
{
"cell_type": "markdown",
"id": "54ec3c5e",
"metadata": {},
"source": [
"# Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ef87e31",
"metadata": {},
"outputs": [],
"source": [
"from nemo_text_processing.g2p.modules import EnglishG2p\n",
@@ -182,6 +206,8 @@
},
{
"cell_type": "markdown",
"id": "9fd5dec0",
"metadata": {},
"source": [
"We will show an example of preprocessing and training using a small part of the AN4 dataset. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, as well as their corresponding transcripts. Let's download the data, prepared manifests, and supplementary files.\n",
"\n",
@@ -193,6 +219,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "6b621b1c",
"metadata": {},
"outputs": [],
"source": [
"# download data and manifests\n",
@@ -208,6 +236,8 @@
},
{
"cell_type": "markdown",
"id": "45f19be7",
"metadata": {},
"source": [
"### FastPitch\n",
"\n",
@@ -219,6 +249,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "4e76d950",
"metadata": {},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/fastpitch.py\n",
@@ -230,6 +262,8 @@
},
{
"cell_type": "markdown",
"id": "82a2eacb",
"metadata": {},
"source": [
"The TTS text preprocessing pipeline consists of two stages: text normalization and text tokenization. Both can be handled by `nemo.collections.tts.torch.data.TTSDataset` for training. \n",
"\n",
@@ -239,6 +273,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "a46da66d",
"metadata": {},
"outputs": [],
"source": [
"# Text normalizer\n",
@@ -259,6 +295,8 @@
},
{
"cell_type": "markdown",
"id": "884d8d82",
"metadata": {},
"source": [
"To accelerate and stabilize our training, we also need to extract pitch for every audio file, estimate pitch statistics (mean and standard deviation), and pre-calculate alignment prior matrices for the alignment framework. To do this, we only need to iterate over our data once.\n",
"\n",
@@ -277,6 +315,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "7108f748",
"metadata": {},
"outputs": [],
"source": [
"def pre_calculate_supplementary_data(sup_data_path, sup_data_types, text_tokenizer, text_normalizer, text_normalizer_call_kwargs):\n",
@@ -323,6 +363,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "f1affe50",
"metadata": {},
"outputs": [],
"source": [
"fastpitch_sup_data_path = \"fastpitch_sup_data_folder\"\n",
@@ -335,6 +377,8 @@
},
{
"cell_type": "markdown",
"id": "d868bb48",
"metadata": {},
"source": [
"### Mixer-TTS\n",
"\n",
@@ -346,6 +390,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "1c7c0cfc",
"metadata": {},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/mixer_tts.py\n",
@@ -357,13 +403,17 @@
},
{
"cell_type": "markdown",
"id": "e2f10886",
"metadata": {},
"source": [
"In the FastPitch pipeline we used a char-based tokenizer, but in the Mixer-TTS training pipeline we would like to demonstrate a phoneme-based tokenizer, `nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.EnglishPhonemesTokenizer`. Unlike the char-based tokenizer, `EnglishPhonemesTokenizer` needs a phoneme dictionary and a heteronym dictionary. For text normalization, we will use the same `nemo_text_processing.text_normalization.normalize.Normalizer` as in the FastPitch example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6ba0f9a",
"metadata": {},
"outputs": [],
"source": [
"# Text normalizer\n",
@@ -397,13 +447,17 @@
},
{
"cell_type": "markdown",
"id": "9fc55415",
"metadata": {},
"source": [
"Just like in FastPitch, we will need to extract pitch for every audio file, estimate pitch statistics (mean and standard deviation), and pre-calculate alignment prior matrices for the alignment framework."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aabc1f0f",
"metadata": {},
"outputs": [],
"source": [
"mixer_tts_sup_data_path = \"mixer_tts_sup_data_folder\"\n",
@@ -416,12 +470,16 @@
},
{
"cell_type": "markdown",
"id": "c0711ec6",
"metadata": {},
"source": [
"# Training"
]
},
{
"cell_type": "markdown",
"id": "0a95848c",
"metadata": {},
"source": [
"### FastPitch\n",
"\n",
@@ -433,6 +491,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "cc1a9107",
"metadata": {},
"outputs": [],
"source": [
"!(python fastpitch.py --config-name=fastpitch_align_v1.05.yaml \\\n",
@@ -460,6 +520,8 @@
},
{
"cell_type": "markdown",
"id": "d6bce3ce",
"metadata": {},
"source": [
"Let's look at some of the options in the training command:\n",
"\n",
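Every hunk in this diff adds an `"id"` and a `"metadata"` entry to a notebook cell. A minimal sketch of how such fields could be filled in for cells that lack them is given below. It assumes the notebook is plain JSON with a top-level `cells` list; the helper name `add_missing_cell_fields` and the in-memory sample notebook are hypothetical, and this is not necessarily how the commit itself was produced (re-saving the notebook in a recent Jupyter version may well have the same effect).

```python
import json
import uuid


def add_missing_cell_fields(notebook: dict) -> int:
    """Give every cell an "id" and a "metadata" object if it lacks one.

    Returns the number of cells that were modified.
    """
    changed = 0
    for cell in notebook.get("cells", []):
        touched = False
        if "id" not in cell:
            # 8 hex characters, the same shape as the ids shown in the diff.
            # Assumption: a random prefix is unique enough for a small notebook.
            cell["id"] = uuid.uuid4().hex[:8]
            touched = True
        if "metadata" not in cell:
            cell["metadata"] = {}
            touched = True
        changed += touched
    return changed


# Tiny in-memory notebook standing in for the real .ipynb file (hypothetical data).
nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Introduction"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "outputs": [],
            "source": ["import json\n"],
        },
    ]
}

n_changed = add_missing_cell_fields(nb)
print(f"cells updated: {n_changed}")
print(json.dumps(nb["cells"][0], indent=1))
```

To apply this to a file on disk, load it with `json.load`, call the helper, and write the result back with `json.dump(..., indent=1)`. Running the helper a second time changes nothing, since cells that already carry both fields are left alone.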
