[TTS] fix broken tutorial for MixerTTS. #4949

Merged
merged 1 commit into from
Sep 21, 2022
62 changes: 62 additions & 0 deletions tutorials/tts/FastPitch_MixerTTS_Training.ipynb
@@ -38,6 +38,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "261df0a0",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
@@ -60,6 +62,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "9e0c0d38",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
@@ -74,12 +78,16 @@
},
{
"cell_type": "markdown",
"id": "efa2c292",
"metadata": {},
"source": [
"# Introduction"
]
},
{
"cell_type": "markdown",
"id": "95884fcd",
"metadata": {},
"source": [
"### FastPitch\n",
"\n",
@@ -97,6 +105,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "9be422ee",
"metadata": {},
"outputs": [],
"source": [
"from nemo.collections.tts.models.base import SpectrogramGenerator\n",
@@ -110,6 +120,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "cdf4aee7",
"metadata": {},
"outputs": [],
"source": [
"# Let's see what pretrained models are available for FastPitch and Mixer-TTS\n",
@@ -123,6 +135,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "298704c4",
"metadata": {},
"outputs": [],
"source": [
"# We can load the pre-trained FastPitch model as follows\n",
@@ -134,6 +148,10 @@
{
"cell_type": "code",
"execution_count": null,
"id": "c18181ff",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# In the same way, we can load the pre-trained Mixer-TTS model as follows\n",
@@ -145,6 +163,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "fb41b646",
"metadata": {},
"outputs": [],
"source": [
"assert isinstance(spec_gen, SpectrogramGenerator)\n",
@@ -165,13 +185,17 @@
},
{
"cell_type": "markdown",
"id": "54ec3c5e",
"metadata": {},
"source": [
"# Preprocessing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ef87e31",
"metadata": {},
"outputs": [],
"source": [
"from nemo_text_processing.g2p.modules import EnglishG2p\n",
@@ -182,6 +206,8 @@
},
{
"cell_type": "markdown",
"id": "9fd5dec0",
"metadata": {},
"source": [
"We will show an example of preprocessing and training using a small part of the AN4 dataset. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, as well as their corresponding transcripts. Let's download the data, the prepared manifests, and the supplementary files.\n",
"\n",
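The manifests mentioned in the cell above are, in the usual NeMo convention, JSON-lines files with one utterance per line. The sketch below shows how such a file could be parsed. It assumes each entry carries `audio_filepath`, `text`, and `duration` fields; the helper names `parse_manifest_lines` and `total_duration` are hypothetical and are not part of the tutorial or of NeMo.

```python
import json


def parse_manifest_lines(lines):
    """Parse JSON-lines manifest text into a list of dicts, skipping blank lines.

    Hypothetical helper for illustration only; field names are assumed.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if line:
            entries.append(json.loads(line))
    return entries


def total_duration(entries):
    """Sum the (assumed) `duration` field, in seconds, over all entries."""
    return sum(entry.get("duration", 0.0) for entry in entries)


# Usage with in-memory lines; a real manifest would be read from an open file.
sample = [
    '{"audio_filepath": "a.wav", "text": "one two", "duration": 1.5}',
    "",
    '{"audio_filepath": "b.wav", "text": "three", "duration": 2.0}',
]
entries = parse_manifest_lines(sample)
print(len(entries), total_duration(entries))
```

Passing an open file object instead of a list works the same way, since both yield lines when iterated.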
@@ -193,6 +219,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "6b621b1c",
"metadata": {},
"outputs": [],
"source": [
"# download data and manifests\n",
@@ -208,6 +236,8 @@
},
{
"cell_type": "markdown",
"id": "45f19be7",
"metadata": {},
"source": [
"### FastPitch\n",
"\n",
@@ -219,6 +249,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "4e76d950",
"metadata": {},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/fastpitch.py\n",
@@ -230,6 +262,8 @@
},
{
"cell_type": "markdown",
"id": "82a2eacb",
"metadata": {},
"source": [
"The TTS text preprocessing pipeline consists of two stages: text normalization and text tokenization. Both can be handled by `nemo.collections.tts.torch.data.TTSDataset` during training.\n",
"\n",
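The two stages named in the cell above, normalization followed by tokenization, can be sketched with toy stand-ins. Everything in this block is hypothetical: `toy_normalize` and `ToyCharTokenizer` only illustrate the shape of the pipeline and do not reproduce the behavior of NeMo's `Normalizer` or its tokenizers.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def toy_normalize(text):
    """Stage 1 (toy): spell out each digit, lowercase, collapse whitespace."""
    spelled = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[int(m.group())]} ", text)
    return re.sub(r"\s+", " ", spelled).strip().lower()


class ToyCharTokenizer:
    """Stage 2 (toy): map each known character to an integer id."""

    def __init__(self, chars="abcdefghijklmnopqrstuvwxyz '"):
        self.char_to_id = {c: i for i, c in enumerate(chars)}

    def __call__(self, text):
        # Characters outside the vocabulary are silently dropped in this sketch.
        return [self.char_to_id[c] for c in text if c in self.char_to_id]


normalized = toy_normalize("Room 42")
tokens = ToyCharTokenizer()(normalized)
print(normalized, tokens)
```

Keeping the stages separate makes it possible to swap the tokenizer (for example, characters versus phonemes) without touching normalization.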
@@ -239,6 +273,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "a46da66d",
"metadata": {},
"outputs": [],
"source": [
"# Text normalizer\n",
@@ -259,6 +295,8 @@
},
{
"cell_type": "markdown",
"id": "884d8d82",
"metadata": {},
"source": [
"To accelerate and stabilize training, we also need to extract the pitch for every audio file, estimate pitch statistics (mean and std), and pre-calculate alignment prior matrices for the alignment framework. To do this, we only need to iterate over our data once.\n",
"\n",
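The pitch statistics mentioned in the cell above can be illustrated with a small standalone sketch. It assumes that unvoiced frames are stored as 0 and are excluded from the statistics, and it uses the population standard deviation; both choices are assumptions for illustration and may differ from what the tutorial's `pre_calculate_supplementary_data` does. The function names are hypothetical.

```python
from statistics import fmean, pstdev


def pitch_stats(pitch_tracks):
    """Return (mean, std) over voiced frames of all tracks.

    Assumes a frame value of 0 marks an unvoiced frame.
    """
    voiced = [f0 for track in pitch_tracks for f0 in track if f0 > 0]
    if not voiced:
        raise ValueError("no voiced frames found")
    return fmean(voiced), pstdev(voiced)


def normalize_pitch(track, mean, std):
    """Standardize voiced frames; keep unvoiced frames at 0."""
    return [(f0 - mean) / std if f0 > 0 else 0.0 for f0 in track]


# Two short per-utterance pitch tracks in Hz; zeros are unvoiced frames.
tracks = [[0.0, 100.0, 200.0], [300.0, 0.0]]
mean, std = pitch_stats(tracks)
print(mean, std, normalize_pitch(tracks[0], mean, std))
```

Computing the statistics in a single pass over the data and reusing them for every utterance is what lets the values be passed to training as fixed configuration.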
@@ -277,6 +315,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "7108f748",
"metadata": {},
"outputs": [],
"source": [
"def pre_calculate_supplementary_data(sup_data_path, sup_data_types, text_tokenizer, text_normalizer, text_normalizer_call_kwargs):\n",
@@ -323,6 +363,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "f1affe50",
"metadata": {},
"outputs": [],
"source": [
"fastpitch_sup_data_path = \"fastpitch_sup_data_folder\"\n",
@@ -335,6 +377,8 @@
},
{
"cell_type": "markdown",
"id": "d868bb48",
"metadata": {},
"source": [
"### Mixer-TTS\n",
"\n",
@@ -346,6 +390,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "1c7c0cfc",
"metadata": {},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/mixer_tts.py\n",
@@ -357,13 +403,17 @@
},
{
"cell_type": "markdown",
"id": "e2f10886",
"metadata": {},
"source": [
"In the FastPitch pipeline we used a char-based tokenizer, but in the Mixer-TTS training pipeline we would like to demonstrate a phoneme-based tokenizer, `nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.EnglishPhonemesTokenizer`. Unlike the char-based tokenizer, `EnglishPhonemesTokenizer` needs a phoneme dictionary and a heteronym dictionary. We will use the same `nemo_text_processing.text_normalization.normalize.Normalizer` for normalizing the text as in the FastPitch example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6ba0f9a",
"metadata": {},
"outputs": [],
"source": [
"# Text normalizer\n",
@@ -397,13 +447,17 @@
},
{
"cell_type": "markdown",
"id": "9fc55415",
"metadata": {},
"source": [
"Just as in FastPitch, we will need to extract the pitch for every audio file, estimate pitch statistics (mean and std), and pre-calculate alignment prior matrices for the alignment framework."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aabc1f0f",
"metadata": {},
"outputs": [],
"source": [
"mixer_tts_sup_data_path = \"mixer_tts_sup_data_folder\"\n",
@@ -416,12 +470,16 @@
},
{
"cell_type": "markdown",
"id": "c0711ec6",
"metadata": {},
"source": [
"# Training"
]
},
{
"cell_type": "markdown",
"id": "0a95848c",
"metadata": {},
"source": [
"### FastPitch\n",
"\n",
@@ -433,6 +491,8 @@
{
"cell_type": "code",
"execution_count": null,
"id": "cc1a9107",
"metadata": {},
"outputs": [],
"source": [
"!(python fastpitch.py --config-name=fastpitch_align_v1.05.yaml \\\n",
@@ -460,6 +520,8 @@
},
{
"cell_type": "markdown",
"id": "d6bce3ce",
"metadata": {},
"source": [
"Let's look at some of the options in the training command:\n",
"\n",