From af67ee2f74116a773a8c4a986cc9fd1e244b4369 Mon Sep 17 00:00:00 2001
From: Jocelyn Huang <jocelynh@nvidia.com>
Date: Thu, 27 Oct 2022 16:22:21 -0700
Subject: [PATCH] Minor typo fixes in TTS tutorial

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
---
 tutorials/tts/NeMo_TTS_Primer.ipynb | 285 ++++++++++++++--------------
 1 file changed, 147 insertions(+), 138 deletions(-)
diff --git a/tutorials/tts/NeMo_TTS_Primer.ipynb b/tutorials/tts/NeMo_TTS_Primer.ipynb
index 6b9ec79a53f1..0580d061d7fa 100644
--- a/tutorials/tts/NeMo_TTS_Primer.ipynb
+++ b/tutorials/tts/NeMo_TTS_Primer.ipynb
@@ -214,7 +214,7 @@
     "  </tr>\n",
     "</table>\n",
     "\n",
-    "The above examples may be slightly different than the output of the NeMo text normalization code. More details on NeMo text normalization can be found in the our [TN documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/text_normalization/intro.html).\n",
+    "The above examples may be slightly different than the output of the NeMo text normalization code. More details on NeMo text normalization can be found in the [TN documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/text_normalization/intro.html).\n",
     "\n",
     "A more comprehensive list of text normalization rules, examples, and languages are available in the [code](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization).\n",
     "\n"
@@ -301,8 +301,7 @@
    "source": [
     "Today text normalization is typically a very manual process involving lots of rules, heuristics, and regular expressions.\n",
     "\n",
-    "It is difficult to train a machine learning model to automate this step due to lack of labeled data. To get ground truth data one would need to manually annotate the entire dataset. The resulting model would then have strictly worse performance than the the manual system producing the labels, making it better to use the original labeling system rather than the model.\n",
-    "\n"
+    "It is difficult to train a machine learning model to automate this step due to lack of labeled data. To get ground truth data one would need to manually annotate the entire dataset. The resulting model would then have strictly worse performance than the manual system producing the labels, making it better to use the original labeling system rather than the model."
    ]
   },
   {
@@ -348,7 +347,7 @@
     "\n",
     "For example (using [ARPABET](https://en.wikipedia.org/wiki/ARPABET)): *Hello World &rarr; HH, AH0, L, OW1, ,W, ER1, L, D*\n",
     "\n",
-    "Some languages, such as Spanish and German, are *phonetic*, meaning their written characters/graphemes are always pronounced the same. For such languages G2P is unnecesary.\n",
+    "Some languages, such as Spanish and German, are *phonetic*, meaning their written characters/graphemes are always pronounced the same. For such languages G2P is unnecessary.\n",
     "\n",
     "However English is not Phonetic because:\n",
     "*   Characters change pronunciation depending on what word they are in.\n",
@@ -622,7 +621,7 @@
     "\n",
     "Most of the earlier descriptions about Text Normalization are also the same for G2P, in regards to it being difficult to get labeled data to train a machine learning model to do it automatically and challenging to generalize and scale across languages.\n",
     "\n",
-    "The most common way that G2P is done today is to to hardcode the grapheme to phoneme mapping for all common words in a language in a **pronouncing dictionary**.\n",
+    "The most common way that G2P is done today is to hardcode the grapheme to phoneme mapping for all common words in a language in a **pronouncing dictionary**.\n",
     "\n",
     "A few examples of dictionary entries:\n",
     "```\n",
@@ -745,15 +744,15 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "## 7.1&nbsp;Audio"
-   ],
    "metadata": {
     "id": "_yo7Ru_GMA0E",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "## 7.1&nbsp;Audio"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -854,7 +853,7 @@
     "\n",
     "With 2 dimensions we can effectively use **CNNs** by running [temporal convolutions](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html?highlight=conv1d#torch.nn.Conv1d) over the time dimension. Or by applying [2d convolutions](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html?highlight=conv2d#torch.nn.Conv2d) to the spectrogram exactly as if it were an image in computer vision.\n",
     "\n",
-    "**Transformers** require computation/memory that is proportional to the length of the sequence squared. This means we can easily use large transformers for relatively short sequences like in NLP, smaller transformers for longer sequences like spectrogram data, and are inpractical to use on very long sequences like audio samples."
+    "**Transformers** require computation/memory that is proportional to the length of the sequence squared. This means we can easily use large transformers for relatively short sequences like in NLP and smaller transformers for longer sequences like spectrogram data, but transformers are impractical to use on very long sequences like audio samples."
    ]
   },
   {
@@ -880,7 +879,7 @@
    "source": [
     "Before we go into the details of how this works, let's go through an end-to-end text to audio example so we can visualize what our model inputs and outputs look and sound like.\n",
     "\n",
-    "To do this, we will need to use both the spectrogram and vocoder models together. The vocoder will be looked at more throughly in the *audio synthesis* section."
+    "To do this, we will need to use both the spectrogram and vocoder models together. The vocoder will be looked at more thoroughly in the *audio synthesis* section."
    ]
   },
   {
@@ -1039,15 +1038,15 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "### 7.6.1&nbsp;Tacotron 2"
-   ],
    "metadata": {
     "id": "lGVKcJp6Y7Kv",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "### 7.6.1&nbsp;Tacotron 2"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -1203,9 +1202,9 @@
     "*   The attention should be **monotonically increasing**, meaning it never go backwards in the text sequence. So the attention should only ever stay on the current character, or move forward to the next character.\n",
     "*   The model should start on the first character in the sequence and end on the last character.\n",
     "\n",
-    "These contraints result in the decoder effectively \"reading\" the text character by character or word by word, similar to how humans read aloud.\n",
+    "These constraints result in the decoder effectively \"reading\" the text character by character or word by word, similar to how humans read aloud.\n",
     "\n",
-    "A model may need to be trained for a while before its attention learns to follow these constraints. Before that, the attention may look non-sensical, and the model output will sound unintelligable.\n",
+    "A model may need to be trained for a while before its attention learns to follow these constraints. Before that, the attention may look nonsensical, and the model output will sound unintelligible.\n",
     "\n",
     "Once the models learns the above constraints and starts producing well-behaved attention maps, it is said that the model has **aligned**."
    ]
@@ -1395,20 +1394,26 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "## 7.7&nbsp;Duration Prediction\n"
-   ],
    "metadata": {
     "id": "uya9DJ1SWwEx",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "## 7.7&nbsp;Duration Prediction\n"
+   ]
   },
   {
    "cell_type": "markdown",
+   "metadata": {
+    "id": "O6uH8q-BZjko",
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
-    "A large weakness of the original Tacotron 2 model is its attention mechanism, which does not enforce the required monotonicity constraint (ie. the decoder must pay attention to each character once in sequential increasing order). As a result, the attention is not robust. It often skips words, repeats words, or encounters catastrophic failures where the output becomes unintelligable.\n",
+    "A large weakness of the original Tacotron 2 model is its attention mechanism, which does not enforce the required monotonicity constraint (i.e. the decoder must pay attention to each character once in sequential increasing order). As a result, the attention is not robust. It often skips words, repeats words, or encounters catastrophic failures where the output becomes unintelligible.\n",
     "\n",
     "There are some attention mechanisms such as [forward attention](https://arxiv.org/abs/1807.06736) which try to address this.\n",
     "\n",
@@ -1416,33 +1421,33 @@
     "\n",
     "Replacing the attention mechanism in Tacotron 2 with duration prediction, eg. [Non-Attentive Tacotron](https://arxiv.org/abs/2010.04301), has historically been a common and necessary optimization to make it robust enough for use in enterprise applications. Though it gained visibility in academic literature primarily due to its use in modern transformer based model architectures such as [FastSpeech](https://arxiv.org/abs/1905.09263) and [FastPitch](https://fastpitch.github.io/).\n",
     "\n",
-    "The biggest drawback of this approach is that you you need to get the ground truth character duration information. Some methods for doing this are:\n",
+    "The biggest drawback of this approach is that you need to get the ground truth character duration information. Some methods for doing this are:\n",
     "\n",
     "1.   The preferred method in NeMo is to Jointly train an [alignment model](https://arxiv.org/abs/2108.10447) that measures the similarity between characters and spectrogram frames.\n",
     "2.   Run forced alignment, such as with the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/).\n",
     "3.   Infer the duration information from the attention map of a teacher model, such as Tacotron 2."
-   ],
-   "metadata": {
-    "id": "O6uH8q-BZjko",
-    "pycharm": {
-     "name": "#%% md\n"
-    }
-   }
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "## 7.8&nbsp;Parallel Models\n"
-   ],
    "metadata": {
     "id": "Z7SfuEJK6176",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "## 7.8&nbsp;Parallel Models\n"
+   ]
   },
   {
    "cell_type": "markdown",
+   "metadata": {
+    "id": "XOtPiRajZG2Z",
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
     "There are some significant weaknesses to auto-regressive systems. Most notably:\n",
     "\n",
@@ -1451,25 +1456,19 @@
     "*  The user has little control over how the sentence is spoken.\n",
     "\n",
     "Using duration prediction enables us to remove the auto-regressive inference and predict every spectrogram frame in parallel. This makes the inference speed up to 100x faster, making it highly preferable for deploying and serving to users."
-   ],
-   "metadata": {
-    "id": "XOtPiRajZG2Z",
-    "pycharm": {
-     "name": "#%% md\n"
-    }
-   }
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "### 7.8.1&nbsp;FastPitch"
-   ],
    "metadata": {
     "id": "HgMfSDW5ZaE4",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "### 7.8.1&nbsp;FastPitch"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -1502,15 +1501,15 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "Let's run the same inference for FastPitch that we did with Tacotron2. The main difference is loading the FastPitch checkpoint using the `FastPitchModel` class."
-   ],
    "metadata": {
     "id": "UN_SIcPuBcQw",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "Let's run the same inference for FastPitch that we did with Tacotron2. The main difference is loading the FastPitch checkpoint using the `FastPitchModel` class."
+   ]
   },
   {
    "cell_type": "code",
@@ -1573,34 +1572,42 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "### 7.8.2&nbsp;Drawbacks"
-   ],
    "metadata": {
     "id": "vwD3Xhwhoys0",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "### 7.8.2&nbsp;Drawbacks"
+   ]
   },
   {
    "cell_type": "markdown",
+   "metadata": {
+    "id": "3jHNDSGmo5f9",
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
     "One weakness of parallel models is that without auto-regressive teacher forcing, the model is unable to reliably predict/reconstruct the original utterance. Primarily due to the inputs not fully capturing the unpredictable variability/ambiguity in the possible outputs. The result is that the model learns an average over possible outputs, creating spectrograms that look unrealistically \"smooth\", degrading the audio quality (https://arxiv.org/abs/2202.13066).\n",
     "\n",
     "This problem can be partially alleviated by fine-tuning the spectrogram inversion model (described in the next section) directly on the predicted spectrograms.\n",
     "\n",
     "To visualize this, let's compare a spectrogram to the corresponding one predicted by FastPitch."
-   ],
-   "metadata": {
-    "id": "3jHNDSGmo5f9",
-    "pycharm": {
-     "name": "#%% md\n"
-    }
-   }
+   ]
   },
   {
    "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "jvHCe1NWplZo",
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
    "source": [
     "# Compute real spectrogram\n",
     "audio_path = \"LJ023-0089.wav\"\n",
@@ -1619,18 +1626,18 @@
     "tokens = fastpitch_model.parse(text, normalize=True)\n",
     "predicted_spectrogram = fastpitch_model.generate_spectrogram(tokens=tokens)\n",
     "predicted_spectrogram = predicted_spectrogram.cpu().detach().numpy()[0]"
-   ],
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
    "metadata": {
-    "id": "jvHCe1NWplZo",
+    "id": "W_PiDO1Dqezk",
     "pycharm": {
      "name": "#%%\n"
     }
    },
-   "execution_count": null,
-   "outputs": []
-  },
-  {
-   "cell_type": "code",
+   "outputs": [],
    "source": [
     "# Compare the spectrograms\n",
     "imshow(real_spectrogram, origin=\"lower\")\n",
@@ -1640,42 +1647,40 @@
     "imshow(predicted_spectrogram, origin=\"lower\")\n",
     "plt.title(\"Predicted Spectrogram\")\n",
     "plt.show()"
-   ],
-   "metadata": {
-    "id": "W_PiDO1Dqezk",
-    "pycharm": {
-     "name": "#%%\n"
-    }
-   },
-   "execution_count": null,
-   "outputs": []
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "As we can see, the predicted spectrogram looks very smooth and well-behaved compared to the ground truth which has a more variation and detail."
-   ],
    "metadata": {
     "id": "_a161gPwreAu",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "As we can see, the predicted spectrogram looks very smooth and well-behaved compared to the ground truth, which has more variation and detail."
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "## 7.9&nbsp;Research"
-   ],
    "metadata": {
     "id": "55BH3c8Pre4l",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "## 7.9&nbsp;Research"
+   ]
   },
   {
    "cell_type": "markdown",
+   "metadata": {
+    "id": "Yg2JKIqQrlCG",
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
     "There is ongoing research into improving the audio quality and expressiveness of models like FastPitch, with a few methods that have shown promising results being:\n",
     "\n",
@@ -1683,13 +1688,7 @@
     "2.   Use [normalizing flows](https://arxiv.org/abs/1908.09257) (sometimes called *glow* models) to directly learn the variability in the training data (eg. [RAD-TTS](https://nv-adlr.github.io/RADTTS)).\n",
     "3.   Use [generative adversarial networks](https://en.wikipedia.org/wiki/Generative_adversarial_network) (GAN) based training to make the predicted spectrograms harder to tell apart from real spectrograms.\n",
     "4.   Avoid the spectrogram entirely by training an end-to-end model that can go directly from text to audio (eg. [VITS](https://arxiv.org/pdf/2106.06103.pdf))."
-   ],
-   "metadata": {
-    "id": "Yg2JKIqQrlCG",
-    "pycharm": {
-     "name": "#%% md\n"
-    }
-   }
+   ]
   },
   {
    "cell_type": "markdown",
@@ -1736,15 +1735,15 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "Here we will take our audio file, compute its mel spectrogram, and then regenerate the origianl audio from the spectrogram using HiFiGan."
-   ],
    "metadata": {
     "id": "iHijlV2vfAzS",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "Here we will take our audio file, compute its mel spectrogram, and then regenerate the original audio from the spectrogram using HiFi-GAN."
+   ]
   },
   {
    "cell_type": "code",
@@ -1794,18 +1793,24 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "## 8.2&nbsp;Modeling approach"
-   ],
    "metadata": {
     "id": "Euxpd50wieyD",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "## 8.2&nbsp;Modeling approach"
+   ]
   },
   {
    "cell_type": "markdown",
+   "metadata": {
+    "id": "9EUITiXCiWkS",
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
     "Spectrogram inversion is a *sequence-to-sequence* problem.\n",
     "\n",
@@ -1826,25 +1831,19 @@
     "Or if your stride is a power of 2 (like the ones we selected) then you can upsample the sequence more effectively using *transposed convolutions* (aka. *deconvolutional layers*).\n",
     "\n",
     "Once the input and output sequences are the same length, you can use any number of models to predict the output."
-   ],
-   "metadata": {
-    "id": "9EUITiXCiWkS",
-    "pycharm": {
-     "name": "#%% md\n"
-    }
-   }
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "## 8.3&nbsp;WaveNet\n"
-   ],
    "metadata": {
     "id": "Ri1_RURKjiss",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "## 8.3&nbsp;WaveNet\n"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -1871,15 +1870,15 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "## 8.4&nbsp;HiFi-GAN"
-   ],
    "metadata": {
     "id": "PYXZjgEEjndF",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "## 8.4&nbsp;HiFi-GAN"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -1942,18 +1941,24 @@
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "# 9.&nbsp;Model Evaluation"
-   ],
    "metadata": {
     "id": "aozxSufVJa0l",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "# 9.&nbsp;Model Evaluation"
+   ]
   },
   {
    "cell_type": "markdown",
+   "metadata": {
+    "id": "I8522HduJmHM",
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
     "There are no well-established objective metrics for evaluating how good a TTS model is. Rather, quality is usually based on human opinion or perception, commonly measured through surveys.\n",
     "\n",
@@ -1964,28 +1969,28 @@
     "There are some metrics which are occasionally used to try and measure audio quality such as [MCD-DTW](https://github.com/MattShannon/mcd), [PESQ](https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality), and [STOI](https://torchmetrics.readthedocs.io/en/stable/audio/short_time_objective_intelligibility.html). But these have very limited accuracy and usefulness.\n",
     "\n",
     "The lack of objective numerical metrics that can be trained on is a large reason as to why many state of the art models rely on GAN based training to get good quality."
-   ],
-   "metadata": {
-    "id": "I8522HduJmHM",
-    "pycharm": {
-     "name": "#%% md\n"
-    }
-   }
+   ]
   },
   {
    "cell_type": "markdown",
-   "source": [
-    "# 10.&nbsp;Additional Resources"
-   ],
    "metadata": {
     "id": "OgtWptQ5tGlq",
     "pycharm": {
      "name": "#%% md\n"
     }
-   }
+   },
+   "source": [
+    "# 10.&nbsp;Additional Resources"
+   ]
   },
   {
    "cell_type": "markdown",
+   "metadata": {
+    "id": "wtJINtrStHvJ",
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
     "To learn more about what TTS technology and models are available in NeMo, please look through our [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tts/intro.html#).\n",
     "\n",
@@ -1995,13 +2000,7 @@
     "*   FastPitch [training](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_MixerTTS_Training.ipynb) and [fine-tuning](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb)\n",
     "\n",
     "To learn how to deploy and serve your TTS models, visit [Riva](https://docs.nvidia.com/deeplearning/riva/index.html)."
-   ],
-   "metadata": {
-    "id": "wtJINtrStHvJ",
-    "pycharm": {
-     "name": "#%% md\n"
-    }
-   }
+   ]
   },
   {
    "cell_type": "markdown",
@@ -2050,12 +2049,22 @@
   "gpuClass": "standard",
   "kernelspec": {
    "display_name": "Python 3",
+   "language": "python",
    "name": "python3"
   },
   "language_info": {
-   "name": "python"
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.7"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 0
+ "nbformat_minor": 1
 }