NVIDIA · ericharper · Sep 21, 2022 · Sep 20, 2022
diff --git a/tutorials/tts/FastPitch_MixerTTS_Training.ipynb b/tutorials/tts/FastPitch_MixerTTS_Training.ipynb
@@ -38,6 +38,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "261df0a0",
+   "metadata": {},
    "outputs": [],
    "source": [
     "\"\"\"\n",
@@ -60,6 +62,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "9e0c0d38",
+   "metadata": {},
    "outputs": [],
    "source": [
     "import json\n",
@@ -74,12 +78,16 @@
   },
   {
    "cell_type": "markdown",
+   "id": "efa2c292",
+   "metadata": {},
    "source": [
     "# Introduction"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "95884fcd",
+   "metadata": {},
    "source": [
     "### FastPitch\n",
     "\n",
@@ -97,6 +105,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "9be422ee",
+   "metadata": {},
    "outputs": [],
    "source": [
     "from nemo.collections.tts.models.base import SpectrogramGenerator\n",
@@ -110,6 +120,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cdf4aee7",
+   "metadata": {},
    "outputs": [],
    "source": [
     "# Let's see what pretrained models are available for FastPitch and Mixer-TTS\n",
@@ -123,6 +135,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "298704c4",
+   "metadata": {},
    "outputs": [],
    "source": [
     "# We can load the pre-trained FastPitch model as follows\n",
@@ -134,6 +148,10 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c18181ff",
+   "metadata": {
+    "scrolled": true
+   },
    "outputs": [],
    "source": [
     "# In the same way, we can load the pre-trained Mixer-TTS model as follows\n",
@@ -145,6 +163,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "fb41b646",
+   "metadata": {},
    "outputs": [],
    "source": [
     "assert isinstance(spec_gen, SpectrogramGenerator)\n",
@@ -165,13 +185,17 @@
   },
   {
    "cell_type": "markdown",
+   "id": "54ec3c5e",
+   "metadata": {},
    "source": [
     "# Preprocessing"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "7ef87e31",
+   "metadata": {},
    "outputs": [],
    "source": [
     "from nemo_text_processing.g2p.modules import EnglishG2p\n",
@@ -182,6 +206,8 @@
   },
   {
    "cell_type": "markdown",
+   "id": "9fd5dec0",
+   "metadata": {},
    "source": [
     "We will show an example of preprocessing and training using a small part of the AN4 dataset. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, as well as their corresponding transcripts. Let's download the data, the prepared manifests, and the supplementary files.\n",
     "\n",
@@ -193,6 +219,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "6b621b1c",
+   "metadata": {},
    "outputs": [],
    "source": [
     "# download data and manifests\n",
@@ -208,6 +236,8 @@
   },
   {
    "cell_type": "markdown",
+   "id": "45f19be7",
+   "metadata": {},
    "source": [
     "### FastPitch\n",
     "\n",
@@ -219,6 +249,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "4e76d950",
+   "metadata": {},
    "outputs": [],
    "source": [
     "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/fastpitch.py\n",
@@ -230,6 +262,8 @@
   },
   {
    "cell_type": "markdown",
+   "id": "82a2eacb",
+   "metadata": {},
    "source": [
     "The TTS text preprocessing pipeline consists of two stages: text normalization and text tokenization. Both can be handled by `nemo.collections.tts.torch.data.TTSDataset` during training.  \n",
     "\n",
@@ -239,6 +273,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "a46da66d",
+   "metadata": {},
    "outputs": [],
    "source": [
     "# Text normalizer\n",
@@ -259,6 +295,8 @@
   },
   {
    "cell_type": "markdown",
+   "id": "884d8d82",
+   "metadata": {},
    "source": [
     "To accelerate and stabilize training, we also need to extract the pitch for every audio file, estimate pitch statistics (mean and std), and pre-calculate alignment prior matrices for the alignment framework. All of this takes a single pass over the data.\n",
     "\n",
@@ -277,6 +315,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "7108f748",
+   "metadata": {},
    "outputs": [],
    "source": [
     "def pre_calculate_supplementary_data(sup_data_path, sup_data_types, text_tokenizer, text_normalizer, text_normalizer_call_kwargs):\n",
@@ -323,6 +363,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "f1affe50",
+   "metadata": {},
    "outputs": [],
    "source": [
     "fastpitch_sup_data_path = \"fastpitch_sup_data_folder\"\n",
@@ -335,6 +377,8 @@
   },
   {
    "cell_type": "markdown",
+   "id": "d868bb48",
+   "metadata": {},
    "source": [
     "### Mixer-TTS\n",
     "\n",
@@ -346,6 +390,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "1c7c0cfc",
+   "metadata": {},
    "outputs": [],
    "source": [
     "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/mixer_tts.py\n",
@@ -357,13 +403,17 @@
   },
   {
    "cell_type": "markdown",
+   "id": "e2f10886",
+   "metadata": {},
    "source": [
     "In the FastPitch pipeline we used a char-based tokenizer, but in the Mixer-TTS training pipeline we would like to demonstrate a phoneme-based tokenizer, `nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.EnglishPhonemesTokenizer`. Unlike the char-based tokenizer, `EnglishPhonemesTokenizer` needs a phoneme dictionary and a heteronym dictionary. We will use the same `nemo_text_processing.text_normalization.normalize.Normalizer` to normalize the text as in the FastPitch example."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c6ba0f9a",
+   "metadata": {},
    "outputs": [],
    "source": [
     "# Text normalizer\n",
@@ -397,13 +447,17 @@
   },
   {
    "cell_type": "markdown",
+   "id": "9fc55415",
+   "metadata": {},
    "source": [
     "Just like in FastPitch, we need to extract the pitch for every audio file, estimate pitch statistics (mean and std), and pre-calculate alignment prior matrices for the alignment framework."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "aabc1f0f",
+   "metadata": {},
    "outputs": [],
    "source": [
     "mixer_tts_sup_data_path = \"mixer_tts_sup_data_folder\"\n",
@@ -416,12 +470,16 @@
   },
   {
    "cell_type": "markdown",
+   "id": "c0711ec6",
+   "metadata": {},
    "source": [
     "# Training"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "0a95848c",
+   "metadata": {},
    "source": [
     "### FastPitch\n",
     "\n",
@@ -433,6 +491,8 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "cc1a9107",
+   "metadata": {},
    "outputs": [],
    "source": [
     "!(python fastpitch.py --config-name=fastpitch_align_v1.05.yaml \\\n",
@@ -460,6 +520,8 @@
   },
   {
    "cell_type": "markdown",
+   "id": "d6bce3ce",
+   "metadata": {},
    "source": [
     "Let's look at some of the options in the training command:\n",
     "\n",