[TTS] remove phonemizer.py (NVIDIA#5090)

Remove phonemizer.py and convert the code block to markdown in the tutorial.

Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>
hainan-xv · Nov 29, 2022 · 3e4f7cd
1 parent 025f348
commit 3e4f7cd

Showing 2 changed files with 52 additions and 145 deletions.
diff --git a/scripts/dataset_processing/tts/hui_acg/phonemizer.py b/scripts/dataset_processing/tts/hui_acg/phonemizer.py
diff --git a/tutorials/tts/FastPitch_GermanTTS_Training.ipynb b/tutorials/tts/FastPitch_GermanTTS_Training.ipynb
@@ -81,14 +81,10 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "c588ff4f",
-   "metadata": {
-    "pycharm": {
-     "name": "#%%\n"
-    }
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
-    "# lets download the files we need to run this tutorial\n",
+    "# let's download the files we need to run this tutorial\n",
     "\n",
     "!mkdir NeMoGermanTTS\n",
     "!cd NeMoGermanTTS && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/scripts/dataset_processing/tts/openslr/get_data.py\n",
@@ -317,7 +313,7 @@
    "source": [
     "## 4. Phonemization\n",
     "\n",
-    "The pronunciation of a word can be represented as a string of phones, which are speech sounds, each represented with symbols adapted from the Roman alphabet. The IPA is designed to represent those qualities of speech that are part of lexical (and to a limited extent prosodic) sounds in oral language: phones, phonemes, intonation and the separation of words and syllables. Training model with phonemes as well as text will help the model generate more accurate speech sounds."
+    "The pronunciation of a word can be represented as a string of phones, which are minimal speech sound units, each represented with symbols adapted from the Roman alphabet. The IPA is designed to represent those qualities of speech that are part of lexical (and to a limited extent prosodic) sounds in spoken form: phones, phonemes, intonation and the separation of words and syllables. Training models with phonemes as well as text will help the model generate more accurate speech sounds."
    ]
   },
   {
@@ -336,9 +332,7 @@
    "id": "f5a88926",
    "metadata": {},
    "source": [
-    "The original dataset only contains text input, so, in order to add phonemes, we need to convert German text into phonemes using [bootphon/phonemizer](https://github.com/bootphon/phonemizer).\n",
-    "\n",
-    "One of the easiest ways to install phonemizer is via pip and espeak backend via apt:"
+    "The original JSON dataset splits generated by `get_data.py` contain only text (grapheme) inputs. We recommend adding phonemes as well to improve the quality of the synthesized audio, which doubles the size of the dataset. Adding phonemes requires an external tool that converts German text into phonemes. Several open-source tools handle this conversion, and you may choose any of them. In this tutorial, we demonstrate the process using [bootphon/phonemizer](https://github.com/bootphon/phonemizer) with the espeak backend. Before running it, install phonemizer on your local machine via `pip install` and `apt-get install` as shown below:"
    ]
   },
   {
@@ -365,50 +359,58 @@
     "docker exec -it <docker_container_id> /bin/bash\n",
     "```\n",
     "\n",
-    "Other install methods for phonemizer are listed [here](https://bootphon.github.io/phonemizer/install.html)."
+    "Other install methods for phonemizer are listed [here](https://bootphon.github.io/phonemizer/install.html). The following code snippet gives general guidance on how to transliterate graphemes into phonemes using the phonemizer tool. You can run it manually to write new manifests that contain the phoneme entries alongside the grapheme entries."
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "131ce5d0",
-   "metadata": {},
-   "outputs": [],
+   "cell_type": "markdown",
    "source": [
-    "from phonemizer.backend import EspeakBackend\n",
+    "```python\n",
     "import json\n",
-    "\n",
-    "backend = EspeakBackend('de')\n",
-    "\n",
-    "input_manifest_filepaths = [\"DataGermanTTS/thorsten-de/train_manifest\", \\\n",
-    "                            \"DataGermanTTS/thorsten-de/test_manifest\", \\\n",
-    "                            \"DataGermanTTS/thorsten-de/val_manifest\"]\n",
-    "\n",
-    "for input_manifest_filepath in input_manifest_filepaths:\n",
-    "    output_manifest_filepath = input_manifest_filepath+\"_phonemes\"\n",
-    "    records = []\n",
-    "    n_text = []\n",
-    "    with open(input_manifest_filepath + \".json\", \"r\") as f:\n",
-    "        for i, line in enumerate(f):\n",
-    "            d = json.loads(line)\n",
-    "            records.append(d)\n",
-    "            n_text.append(d['normalized_text'])\n",
-    "\n",
-    "    phonemized = backend.phonemize(n_text)\n",
-    "\n",
-    "    new_records = []\n",
-    "    for i in range(len(records)):\n",
-    "        records[i][\"is_phoneme\"] = 0\n",
-    "        new_records.append(records[i])\n",
-    "        phoneme_record = records[i].copy()\n",
-    "        phoneme_record[\"normalized_text\"] = phonemized[i]\n",
-    "        phoneme_record[\"is_phoneme\"] = 1\n",
-    "        new_records.append(phoneme_record)\n",
-    "\n",
-    "    with open(output_manifest_filepath + \".json\", \"w\") as f:\n",
-    "        for r in new_records:\n",
-    "            f.write(json.dumps(r) + '\\n')"
-   ]
+    "from pathlib import Path\n",
+    "from phonemizer.backend import EspeakBackend\n",
+    "from tqdm import tqdm\n",
+    "\n",
+    "def phonemization(manifest, language):\n",
+    "    # you can also consider with_stress=True and add stress symbols into charset of tokenizer for experimental purpose.\n",
+    "    backend = EspeakBackend(language=language, preserve_punctuation=True)\n",
+    "    print(f\"Phonemizing: {manifest}\")\n",
+    "    entries = []\n",
+    "    with open(manifest, 'r', encoding=\"utf-8\") as fjson:\n",
+    "        for line in tqdm(fjson):\n",
+    "            # grapheme\n",
+    "            grapheme_dct = json.loads(line.strip())\n",
+    "            grapheme_dct.update({\"is_phoneme\": 0})\n",
+    "            # phoneme\n",
+    "            phoneme_dct = grapheme_dct.copy()\n",
+    "            # you can also add a separator.Separator(phone=\"_\") to distinguish phone or word boundaries for experimental purpose.\n",
+    "            phonemes = backend.phonemize([grapheme_dct[\"normalized_text\"]], strip=True)\n",
+    "            phoneme_dct[\"normalized_text\"] = phonemes[0]\n",
+    "            phoneme_dct[\"is_phoneme\"] = 1\n",
+    "\n",
+    "            entries.append(grapheme_dct)\n",
+    "            entries.append(phoneme_dct)\n",
+    "\n",
+    "    output_manifest_filepath = manifest.parent / f\"{manifest.stem}_phonemes{manifest.suffix}\"\n",
+    "    with open(output_manifest_filepath, \"w\", encoding=\"utf-8\") as fout:\n",
+    "        for entry in entries:\n",
+    "            fout.write(f\"{json.dumps(entry)}\\n\")\n",
+    "    print(f\"Phonemizing is complete: {manifest} --> {output_manifest_filepath}\")\n",
+    "\n",
+    "input_manifest_filepaths = [\n",
+    "    \"DataGermanTTS/thorsten-de/train_manifest.json\",\n",
+    "    \"DataGermanTTS/thorsten-de/test_manifest.json\",\n",
+    "    \"DataGermanTTS/thorsten-de/val_manifest.json\"\n",
+    "]\n",
+    "\n",
+    "language = 'de'\n",
+    "for manifest in input_manifest_filepaths:\n",
+    "    phonemization(Path(manifest), language)\n",
+    "```"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
   },
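The manifest-doubling logic in the new markdown cell can be tried without installing espeak. Below is a minimal sketch, not the tutorial's code: `add_phoneme_entries` and `fake_phonemizer` are hypothetical names introduced here, and `fake_phonemizer` is only a stand-in that upper-cases text instead of producing real phonemes. In practice one would presumably pass a small wrapper around `EspeakBackend.phonemize` instead.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def add_phoneme_entries(manifest: Path, to_phonemes: Callable[[str], str]) -> Path:
    """Write <stem>_phonemes<suffix> with each entry twice: grapheme, then phoneme.

    `to_phonemes` is an assumed callable mapping one text string to its
    phoneme string; it is not part of phonemizer or NeMo.
    """
    entries: List[dict] = []
    with open(manifest, "r", encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the JSON-lines manifest
            grapheme = json.loads(line)
            grapheme["is_phoneme"] = 0
            # Copy the entry, then overwrite the text and the flag.
            phoneme = dict(grapheme)
            phoneme["normalized_text"] = to_phonemes(grapheme["normalized_text"])
            phoneme["is_phoneme"] = 1
            entries.extend([grapheme, phoneme])

    output = manifest.parent / f"{manifest.stem}_phonemes{manifest.suffix}"
    with open(output, "w", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry) + "\n")
    return output


def fake_phonemizer(text: str) -> str:
    # Hypothetical stand-in: upper-casing is NOT phonemization.
    return text.upper()


with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "train_manifest.json"
    sample = {"audio_filepath": "a.wav", "normalized_text": "guten tag"}
    manifest.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    out = add_phoneme_entries(manifest, fake_phonemizer)
    print(out.name)  # train_manifest_phonemes.json
    print(out.read_text(encoding="utf-8"))
```

Passing the converter as a parameter keeps the file handling testable on machines without the espeak backend, at the cost of one call per line rather than a single batched call.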
   {
    "cell_type": "markdown",
@@ -553,10 +555,7 @@
     "  f.write(tmp)"
    ],
    "metadata": {
-    "collapsed": false,
-    "pycharm": {
-     "name": "#%%\n"
-    }
+    "collapsed": false
    }
   },
   {