Update NeMo_TTS_Primer.ipynb (#6436)

* Update NeMo_TTS_Primer.ipynb Changed a mistake in line 782. Instead of frequency band (ie. pitch) we should write frequency bin. Note that frequency bins in FFT are not related to pitch. Signed-off-by: Mostafa Ghorbandoost <mos.ghorbandoost@gmail.com> * Update NeMo_TTS_Primer.ipynb Corrected the description of spectrogram and mel spectrogram calculations in lines 782 & 783 and added a fourth point to the description and added a reference for more mathematical details at the end of this point. Signed-off-by: Mostafa Ghorbandoost <mos.ghorbandoost@gmail.com> --------- Signed-off-by: Mostafa Ghorbandoost <mos.ghorbandoost@gmail.com>
NVIDIA · Apr 19, 2023 · be711c9 · be711c9
1 parent a365879
commit be711c9
Showing 1 changed file with 4 additions and 3 deletions.
diff --git a/tutorials/tts/NeMo_TTS_Primer.ipynb b/tutorials/tts/NeMo_TTS_Primer.ipynb
@@ -777,10 +777,11 @@
     "While raw audio shows amplitude versus time and is useful for easily recording and listening, it is not optimal when it comes to processing.\n",
     "\n",
     "For processing, it is usually preferable to represent the audio as a **spectrogram** which shows frequency versus time. Specifically, we:\n",
-    "\n",
+    "\n",  
     "1.   Group together audio samples into a much smaller set of time buckets, called **audio frames**. An audio frame will usually bucket around 50ms of audio.\n",
-    "2.   For each audio frame, use the [Fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (**FFT**) to calculate the magnitude (ie. energy, amplitude or \"loudness\") and phase (which we don't use) of each frequency band (ie. pitch).\n",
-    "3.   Translate the original frequency bands, measured in units of hertz (Hz), into units of [mel frequency](https://en.wikipedia.org/wiki/Mel_scale). The output is called a **mel spectrogram**.\n",
+    "2.   For each audio frame, use the [Fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (**FFT**) to calculate the magnitude (ie. energy, amplitude or \"loudness\") and phase (which we don't use) of each frequency bin. We refer to the magnitudes of the frequency bins as a spectrogram\n",
+    "3.   Map the original frequency bins onto the [mel scale](https://en.wikipedia.org/wiki/Mel_scale), using overlapped [triangular filters](https://en.wikipedia.org/wiki/Window_function#Triangular_window) to create mel filterbanks.\n",
+    "4.   Multiply the original spectrogram by the mel filterbanks to produce a mel spectrogram (for more details see [here](https://www.mathworks.com/help/audio/ref/melspectrogram.html)).\n",
     "\n",
     "We then use the mel spectrogram as our final audio representation. The only thing we lose during this process is the phase information, the implications of which we will discuss more later on.\n",
     "\n",