diff --git a/tutorials/tts/NeMo_TTS_Primer.ipynb b/tutorials/tts/NeMo_TTS_Primer.ipynb index b6df5b91ba89..99306744dd05 100644 --- a/tutorials/tts/NeMo_TTS_Primer.ipynb +++ b/tutorials/tts/NeMo_TTS_Primer.ipynb @@ -777,10 +777,11 @@ "While raw audio shows amplitude versus time and is useful for easily recording and listening, it is not optimal when it comes to processing.\n", "\n", "For processing, it is usually preferable to represent the audio as a **spectrogram** which shows frequency versus time. Specifically, we:\n", - "\n", + "\n", "1. Group together audio samples into a much smaller set of time buckets, called **audio frames**. An audio frame will usually bucket around 50ms of audio.\n", - "2. For each audio frame, use the [Fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (**FFT**) to calculate the magnitude (ie. energy, amplitude or \"loudness\") and phase (which we don't use) of each frequency band (ie. pitch).\n", - "3. Translate the original frequency bands, measured in units of hertz (Hz), into units of [mel frequency](https://en.wikipedia.org/wiki/Mel_scale). The output is called a **mel spectrogram**.\n", + "2. For each audio frame, use the [Fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (**FFT**) to calculate the magnitude (ie. energy, amplitude or \"loudness\") and phase (which we don't use) of each frequency bin. We refer to the magnitudes of the frequency bins as a spectrogram\n", + "3. Map the original frequency bins onto the [mel scale](https://en.wikipedia.org/wiki/Mel_scale), using overlapped [triangular filters](https://en.wikipedia.org/wiki/Window_function#Triangular_window) to create mel filterbanks.\n", + "4. Multiply the original spectrogram by the mel filterbanks to produce a mel spectrogram (for more details see [here](https://www.mathworks.com/help/audio/ref/melspectrogram.html)).\n", "\n", "We then use the mel spectrogram as our final audio representation. The only thing we lose during this process is the phase information, the implications of which we will discuss more later on.\n", "\n",