Commit

Update NeMo_TTS_Primer.ipynb (#6436)
* Update NeMo_TTS_Primer.ipynb

Fixed a mistake in line 782: the text should say frequency bin, not frequency band (i.e. pitch). Note that FFT frequency bins are not related to pitch.

Signed-off-by: Mostafa Ghorbandoost <mos.ghorbandoost@gmail.com>

* Update NeMo_TTS_Primer.ipynb

Corrected the description of the spectrogram and mel spectrogram calculations in lines 782 & 783, added a fourth point to the description, and added a reference for more mathematical details at the end of that point.

Signed-off-by: Mostafa Ghorbandoost <mos.ghorbandoost@gmail.com>

---------

Signed-off-by: Mostafa Ghorbandoost <mos.ghorbandoost@gmail.com>
pythinker authored Apr 19, 2023
1 parent a365879 commit be711c9
Showing 1 changed file with 4 additions and 3 deletions.
7 changes: 4 additions & 3 deletions tutorials/tts/NeMo_TTS_Primer.ipynb
@@ -777,10 +777,11 @@
"While raw audio shows amplitude versus time and is useful for easily recording and listening, it is not optimal when it comes to processing.\n",
"\n",
"For processing, it is usually preferable to represent the audio as a **spectrogram** which shows frequency versus time. Specifically, we:\n",
-"\n",
+"\n",
"1. Group together audio samples into a much smaller set of time buckets, called **audio frames**. An audio frame will usually bucket around 50ms of audio.\n",
-"2. For each audio frame, use the [Fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (**FFT**) to calculate the magnitude (ie. energy, amplitude or \"loudness\") and phase (which we don't use) of each frequency band (ie. pitch).\n",
-"3. Translate the original frequency bands, measured in units of hertz (Hz), into units of [mel frequency](https://en.wikipedia.org/wiki/Mel_scale). The output is called a **mel spectrogram**.\n",
+"2. For each audio frame, use the [Fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (**FFT**) to calculate the magnitude (ie. energy, amplitude or \"loudness\") and phase (which we don't use) of each frequency bin. We refer to the magnitudes of the frequency bins as a spectrogram\n",
+"3. Map the original frequency bins onto the [mel scale](https://en.wikipedia.org/wiki/Mel_scale), using overlapped [triangular filters](https://en.wikipedia.org/wiki/Window_function#Triangular_window) to create mel filterbanks.\n",
+"4. Multiply the original spectrogram by the mel filterbanks to produce a mel spectrogram (for more details see [here](https://www.mathworks.com/help/audio/ref/melspectrogram.html)).\n",
"\n",
"We then use the mel spectrogram as our final audio representation. The only thing we lose during this process is the phase information, the implications of which we will discuss more later on.\n",
"\n",
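
The four steps in the updated text can be sketched in plain Python as follows. This is only an illustrative toy, not code from the notebook or from NeMo: all function names are hypothetical, the mel formula is one common variant (others exist), no window function is applied to the frames, and a naive DFT stands in for the FFT so the block needs nothing beyond the standard library. In practice an array library with a real FFT would be used instead.

```python
import cmath
import math


def hz_to_mel(hz):
    # One common mel formula; assumed here, other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def frame_signal(samples, frame_len, hop_len):
    # Step 1: group samples into (possibly overlapping) audio frames.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop_len)]


def magnitude_spectrum(frame):
    # Step 2: magnitude of each frequency bin; the phase is discarded.
    # Naive O(n^2) DFT for clarity; an FFT computes the same values faster.
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def mel_filterbank(n_mels, frame_len, sample_rate):
    # Step 3: overlapping triangular filters whose edges are evenly
    # spaced on the mel scale between 0 Hz and the Nyquist frequency.
    max_mel = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1))
                for i in range(n_mels + 2)]
    # Center frequency in Hz of each FFT bin.
    bin_hz = [k * sample_rate / frame_len for k in range(frame_len // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        bank.append([max(0.0, min((f - lo) / (mid - lo),
                                  (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def mel_spectrogram(samples, sample_rate, frame_len, hop_len, n_mels):
    frames = frame_signal(samples, frame_len, hop_len)
    spec = [magnitude_spectrum(f) for f in frames]  # frames x bins
    bank = mel_filterbank(n_mels, frame_len, sample_rate)
    # Step 4: multiply the spectrogram by the mel filterbank.
    return [[sum(w * s for w, s in zip(filt, row)) for filt in bank]
            for row in spec]


# Toy input: a 1 kHz sine sampled at 8 kHz, with 64-sample frames.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
mel = mel_spectrogram(tone, sample_rate=8000, frame_len=64, hop_len=32, n_mels=8)
```

The result has one row per audio frame and one column per mel band. Because only magnitudes are kept in step 2, the phase cannot be recovered from it, which matches the remark in the text about losing phase information.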