Update NeMo_TTS_Primer.ipynb #6436

Merged 2 commits on Apr 19, 2023
7 changes: 4 additions & 3 deletions tutorials/tts/NeMo_TTS_Primer.ipynb
@@ -777,10 +777,11 @@
 "While raw audio shows amplitude versus time and is useful for easily recording and listening, it is not optimal when it comes to processing.\n",
 "\n",
 "For processing, it is usually preferable to represent the audio as a **spectrogram** which shows frequency versus time. Specifically, we:\n",
-"\n",
+"\n",
 "1. Group together audio samples into a much smaller set of time buckets, called **audio frames**. An audio frame will usually bucket around 50ms of audio.\n",
-"2. For each audio frame, use the [Fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (**FFT**) to calculate the magnitude (ie. energy, amplitude or \"loudness\") and phase (which we don't use) of each frequency band (ie. pitch).\n",
-"3. Translate the original frequency bands, measured in units of hertz (Hz), into units of [mel frequency](https://en.wikipedia.org/wiki/Mel_scale). The output is called a **mel spectrogram**.\n",
+"2. For each audio frame, use the [Fast Fourier transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (**FFT**) to calculate the magnitude (ie. energy, amplitude or \"loudness\") and phase (which we don't use) of each frequency bin. We refer to the magnitudes of the frequency bins as a spectrogram\n",
+"3. Map the original frequency bins onto the [mel scale](https://en.wikipedia.org/wiki/Mel_scale), using overlapped [triangular filters](https://en.wikipedia.org/wiki/Window_function#Triangular_window) to create mel filterbanks.\n",
+"4. Multiply the original spectrogram by the mel filterbanks to produce a mel spectrogram (for more details see [here](https://www.mathworks.com/help/audio/ref/melspectrogram.html)).\n",
 "\n",
 "We then use the mel spectrogram as our final audio representation. The only thing we lose during this process is the phase information, the implications of which we will discuss more later on.\n",
 "\n",
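The four steps in the updated notebook text (framing, FFT magnitudes, mel filterbank construction, and the final multiplication) can be sketched in plain NumPy. This is only an illustrative sketch and not the code NeMo uses: the function names, the Hann window, the `2595 * log10(1 + f / 700)` mel formula, and the parameter values (`n_fft=1024`, `hop_length=256`, `n_mels=80` at 22050 Hz, giving frames of roughly 46 ms) are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(hz):
    # One common mel formula (assumed here; other variants exist).
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sample_rate, n_fft, n_mels):
    """Step 3: overlapped triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_mels + 2 edge points, evenly spaced on the mel scale, converted back to Hz.
    edges = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    )
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right.
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_spectrogram(audio, sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80):
    # Step 1: group samples into overlapping audio frames.
    n_frames = 1 + (len(audio) - n_fft) // hop_length
    idx = np.arange(n_fft)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(n_fft)  # Hann window is an assumption
    # Step 2: FFT per frame; keep the magnitudes and discard the phase.
    spectrogram = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_bins)
    # Steps 3 and 4: build the filterbank and multiply.
    fb = mel_filterbank(sample_rate, n_fft, n_mels)
    return spectrogram @ fb.T  # (n_frames, n_mels)

# Usage: one second of a 440 Hz tone.
sr = 22050
tone = np.sin(2.0 * np.pi * 440.0 * np.arange(sr) / sr)
mel = mel_spectrogram(tone, sample_rate=sr)
print(mel.shape)  # (83, 80)
```

Because the phase is dropped in step 2, the original waveform cannot be recovered exactly from `mel`, which is the loss the notebook text mentions.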