This repo supports one speech dataset out of the box (see the preprocessors in `preprocess.py`). You can use any other dataset if you write a preprocessor for it.
Each training example consists of:
- The text that was spoken
- A mel-scale spectrogram of the audio
- A linear-scale spectrogram of the audio
The preprocessor is responsible for generating these; see `nawar.py` for a commented example.
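To make the shape of a training example concrete, here is what one might look like on disk after preprocessing. The filenames and array dimensions below are illustrative, not something the repo fixes:

```python
import numpy as np

# One training example after preprocessing (names and shapes are examples only).
text = 'an example transcript'
mel = np.load('mydataset-mel-00001.npy')      # time-major, e.g. (n_frames, 80)
linear = np.load('mydataset-spec-00001.npy')  # time-major, e.g. (n_frames, 1025)
```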
For each training example, a preprocessor should:
- Load the audio file:

  ```python
  wav = audio.load_wav(wav_path)
  ```
- Compute linear-scale and mel-scale spectrograms (float32 numpy arrays):

  ```python
  spectrogram = audio.spectrogram(wav).astype(np.float32)
  mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
  ```
- Save the spectrograms to disk:

  ```python
  np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
  np.save(os.path.join(out_dir, mel_spectrogram_filename), mel_spectrogram.T, allow_pickle=False)
  ```

  Note that the transpose of the matrix returned by `audio.spectrogram` is saved so that it's in time-major format.
- Generate a tuple `(spectrogram_filename, mel_spectrogram_filename, n_frames, text)` to write to train.txt. `n_frames` is just the length of the time axis of the spectrogram.
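Putting these steps together, a minimal preprocessor might look like the sketch below. It is single-process and assumes a pipe-delimited metadata.csv mapping wav filenames to transcripts; the metadata format, directory layout, naming scheme, and the `from util import audio` import path are assumptions to adapt to your dataset and this repo's actual structure.

```python
import os

import numpy as np

from util import audio  # assumed location of the repo's audio helpers


def build_from_path(in_dir, out_dir):
  '''Minimal single-process sketch. Assumes a pipe-delimited metadata.csv of
  "wav_name|text" lines; adapt the parsing to your dataset's layout.'''
  metadata = []
  with open(os.path.join(in_dir, 'metadata.csv'), encoding='utf-8') as f:
    for index, line in enumerate(f, start=1):
      wav_name, text = line.strip().split('|')
      wav_path = os.path.join(in_dir, 'wavs', wav_name + '.wav')

      # 1. Load the audio file.
      wav = audio.load_wav(wav_path)

      # 2. Compute linear-scale and mel-scale spectrograms.
      spectrogram = audio.spectrogram(wav).astype(np.float32)
      mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
      n_frames = spectrogram.shape[1]  # length of the time axis

      # 3. Save both to disk, transposed into time-major format.
      spectrogram_filename = 'mydataset-spec-%05d.npy' % index
      mel_filename = 'mydataset-mel-%05d.npy' % index
      np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
      np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)

      # 4. Collect the tuple that becomes one line of train.txt.
      metadata.append((spectrogram_filename, mel_filename, n_frames, text))
  return metadata
```

See `nawar.py` for a fully worked, commented version of this flow.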
After you've written your preprocessor, you can add it to `preprocess.py` by following the example of the other preprocessors in that file.
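The exact wiring depends on what `preprocess.py` currently contains, but mirroring the existing preprocessors it will look roughly like this sketch; `mydataset`, the directory names, and the `write_metadata` helper are placeholders for whatever the file actually uses:

```python
# In preprocess.py (sketch; mirror the existing preprocessors in the file).
import os

from datasets import mydataset  # placeholder: your new preprocessor module


def preprocess_mydataset(args):
  in_dir = os.path.join(args.base_dir, 'MyDataset')  # wherever your data lives
  out_dir = os.path.join(args.base_dir, args.output)
  os.makedirs(out_dir, exist_ok=True)
  metadata = mydataset.build_from_path(in_dir, out_dir)
  write_metadata(metadata, out_dir)  # placeholder for whatever writes train.txt
```

You would then register the new dataset name wherever `preprocess.py` selects a preprocessor by dataset name.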