HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform
Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain; its sinusoidal source is derived from the fundamental frequency (F0), inferred by a pre-trained F0 estimation network for fast inference. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves comparable performance to BigVGAN while being four times faster with only 1/6 of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high-quality speech synthesis.
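For intuition about the source module, below is a minimal PyTorch sketch of a harmonic-plus-noise excitation driven by an F0 contour. It is an illustrative simplification, not this repository's implementation: the class, its parameters, and the flat harmonic mixing are assumptions, and the actual HiFTNet source filter operates in the time-frequency domain before the iSTFT head.

```python
import torch

class HarmonicPlusNoiseSource(torch.nn.Module):
    """Illustrative harmonic-plus-noise excitation: a sum of sine
    harmonics driven by F0 in voiced frames, plus Gaussian noise.
    A simplified sketch of the NSF-style source HiFTNet builds on,
    not the repository's actual module."""
    def __init__(self, sample_rate=22050, num_harmonics=8, noise_std=0.003):
        super().__init__()
        self.sample_rate = sample_rate
        self.num_harmonics = num_harmonics
        self.noise_std = noise_std

    def forward(self, f0):
        # f0: (batch, num_samples) F0 contour upsampled to the waveform
        # rate; zeros mark unvoiced samples.
        voiced = (f0 > 0).float()
        # Integrate instantaneous frequency to obtain the sine phase.
        phase = 2 * torch.pi * torch.cumsum(f0 / self.sample_rate, dim=1)
        harmonics = torch.stack(
            [torch.sin((k + 1) * phase) for k in range(self.num_harmonics)],
            dim=1,
        ).mean(dim=1)
        noise = self.noise_std * torch.randn_like(f0)
        # Voiced regions: harmonics plus noise; unvoiced regions: noise only.
        return voiced * harmonics + noise

if __name__ == "__main__":
    src = HarmonicPlusNoiseSource()
    f0 = torch.full((1, 22050), 220.0)  # one second of a 220 Hz tone
    excitation = src(f0)                # (1, 22050) waveform-rate excitation
```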
Paper: https://arxiv.org/abs/2309.09493
Audio samples: https://hiftnet.github.io/
Check out our TTS work that uses HiFTNet as the speech decoder for human-level speech synthesis here: https://github.com/yl4579/StyleTTS2
Pre-requisites and installation:
- Python >= 3.7
- Clone this repository:
  git clone https://github.com/yl4579/HiFTNet.git
  cd HiFTNet
- Install Python requirements:
  pip install -r requirements.txt
To train the model, run:
python train.py --config config_v1.json --[args]
For F0 model training, please refer to yl4579/PitchExtractor. This repo includes an F0 model pre-trained on LibriTTS. Still, you may want to train your own F0 model for best performance, particularly on noisy or non-speech data, as we found F0 estimation accuracy to be essential to vocoder performance.
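For illustration only, here is a hedged sketch of loading a JDC-style pitch extractor and running it on mel-spectrograms. The module path Utils.JDC.model, the JDCNet class and its arguments, the checkpoint path and key, and the tensor shapes are all assumptions carried over from related repos (yl4579/PitchExtractor, StyleTTS); defer to this repo's train.py and inference.ipynb for the actual usage.

```python
import torch
# Module path and class assumed from yl4579/PitchExtractor / StyleTTS-style repos.
from Utils.JDC.model import JDCNet

# Hypothetical checkpoint path and key layout; adjust to the files in this repo.
F0_model = JDCNet(num_class=1, seq_len=192)
ckpt = torch.load("Utils/JDC/bst.t7", map_location="cpu")
F0_model.load_state_dict(ckpt["net"])
F0_model.eval()

with torch.no_grad():
    mel = torch.randn(1, 80, 192)          # (batch, n_mels, frames); shape assumed
    f0, _, _ = F0_model(mel.unsqueeze(1))  # returned tuple layout assumed
```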
For inference, please refer to the notebook inference.ipynb for details.
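If you prefer a script over the notebook, a minimal sketch of HiFi-GAN-style inference might look like the following. The env.AttrDict and models.Generator imports, the checkpoint file name and "generator" key, and the generator call signature are assumptions based on the HiFi-GAN/iSTFTNet lineage this repo derives from; the notebook is the authoritative reference.

```python
import json
import torch
# Module names assumed from the HiFi-GAN/iSTFTNet lineage.
from env import AttrDict
from models import Generator

with open("config_v1.json") as f:
    h = AttrDict(json.load(f))

generator = Generator(h)  # may also require the pre-trained F0 model; see the notebook
state = torch.load("g_00100000", map_location="cpu")  # checkpoint name hypothetical
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()

with torch.no_grad():
    mel = torch.randn(1, h.num_mels, 200)  # replace with a real mel-spectrogram
    audio = generator(mel)                 # call signature assumed
```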
You can download the pre-trained LJSpeech model here and the pre-trained LibriTTS model here. The pre-trained models contain the optimizer and discriminator states, which can be used for fine-tuning.
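In HiFi-GAN-derived repos, the generator weights and the discriminator/optimizer states are usually stored in separate g_* and do_* checkpoint files; below is a hedged sketch for inspecting them before fine-tuning, with the file names and key layout assumed rather than taken from this repo.

```python
import torch

# File names follow the HiFi-GAN convention (g_* for the generator,
# do_* for discriminators and optimizers); the exact names here are assumptions.
state_g = torch.load("g_00100000", map_location="cpu")
state_do = torch.load("do_00100000", map_location="cpu")

print(list(state_g.keys()))   # expect something like ['generator']
print(list(state_do.keys()))  # expect discriminator weights plus optim_g / optim_d state
```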