Implementation of DCComix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer

lakahaga/dc-comix-tts

dc-comix-tts

Implementation of "DCComix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer", accepted to Interspeech 2023. Audio samples and a demo for this system are available here

Abstract: Despite the huge successes made in neutral TTS, content-leakage remains a challenge. In this paper, we propose a new input representation and simple architecture to achieve improved prosody modeling. Inspired by the recent success in the use of discrete code in TTS, we introduce discrete code to the input of the reference encoder. Specifically, we leverage the vector quantizer from the audio compression model to exploit the diverse acoustic information it has already been trained on. In addition, we apply the modified MLP-Mixer to the reference encoder, making the architecture lighter. As a result, we train the prosody transfer TTS in an end-to-end manner. We prove the effectiveness of our method through both subjective and objective evaluations. We demonstrate that the reference encoder learns better speaker-independent prosody when discrete code is utilized as input in the experiments. In addition, we obtain comparable results even when fewer parameters are inputted.
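To make the "discrete code" idea in the abstract concrete, here is a minimal sketch of residual vector quantization, the family of quantizer used by EnCodec. It is an illustration only: the codebooks are toy, hand-picked values (not EnCodec's learned ones), the names `nearest_index`, `rvq_encode`, and `CODEBOOKS` are hypothetical, and how this repository feeds the resulting codes into the reference encoder is not shown here.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_index(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, entry))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vec: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], List[float]]:
    """Quantize vec stage by stage; each stage encodes the residual left by the previous one.

    Returns (codes, reconstruction), where codes holds one index per stage.
    """
    residual = list(vec)
    recon = [0.0] * len(vec)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        chosen = codebook[idx]
        codes.append(idx)
        recon = [r + c for r, c in zip(recon, chosen)]
        residual = [r - c for r, c in zip(residual, chosen)]
    return codes, recon


# Toy codebooks (assumed values for illustration, not learned parameters).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage
]

codes, recon = rvq_encode([1.2, 0.9], CODEBOOKS)
print(codes, recon)  # prints: [1, 1] [1.25, 1.0]
```

The integer indices in `codes` are the discrete code for the input frame. In a real codec the input would be an encoder latent per audio frame and the codebooks would be learned during training.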

Installation

  • python ≥ 3.8
  • pytorch 1.11.0+cu113
  • nemo_toolkit 1.18.0

See requirements.txt for other libraries

Training

  • prepare data (VCTK)
    python preprocess/make_manifest.py
    
    • Note that we resample the VCTK audio to 24 kHz to match the sampling rate of EnCodec
  • preprocessing
    • text normalization
    python torchdata/text_preprocess.py
    
  • run train.py
    • for dc-comix-tts: use ref_mixer_codec_vits.yaml
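As a rough illustration of the data preparation step, the sketch below writes a JSON-lines manifest and skips any file that is not sampled at 24 kHz. It is not the repository's `preprocess/make_manifest.py`, whose exact output is not shown here. The `audio_filepath`, `duration`, and `text` keys follow the common NeMo manifest convention; the `speaker` key, the helper names, and the assumption that the audio has already been converted to 24 kHz WAV files are all hypothetical.

```python
import json
import wave

TARGET_SR = 24_000  # sampling rate the audio is resampled to, per the note above


def wav_info(path):
    """Return (sample_rate, duration_seconds) read from a WAV file header."""
    with wave.open(str(path), "rb") as wf:
        sample_rate = wf.getframerate()
        return sample_rate, wf.getnframes() / sample_rate


def build_manifest(entries, manifest_path):
    """Write one JSON object per line for each entry sampled at TARGET_SR.

    entries: iterable of (wav_path, transcript, speaker_id) tuples.
    Files with a different sampling rate are skipped.
    Returns the number of lines written.
    """
    written = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, text, speaker in entries:
            sample_rate, duration = wav_info(wav_path)
            if sample_rate != TARGET_SR:
                continue  # not resampled yet; leave it out of the manifest
            record = {
                "audio_filepath": str(wav_path),
                "duration": round(duration, 3),
                "text": text,
                "speaker": speaker,  # assumed key name for the speaker ID
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

A call such as `build_manifest([("p225_001.wav", "Please call Stella.", "p225")], "train_manifest.json")` would produce one line per accepted utterance; the file names in this call are placeholders.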

References

@software{Harper_NeMo_a_toolkit,
author = {Harper, Eric and Majumdar, Somshubra and Kuchaiev, Oleksii and Jason, Li and Zhang, Yang and Bakhturina, Evelina and Noroozi, Vahid and Subramanian, Sandeep and Nithin, Koluguri and Jocelyn, Huang and Jia, Fei and Balam, Jagadeesh and Yang, Xuesong and Livne, Micha and Dong, Yi and Naren, Sean and Ginsburg, Boris},
title = {{NeMo: a toolkit for Conversational AI and Large Language Models}},
url = {https://github.com/NVIDIA/NeMo}
}
@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}
