Fine-tune SpeechT5 for non-English text-to-speech tasks, implemented in PyTorch.
This repository contains code and resources for fine-tuning (or training) a SpeechT5 model on a non-English language for a text-to-speech task. The project uses Hugging Face's transformers library and speechbrain to load the necessary models and tools. The rest of the code, such as the data preprocessing and the training and evaluation functions, is implemented in plain PyTorch, so feel free to make any changes you need to train your model efficiently.
The main objective of this project is to fine-tune the SpeechT5 model for text-to-speech on a non-English language. The steps include:
- Setting up the environment.
- Loading necessary tools (tokenizer and feature extractor) and models (SpeechT5 itself, a model to generate X-vector speaker embeddings, and the vocoder).
- Most importantly: adding the unique characters of the target language to the tokenizer and resizing the model's input embedding matrix accordingly.
- Loading and preprocessing your data.
- Training and evaluating the model.
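The vocabulary-extension step is the heart of the project, so here is a minimal sketch of its logic. `find_missing_characters` is a hypothetical helper, not part of this repo; in the actual code, once the missing characters are known, they are added with the transformers calls `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
def find_missing_characters(corpus, vocab):
    """Return the characters that appear in the corpus but are absent
    from the tokenizer vocabulary, sorted for reproducibility."""
    seen = set()
    for sentence in corpus:
        seen.update(sentence)
    return sorted(ch for ch in seen if ch not in vocab)

# Toy example: an English-only vocabulary confronted with Persian text.
vocab = set("abcdefghijklmnopqrstuvwxyz .,")
corpus = ["hello", "سلام دنیا"]
missing = find_missing_characters(corpus, vocab)
print(missing)  # the Persian characters that must be added to the tokenizer
```

Collecting the characters from the full training corpus (rather than a sample) guarantees the tokenizer never maps an input character to `<unk>` during training.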
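The text side of the preprocessing step can be illustrated with a small sketch. `clean_text` and its rules (lowercasing, whitespace collapsing, dropping unsupported characters) are hypothetical; the normalization your language needs may differ.

```python
import re

def clean_text(text, supported_chars):
    """Normalize a transcript before tokenization: lowercase it, replace
    characters outside the tokenizer's alphabet with spaces, then collapse
    runs of whitespace."""
    text = text.lower()
    text = "".join(ch if ch in supported_chars else " " for ch in text)
    return re.sub(r"\s+", " ", text).strip()

supported = set("abcdefghijklmnopqrstuvwxyz ")
print(clean_text("Hello,   WORLD!", supported))  # -> "hello world"
```

Replacing unknown characters with spaces (instead of deleting them) keeps word boundaries intact when punctuation sits between two words.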
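The training and evaluation functions follow the standard PyTorch loop. Below is a minimal sketch of that pattern with a tiny stand-in model; the real code trains SpeechT5 with a spectrogram loss, so the model, data, and loss function here are placeholders chosen only to keep the loop readable and runnable.

```python
import torch
from torch import nn

def train_one_epoch(model, loader, optimizer, loss_fn):
    """Run one pass over the data, updating weights; return the mean loss."""
    model.train()
    total = 0.0
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)

@torch.no_grad()
def evaluate(model, loader, loss_fn):
    """Compute the mean loss without gradient tracking or weight updates."""
    model.eval()
    total = 0.0
    for inputs, targets in loader:
        total += loss_fn(model(inputs), targets).item()
    return total / len(loader)

# Toy data: learn y = 2x with a one-parameter linear model.
torch.manual_seed(0)
x = torch.randn(64, 1)
loader = [(x, 2 * x)]
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

first = train_one_epoch(model, loader, optimizer, loss_fn)
for _ in range(50):
    train_one_epoch(model, loader, optimizer, loss_fn)
final = evaluate(model, loader, loss_fn)
```

Keeping `train_one_epoch` and `evaluate` as free functions that accept the model, data loader, and loss makes it easy to swap in your own dataset or scheduler without touching the loop itself.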
Here are some samples generated by a model I trained on the Persian subset of the Common Voice dataset.
Sample 1
1.mp4
Sample 2
2.mp4
Sample 3
3.mp4
Sample 4
4.mp4
Sample 5
5.mp4
This code draws on the Hugging Face Audio Course chapter on fine-tuning SpeechT5:
https://huggingface.co/learn/audio-course/en/chapter6/fine-tuning