SpeechT5-Non-English-TTS

Fine-tune SpeechT5 for non-English text-to-speech tasks, implemented in PyTorch.

(SpeechT5 framework diagram)

This repository contains code and resources for fine-tuning (or training) a SpeechT5 model on a non-English language for text-to-speech. The project uses Hugging Face's transformers library and SpeechBrain to load the required models and tools; the rest of the code, such as data preprocessing and the training and evaluation loops, is implemented entirely in PyTorch, so feel free to modify it as needed to train your model efficiently.
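For reference, loading the pretrained components with transformers and SpeechBrain typically looks like the sketch below. This is a minimal, illustrative example rather than the exact code in this repo: the checkpoint names are the standard public ones (microsoft/speecht5_tts, microsoft/speecht5_hifigan, speechbrain/spkrec-xvect-voxceleb), and the speaker_embedding helper is hypothetical.

```python
# Minimal sketch (not the exact code in this repository): load the SpeechT5
# checkpoint, the HiFi-GAN vocoder, and an x-vector speaker encoder.
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in newer SpeechBrain versions

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")   # tokenizer + feature extractor
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# x-vector encoder used to compute per-utterance speaker embeddings
speaker_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_models/spkrec-xvect-voxceleb",
)

def speaker_embedding(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return a normalized x-vector for a mono 16 kHz waveform of shape (time,)."""
    with torch.no_grad():
        emb = speaker_encoder.encode_batch(waveform_16khz.unsqueeze(0))  # (1, 1, 512)
        emb = torch.nn.functional.normalize(emb, dim=2)
    return emb.squeeze()  # (512,)
```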


Project Overview

The main objective of this project is to fine-tune the SpeechT5 model for text-to-speech in a non-English language. The steps include:

  1. Setting up the environment.
  2. Loading the necessary tools (tokenizer and feature extractor) and models (SpeechT5 itself, a model that generates x-vector speaker embeddings, and the vocoder), as in the loading sketch above.
  3. Most importantly: adding the unique characters of the target language to the tokenizer and resizing the model's input embedding matrix accordingly (see the sketch after this list).
  4. Loading and preprocessing your data.
  5. Training and evaluating the model.
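A common way to implement step 3 is sketched below, using the standard transformers APIs (add_tokens and resize_token_embeddings); the actual implementation in this repo may differ. The dataset object and its "sentence" column are assumptions for illustration.

```python
# Illustrative sketch of step 3: add the target language's characters to the
# tokenizer and grow the model's input embedding matrix to match.
tokenizer = processor.tokenizer

# Collect every character that appears in the training text (hypothetical
# `dataset` with a "sentence" column) but is missing from the vocabulary.
corpus_text = " ".join(example["sentence"] for example in dataset)
new_chars = sorted({ch for ch in corpus_text if not ch.isspace()} - set(tokenizer.get_vocab()))

num_added = tokenizer.add_tokens(new_chars)
if num_added > 0:
    # Expand the text-encoder embedding table; the new rows are randomly
    # initialized and learned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
```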

Generated Samples

Here are some generated samples from the model that I trained on the Persian Common Voice dataset.

Sample 1

1.mp4

Sample 2

2.mp4

Sample 3

3.mp4

Sample 4

4.mp4

Sample 5

5.mp4

References

This code draws on the fine-tuning chapter of the Hugging Face Audio Course:
https://huggingface.co/learn/audio-course/en/chapter6/fine-tuning
