Examples of a good fine-tune? #65
---
https://www.youtube.com/watch?v=Tuz7_7q0Pr0 Trained on this interview: https://www.youtube.com/watch?v=ozOoONmJ9EQ
---
These are the results I got using the default config for 50 epochs (past SLM adversarial training). SLM adversarial training is the most VRAM-consuming part, and I don't know how to mitigate this problem and fit it on smaller machines. Maybe techniques used for LLM fine-tuning could help, as we are working with large speech language models here.
---
Here are the results I have from two different models. Aurora: 50 epochs with joint training after 10; 8 hours of audio, single voice; batch size 2, max length 220. AuroraTest1.webm
---
I fine-tuned on LibriTTS with a single Brazilian Portuguese speaker, processing approximately 24 hours of audio over 60 epochs. Link: https://drive.google.com/file/d/1pBqHbIuuaO7jvMsnnpbjrsFAPcHZKr41/view?usp=sharing I'm using multilingual PL-BERT. Does anyone have an idea why there is this annoying noise at the end of the audio clip? Thanks! Jonathan S. Santos
---
I believe it's from the lack of a silence pad of at least 400 ms. Another thing: if the audio files are longer than the max length, do the math on the seconds versus the sampling rate. I haven't tried training in Portuguese yet; as soon as I finish the LLMs I'll release a checkpoint in Portuguese. If you can, please share your checkpoint.
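A minimal command-line sketch of the silence-pad suggestion, assuming sox is available; the wavs/ and padded/ paths are placeholders, not anything from this thread:

```bash
# Pad 400 ms of silence onto the start and end of every clip.
# sox's `pad` effect takes lengths in seconds; paths are placeholders.
mkdir -p padded
for f in wavs/*.wav; do
  sox "$f" "padded/$(basename "$f")" pad 0.4 0.4
done
```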
---
I tinkered around with the config_ft.yml file and discovered I can do style diffusion and SLM adversarial training in one session on my 4090: batch_size is set to 2 and batch_percentage is set to 1. Note this also works on a 7900 XTX. I'm using virtual console mode, and I used nvtop to close any program eating up VRAM. Epochs is set to 100 because DiscLM is usually at 0. I'm using Vokan as the base model.

Most of the audio files I gathered had background music and noise, so I cleaned them with resemble-enhance (denoising via the Gradio app version, not the command-line version) and the Audacity plugin Acon Digital DeVerberate 3. Then I used the Audacity plugin Trim Extend to add 200 milliseconds of silence at the beginning and end of each file.

Edit: [screenshot of VRAM usage]

This is what I did in RunPod.
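A hedged sketch of the config edits described above, assuming the field names from StyleTTS2's Configs/config_ft.yml (batch_size and epochs are top-level; batch_percentage sits under slmadv_params); only the values are the ones stated in this comment:

```bash
# Apply the values mentioned above to the fine-tuning config.
# Field names assumed from StyleTTS2's Configs/config_ft.yml.
sed -i 's/^batch_size:.*/batch_size: 2/' Configs/config_ft.yml
sed -i 's/^epochs:.*/epochs: 100/' Configs/config_ft.yml          # DiscLM is usually at 0
sed -i 's/batch_percentage:.*/batch_percentage: 1/' Configs/config_ft.yml
```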
Next, I install the required packages.
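The exact packages were shown in the original post; this is only a hedged guess reconstructed from the later steps (aria2 and unzip for transfers, nvtop for monitoring, plus the StyleTTS2 requirements):

```bash
# Assumed setup on a fresh RunPod container; not the original commands.
apt-get update && apt-get install -y aria2 unzip nvtop
git clone https://github.com/yl4579/StyleTTS2
cd StyleTTS2
pip install -r requirements.txt
```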
I use the pwd command to find directory/filepath information.
I put the training dataset in a zip file and upload it to either https://catbox.moe/ or https://litterbox.catbox.moe/ (which lets you upload a 1 GB file). I then download the Vokan base model and the dataset zip with aria2,
or with the gofile downloader (https://github.com/ltsdw/gofile-downloader). Link for the Vokan model.
I unzip the file.
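A hedged sketch of the upload/download/unzip steps. The catbox upload endpoint is the publicly documented one, but every file name and URL is a placeholder; the real links were in the original post:

```bash
# Upload the zipped dataset from the local machine
# (litterbox.catbox.moe allows files up to 1 GB).
curl -F "reqtype=fileupload" -F "fileToUpload=@dataset.zip" \
     https://catbox.moe/user/api.php

# On the RunPod side, fetch the dataset with aria2 and unzip it.
# The URL stands in for whatever link catbox returned.
aria2c -x 16 "https://files.catbox.moe/XXXXXX.zip" -o dataset.zip
unzip dataset.zip -d Data/
```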
I download the gofile upload script file.
I give the script execute permissions.
I upload the .pth file to https://gofile.io/.
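A hedged sketch of these last two steps; both the script name and the checkpoint path are placeholders, since the actual script link was attached in the original post:

```bash
# Make the upload script executable, then push the fine-tuned
# checkpoint to gofile.io (file names are placeholders).
chmod +x upload.sh
./upload.sh Models/ft/epoch_2nd_00099.pth
```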
---
Does anyone have an example of a good fine-tuned StyleTTS2 model?
The only one I can find is the LJSpeech model, which sounds really good! But I'm wondering what some other narrators/speakers would sound like, especially voices further outside the training dataset. Thanks, and awesome work on this.