Capacitron #977
Update: I've just run the first training and it is slightly off. I suspect there's an error in the loss calculation because of the reorganisations from @Edresson a few months back. I'm investigating that today and will push new commits.
Update: the first training run gave some promising results. I'm now adding a step-wise gradual LR scheduler (unlike the Noam scheduler, it takes hard-coded step thresholds and learning rates), which proved necessary in my previous implementation. More updates to follow. :)
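For anyone curious, a step-wise gradual scheduler of this kind can be sketched roughly as follows. This is a minimal illustration, not the code in this PR; the threshold and LR values are made up:

```python
# Minimal sketch of a step-wise gradual LR scheduler (illustrative only;
# the actual implementation and values in this PR may differ).
# `schedule` is a list of (step_threshold, lr) pairs sorted by threshold:
# once training reaches a threshold, the corresponding LR takes effect.

def stepwise_lr(step, schedule):
    """Return the learning rate active at `step`."""
    lr = schedule[0][1]
    for threshold, value in schedule:
        if step >= threshold:
            lr = value
    return lr

# Example hard-coded schedule (made-up values)
schedule = [(0, 1e-3), (10_000, 5e-4), (50_000, 1e-4)]

# In a training loop this would be applied every step, e.g. with PyTorch:
# for group in optimizer.param_groups:
#     group["lr"] = stepwise_lr(global_step, schedule)
```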
capacitron_inference.py
'''
This will be deleted later, only for dev, to see how to infer the capacitron model
This file will be deleted before the merge; it's only a script showing how to run inference with the model, for others experimenting with this work.
- added reference wav and text args for posterior inference
- some formatting
@erogol from my side this is ready to go. 😊 Big thanks to @WeberJulian for all the help!
Just waiting for my LJSpeech T2 Capacitron run to converge. If it does, I'll merge both this PR and coqui-ai/Trainer#26
Trainings have not been converging since the reorganisation of the previous two weeks. Reporting back here soon. Edit: the commits below fixed all issues.
Update CI badges
FINALLY !!!! 🚀
This PR implements a new model in 🐸 TTS based on the Capacitron model from Google. It's a partial implementation of the models detailed in the paper; hierarchical latent embeddings are still to be done - this is a TODO for later. If you'd like to get an idea of what the model does and how it works, here's a post I did a few months ago.
I have implemented this model as part of my Master's Thesis at TU Berlin. The thesis itself is a detailed report on the implementation and subjective evaluation of this model. You can read my thesis and listen to some samples here. I'm in the process of creating a website with audio samples from my pretrained models and the uploaded thesis as well - this is WIP.
I implemented this model in an earlier version (March 2020) of 🐸 TTS, so this new "re-implementation" still needs to be tested. I'm in that process right now; however, I wanted to open this PR already to discuss some of the ways the Trainer API needs to be adjusted to accommodate the model.

TODOs:
- Implement 2 model-specific methods (Trainer#26)
- Attention: `graves` worked, `original` doesn't work with Capacitron, and `dynamic_convolution` only works with Tacotron 2
  - Use Graves with T1
  - Use DCA with T2
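The attention constraints above can be expressed in the model configs. A hedged sketch; the field name `attention_type` and its string values follow how 🐸 TTS Tacotron configs expose attention, but treat the snippet as illustrative rather than the exact config used here:

```python
# Illustrative attention settings implied by the TODO list above.
# The `attention_type` key mirrors 🐸 TTS Tacotron configs; this is a sketch,
# not the final Capacitron configuration.
tacotron1_capacitron = {"attention_type": "graves"}               # Graves with T1
tacotron2_capacitron = {"attention_type": "dynamic_convolution"}  # DCA with T2
```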
@erogol @Edresson @WeberJulian I'd appreciate it if you could review the changes and discuss some of the specifics in the code.