VITS voice conversion fails [Bug] #1672
Comments
Hello, can you try the latest 🐸 TTS version for generating the output wav? You can use the existing trained model as it is. Thanks.
Can you confirm the number of speakers in your dataset?
My dataset has one speaker, and the following is the query result from `tts --list_speaker_idxs`: [attachment not captured]
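For context, listing the speaker ids of a trained model looks roughly like this (the paths below are placeholders; `--list_speaker_idxs` is the flag quoted above):

```bash
# List the speaker ids a trained multi-speaker model knows about; with a
# single-speaker dataset this prints a mapping with only one entry.
tts --model_path /path/to/best_model.pth \
    --config_path /path/to/config.json \
    --list_speaker_idxs
```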
Can you try synthesising a text?
I used the following params: [attachment not captured]
I guess: [attachment not captured]
I will need to look into it when I get home (on my phone right now). I think there might be a problem because of having just a single speaker in training. Not sure though; I could be wrong.
OK, thanks. Looking forward to your answer.
I tried to modify the code; it now runs successfully, but the result of the voice conversion is very poor. Maybe I made a mistake. The following is my change:
Change to: [attachment not captured]
Change to: [attachment not captured]
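The actual diff is not captured above. For orientation, the relevant area is `Vits.voice_conversion` in `TTS/tts/models/vits.py`, sketched below from the upstream code around the time of this issue (a reconstruction, not the user's diff; `F` is `torch.nn.functional`, `self.emb_g` is the internal speaker embedding table):

```python
# Hedged sketch of the conditioning logic in Vits.voice_conversion;
# details may differ between TTS versions.
if self.args.use_speaker_embedding and not self.args.use_d_vector_file:
    # Internal speaker embedding table: expects integer speaker ids.
    # Passing a float reference embedding here raises the reported
    # "Expected tensor for argument #1 'indices'..." RuntimeError.
    g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)
    g_tgt = self.emb_g(speaker_cond_tgt).unsqueeze(-1)
elif not self.args.use_speaker_embedding and self.args.use_d_vector_file:
    # External d-vectors: float tensors, just normalized.
    g_src = F.normalize(speaker_cond_src).unsqueeze(-1)
    g_tgt = F.normalize(speaker_cond_tgt).unsqueeze(-1)
```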
Good catch!
Reverting the first file change, I still get the exception: [traceback not captured]
I see. I would even try to print "g_src" (both with the change reverted and after your change) and check its shape/type to debug further. I think it must be a case of training a single speaker on a multi-speaker model (because I just ran my multi-speaker VITS and it worked fine).
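A self-contained version of the suggested probe might look like this (the tensor values are illustrative; in the real code you would print whatever is about to reach `self.emb_g`):

```python
# Print enough to tell an integer speaker id apart from a float d-vector.
import torch

def describe_speaker_cond(name: str, cond: torch.Tensor) -> None:
    # An id shows up as torch.long with shape (1,); a d-vector as
    # torch.float32 with shape (1, embedding_dim).
    print(f"{name}: dtype={cond.dtype}, shape={tuple(cond.shape)}")

describe_speaker_cond("g_src cond (id case)", torch.tensor([3]))
describe_speaker_cond("g_src cond (d-vector case)", torch.randn(1, 256))
```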
I have used the VCTK dataset for the test and still get the error: [traceback not captured]
This is the train.py code: [attachment not captured]
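The script itself is not captured. For orientation, a multi-speaker VITS training script with internal speaker embeddings (the setup discussed in this issue) looks roughly like the 🐸 TTS VCTK recipe sketched below; exact imports and fields vary between versions, so treat it as an outline, not the user's file:

```python
# Sketch of a multi-speaker VITS recipe with use_speaker_embedding=True,
# modeled on the Coqui TTS VCTK recipe; API details differ across versions.
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(name="vctk", meta_file_train="", path="VCTK/")

config = VitsConfig(
    model_args=VitsArgs(use_speaker_embedding=True),  # internal speaker embeddings
    run_name="vits_vctk",
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# Map speaker names to the integer ids consumed by the embedding layer.
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)
Trainer(
    TrainerArgs(), config, output_path,
    model=model, train_samples=train_samples, eval_samples=eval_samples,
).fit()
```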
When I use `--text`, my env works fine too! When I use `--reference_wav 006637.wav`, I get the above exception. Have you tried using `--reference_wav`?
I have tried the VCTK dataset and still got the exception!
Yes, I finally got a chance to try it on my PC. I can confirm this bug as well. I believe instead of a "reference_speaker_idx", we happen to pass "reference_embedding" when calculating g_src.
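That diagnosis matches the error text exactly. A self-contained illustration in plain PyTorch, independent of the TTS code (sizes are made up):

```python
# nn.Embedding lookups require integer indices. Passing a float speaker
# embedding (d-vector) where a speaker id is expected reproduces the error
# reported in this issue.
import torch
import torch.nn as nn

emb_g = nn.Embedding(num_embeddings=10, embedding_dim=256)  # speaker table, like VITS's emb_g

reference_speaker_idx = torch.tensor([3])  # integer speaker id -> LongTensor
reference_embedding = torch.randn(1, 256)  # float d-vector

g_src = emb_g(reference_speaker_idx).unsqueeze(-1)  # OK: indices are Long
# emb_g(reference_embedding)  # raises: RuntimeError: Expected tensor for
# argument #1 'indices' to have one of the following scalar types: Long, Int;
# but got torch.FloatTensor instead (while checking arguments for embedding)
```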
But the result of the voice conversion is very poor; maybe I made a mistake. Looking forward to the official fix.
Hi @vinson-zhang, can you check/test my PR? Thanks.
OK, I'll try to test it.
I have tried the commands. The first and second executed successfully, but the last one failed. The result of the voice conversion is still very poor. However, the result generated by … [rest of the comment not captured]
Try training a model with a multi-speaker dataset; then voice cloning among those speakers works really well. It is a very new area of research, and I do not know how to improve it right now. If I find something, I will let you know.
Hello, I fixed the tests again (now 4 in total). Can you do a final check? Thanks a lot. 😃
OK, thanks.
All four commands can be executed successfully! 👍
@vinson-zhang Are you trying to use voice conversion with a model trained with only one speaker?
Let me try the multi-speaker dataset and see what happens.
Thanks @lexkoro and @p0p4k. Agreed with @lexkoro that to be able to do voice conversion you must train a model with a minimum of 2 speakers. I do not see any application where you need to do voice conversion using just one speaker (you can have just one speaker in a target language, but you will need more speakers in other languages to get useful results). Voice conversion inference with a model trained on just one speaker is not supposed to work and is not even supported (because I did not see any application for it). If you plan to do voice conversion, you need a minimum of 2 speakers for all possible applications.
In this case, successful voice conversion is not the expected outcome. You are trying to use the speaker encoder with a model trained with internal speaker embeddings (`use_speaker_embedding=True`). You need to provide `--reference_speaker_idx`; otherwise, the model will try to extract the speaker embedding using the speaker encoder.
In addition, you must train your model with more than one speaker; otherwise, your model will only be able to generate the voice of a single speaker (which makes it useless for voice conversion). Please try the command above using a multi-speaker model.
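In other words, for a model trained with `use_speaker_embedding=True` the reference speaker must be named explicitly. A hedged sketch of such an invocation (all paths and speaker names below are placeholders):

```bash
# Voice conversion with internal speaker embeddings: both the reference and
# the target speaker are identified by id, so no speaker encoder is needed.
tts --model_path /path/to/best_model.pth \
    --config_path /path/to/config.json \
    --reference_wav source_utterance.wav \
    --reference_speaker_idx "p225" \
    --speaker_idx "p226" \
    --out_path converted.wav
```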
@Edresson It looks like it is not a bug, right?
Yeah, it is not a bug.
@Edresson I'm trying to convert a speaker outside the training set to a speaker inside the training set. What should I do?
It is not currently supported. However, you can write your own code, as in the YourTTS Colab demos, where it is possible to do what you want.
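For reference, the YourTTS demos do this by computing a d-vector for the unseen speaker with a separately trained speaker encoder. A hedged sketch follows; the method and argument names follow Coqui TTS around the time of this issue and have changed across versions, and all paths are placeholders:

```python
# Sketch of the YourTTS-demo approach: embed an out-of-training-set speaker
# with the speaker encoder, then use the d-vector to condition conversion.
# This applies to models trained with d-vectors, not internal embeddings.
from TTS.tts.utils.speakers import SpeakerManager

encoder_manager = SpeakerManager(
    encoder_model_path="model_se.pth",     # speaker-encoder checkpoint (placeholder)
    encoder_config_path="config_se.json",  # matching config (placeholder)
)

# d-vector for the unseen source speaker (any wav outside the training set);
# in later TTS versions this method is named compute_embedding_from_clip.
src_dvec = encoder_manager.compute_d_vector_from_clip("external_speaker.wav")
tgt_dvec = encoder_manager.compute_d_vector_from_clip("target_speaker.wav")
# src_dvec / tgt_dvec can then be fed to the model's voice-conversion path.
```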
Describe the bug
The following error occurs when I use VITS for voice conversion:
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)
To Reproduce
Voice conversion command: [attachment not captured]
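A hypothetical reconstruction of what the failing command likely looked like, based on the flags quoted later in the thread (`006637.wav` comes from the thread; everything else is a placeholder):

```bash
# Reconstructed, not the reporter's exact command. The missing
# --reference_speaker_idx makes the CLI fall back to the speaker encoder and
# pass a float embedding to the speaker-id embedding layer, triggering the
# RuntimeError above.
tts --model_path /path/to/best_model.pth \
    --config_path /path/to/config.json \
    --reference_wav 006637.wav \
    --speaker_idx "speaker_0" \
    --out_path output.wav
```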
Expected behavior
Voice conversion succeeds.
Logs
Environment
Additional context
No response