vits voice conversion fail [Bug] #1672

Closed
vinson-zhang opened this issue Jun 20, 2022 · 37 comments
Labels
bug Something isn't working

Comments

@vinson-zhang

vinson-zhang commented Jun 20, 2022

Describe the bug

The following error occurs when I use VITS for voice conversion:

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)

To Reproduce

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseAudioConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, CharactersConfig, VitsArgs
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.speakers import SpeakerManager

output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(
    name="baker_old_2", path="/datasets/temp-bznsyp", language="zh-cn"
)
audio_config = BaseAudioConfig(
    sample_rate=48000,
    win_length=1024,
    hop_length=256,
    num_mels=80,
    preemphasis=0.0,
    ref_level_db=20,
    log_func="np.log",
    do_trim_silence=True,
    trim_db=45,
    mel_fmin=0,
    mel_fmax=None,
    spec_gain=1.0,
    signal_norm=False,
    do_amp_to_db_linear=False,
)

vitsArgs = VitsArgs(
    use_speaker_embedding=True,
    use_sdp=False,
    use_speaker_encoder_as_loss=True,
    speaker_encoder_config_path="/TTS/models/tts_models--multilingual--multi-dataset--your_tts/config_se.json",
    speaker_encoder_model_path="/TTS/models/tts_models--multilingual--multi-dataset--your_tts/model_se.pth",
)

config = VitsConfig(
    model_args=vitsArgs,
    audio=audio_config,
    run_name="vits_baker_temp",
    batch_size=48,
    eval_batch_size=24,
    batch_group_size=5,
    num_loader_workers=0,
    num_eval_loader_workers=8,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="chinese_mandarin_cleaners",
    use_phonemes=True,
    phoneme_language="zh-cn",
    phonemizer="zh_cn_phonemizer",
    add_blank=False,
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=False,
    print_step=25,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    characters=CharactersConfig(
        characters_class=None,
        vocab_dict=None,
        pad="_",
        eos="~",
        bos="^",
        blank=None,
        characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),.:;? ",
        punctuations="\uff0c\u3002\uff1f\uff01\uff5e\uff1a\uff1b*\u2014\u2014-\uff08\uff09\u3010\u3011!'(),-.:;? “”",
        phonemes="12345giy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u025a\u02de\u026b",
        is_unique=False,
        is_sorted=True
    ),
    test_sentences=[
        ["你在做什么?", "baker", None, "zh-cn"],
        ["篮球场上没有人", "baker", None, "zh-cn"],
    ],
)

# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# config is updated with the default characters if not defined in the config.
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

speaker_manager = SpeakerManager()
speaker_manager.use_cuda = True
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

# init model
model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)

# init the trainer and run the training
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()

voice conversion command:

tts  --model_path ./vits_baker_temp-June-20-2022_02+48PM-0000000/best_model.pth --config_path ./vits_baker_temp-June-20-2022_02+48PM-0000000/config.json --speaker_idx "baker" --out_path output.wav --reference_wav  006637.wav

Expected behavior

voice conversion success!

Logs

/opt/conda/lib/python3.8/site-packages/torch/functional.py:695: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  ../aten/src/ATen/native/SpectralOps.cpp:798.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 33, in <module>
    sys.exit(load_entry_point('TTS', 'console_scripts', 'tts')())
  File "/TTS/TTS/bin/synthesize.py", line 309, in main
    wav = synthesizer.tts(
  File "/TTS/TTS/utils/synthesizer.py", line 339, in tts
    outputs = transfer_voice(
  File "/TTS/TTS/tts/utils/synthesis.py", line 304, in transfer_voice
    model_outputs = _func(reference_wav, speaker_id, d_vector, reference_speaker_id, reference_d_vector)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/TTS/TTS/tts/models/vits.py", line 1140, in inference_voice_conversion
    wav, _, _ = self.voice_conversion(y, y_lengths, speaker_cond_src, speaker_cond_tgt)
  File "/TTS/TTS/tts/models/vits.py", line 1157, in voice_conversion
    g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu113",
        "TTS": "0.6.2",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021"
    }
}

Additional context

No response

vinson-zhang added the bug (Something isn't working) label on Jun 20, 2022
@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Hello, can you try the latest 🐸 TTS version for generating the output wav? You can use the existing trained model as it is. Thanks.
If it does not work, then we might have to debug why the embedding layer is getting float input indices instead of long/int.
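
To illustrate the error being debugged here, a minimal standalone sketch (not code from the issue or the library): torch.nn.Embedding only accepts integer index tensors, so feeding it a float tensor reproduces exactly this RuntimeError, while a long tensor works.

import torch
import torch.nn as nn

# Toy speaker embedding table: 1 speaker, 256-dim vectors (sizes are arbitrary here).
emb_g = nn.Embedding(num_embeddings=1, embedding_dim=256)

speaker_idx_long = torch.tensor([0])     # integer index tensor, what the embedding expects
speaker_idx_float = torch.tensor([0.0])  # float tensor, like the one reaching the failing call

g_src = emb_g(speaker_idx_long).unsqueeze(-1)  # works, shape [1, 256, 1]
print(g_src.shape)

# emb_g(speaker_idx_float)  # raises: Expected tensor for argument #1 'indices' to have
#                           # one of the following scalar types: Long, Int; but got
#                           # torch.FloatTensor instead ...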

@vinson-zhang
Author

vinson-zhang commented Jun 21, 2022

I have tried the latest version [71281ff] and still have the same problem! @p0p4k

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Can you confirm the number of speakers in your dataset?

@vinson-zhang
Author

vinson-zhang commented Jun 21, 2022

My dataset has one speaker, and the following is the query result from tts --list_speaker_idxs:

 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Using Griffin-Lim as no vocoder model defined
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{'baker': 0}

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Can you try synthesizing a text?
tts --text "some text" --speaker_idx ... --output..

@vinson-zhang
Author

I use the params --speaker_idx "baker" --reference_wav 006637.wav.
I think self.emb_g should not be used in TTS/tts/models/vits.py:1159: g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)?

@vinson-zhang
Author

tts --text "some text" --speaker_idx works fine, and get a correct wav result!

@vinson-zhang
Author

I guess self.emb_g can only be used when the --reference_speaker_idx param is passed.

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

I will need to look into it when I get home (on my phone right now). I think there might be a problem because of having just a single speaker in training. Not sure though, I could be wrong.

@vinson-zhang
Author

OK, thanks, I look forward to your answer.

@vinson-zhang
Author

I tried to modify the code; it runs successfully now, but the result of the voice conversion is very poor! Maybe I made a mistake! The following are my changes:

  1. First change (TTS/tts/models/vits.py:1156):
    if self.args.use_speaker_embedding and not self.args.use_d_vector_file:
          g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)
          g_tgt = self.emb_g(speaker_cond_tgt).unsqueeze(-1)

Change To:

      if self.args.use_speaker_embedding and not self.args.use_d_vector_file:
          g_src = F.normalize(speaker_cond_src).unsqueeze(-1)
          g_tgt = self.emb_g(speaker_cond_tgt).unsqueeze(-1)
  2. Second change (TTS.tts.utils.synthesis.id_to_torch):
def id_to_torch(aux_id, cuda=False):
    if aux_id is not None:
        aux_id = np.asarray(aux_id)
        aux_id = torch.from_numpy(aux_id)
    if cuda:
        return aux_id.cuda()
    return aux_id

Change To:

def id_to_torch(aux_id, cuda=False):
    if aux_id is not None:
        aux_id = np.asarray([aux_id])
        aux_id = torch.from_numpy(aux_id)
    if cuda:
        return aux_id.cuda()
    return aux_id
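
To make the effect of the second change concrete, here is a small standalone sketch (not part of the issue itself): wrapping aux_id in a list turns the 0-dimensional tensor produced by the original code into a 1-d index tensor, which is the shape nn.Embedding expects for a batch of one speaker id.

import numpy as np
import torch

aux_id = 0  # an integer speaker id, e.g. from {'baker': 0}

before = torch.from_numpy(np.asarray(aux_id))   # tensor(0)   -> 0-d tensor, shape ()
after = torch.from_numpy(np.asarray([aux_id]))  # tensor([0]) -> 1-d tensor, shape (1,)

print(before.shape, after.shape)  # torch.Size([]) torch.Size([1])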

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Good catch!
Try changing only the 2nd file. If you revert the first file's change, does it work? Thanks.

@vinson-zhang
Author

After reverting the first file's change, I still get the exception:

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

I see. I would also try to print "g_src" (both with your change reverted and with it applied) and check its shape/type to debug further. I think it must be a case of training a single speaker on a multi-speaker model (because I just ran my multi-speaker VITS and it worked fine).

@vinson-zhang
Author

vinson-zhang commented Jun 21, 2022

I have used the VCTK dataset for a test and still get the error:


/opt/conda/lib/python3.8/site-packages/torch/functional.py:695: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  ../aten/src/ATen/native/SpectralOps.cpp:798.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 33, in <module>
    sys.exit(load_entry_point('TTS', 'console_scripts', 'tts')())
  File "/TTS/TTS/bin/synthesize.py", line 309, in main
    wav = synthesizer.tts(
  File "/TTS/TTS/utils/synthesizer.py", line 339, in tts
    outputs = transfer_voice(
  File "/TTS/TTS/tts/utils/synthesis.py", line 304, in transfer_voice
    model_outputs = _func(reference_wav, speaker_id, d_vector, reference_speaker_id, reference_d_vector)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/TTS/TTS/tts/models/vits.py", line 1140, in inference_voice_conversion
    wav, _, _ = self.voice_conversion(y, y_lengths, speaker_cond_src, speaker_cond_tgt)
  File "/TTS/TTS/tts/models/vits.py", line 1157, in voice_conversion
    g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)


@vinson-zhang
Author

This is the train.py code:

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseAudioConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, CharactersConfig, VitsArgs
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.speakers import SpeakerManager

output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(
    name="vctk_old", path="/datasets/temp_vctk", language="en-us"
)
audio_config = BaseAudioConfig(
    sample_rate=48000,
    win_length=1024,
    hop_length=256,
    num_mels=80,
    preemphasis=0.0,
    ref_level_db=20,
    log_func="np.log",
    do_trim_silence=True,
    trim_db=45,
    mel_fmin=0,
    mel_fmax=None,
    spec_gain=1.0,
    signal_norm=False,
    do_amp_to_db_linear=False,
)


vitsArgs = VitsArgs(
    use_speaker_embedding=True,
    use_sdp=False,
    use_speaker_encoder_as_loss=True,
    speaker_encoder_config_path="/TTS/models/tts_models--multilingual--multi-dataset--your_tts/config_se.json",
    speaker_encoder_model_path="/TTS/models/tts_models--multilingual--multi-dataset--your_tts/model_se.pth",
    speaker_embedding_channels=512,
)


config = VitsConfig(
    model_args=vitsArgs,
    audio=audio_config,
    run_name="vits_vctk",
    batch_size=32,
    eval_batch_size=16,
    batch_group_size=5,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=False,
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    max_text_len=325,  # change this if you have a larger VRAM than 16GB
    output_path=output_path,
    datasets=[dataset_config],
    test_sentences=[
        ["What are you doing?", "VCTK_old_p225", None, "en-us"],
        ["My name is mike. I'm not fine!", "VCTK_old_p226", None, "en-us"],
    ],
)

# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# config is updated with the default characters if not defined in the config.
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

speaker_manager = SpeakerManager()
speaker_manager.use_cuda = True
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

# init model
model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)

# init the trainer and run the training
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()


@p0p4k
Contributor

p0p4k commented Jun 21, 2022

I see. I would also try to print "g_src" (both with your change reverted and with it applied) and check its shape/type to debug further. I think it must be a case of training a single speaker on a multi-speaker model (because I just ran my multi-speaker VITS and it worked fine).

Did you try this?

Btw, I used the pre-trained model (vctk/vits); it works fine on my side.

@vinson-zhang
Author

When using --text, my env works fine too!

When using --reference_wav 006637.wav, I get the above exception!

Have you tried using --reference_wav?

@vinson-zhang
Author

I have tried the VCTK dataset and still got the exception!

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Yes, I finally got a chance to try it on my PC. I can confirm this bug as well. I believe that instead of a "reference_speaker_idx", we happen to pass the "reference_embedding" when calculating g_src.
Your temporary fix seems to work for now. 👍
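
For readers following along, here is a simplified standalone sketch of the two conditioning paths involved, modeled on the voice_conversion code quoted in this thread rather than the exact library source: an integer speaker index has to go through the emb_g lookup table, while a float d-vector from an external speaker encoder is used directly, so routing the embedding down the index path produces the FloatTensor error above.

import torch
import torch.nn as nn
import torch.nn.functional as F

def speaker_conditioning(emb_g, speaker_cond, use_speaker_embedding, use_d_vector_file):
    """Simplified sketch of how source/target speaker conditioning is built."""
    if use_speaker_embedding and not use_d_vector_file:
        # speaker_cond must be an integer index tensor here, e.g. torch.tensor([0])
        return emb_g(speaker_cond).unsqueeze(-1)
    if use_d_vector_file:
        # speaker_cond is already a float embedding, e.g. shape [1, 256]
        return F.normalize(speaker_cond).unsqueeze(-1)
    raise ValueError("no speaker conditioning configured")

emb_g = nn.Embedding(2, 256)
g_from_idx = speaker_conditioning(emb_g, torch.tensor([0]), True, False)     # OK
g_from_dvec = speaker_conditioning(emb_g, torch.randn(1, 256), False, True)  # OK
# speaker_conditioning(emb_g, torch.randn(1, 256), True, False)  # -> the reported RuntimeError
print(g_from_idx.shape, g_from_dvec.shape)  # torch.Size([1, 256, 1]) torch.Size([1, 256, 1])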

@vinson-zhang
Author

But the result of voice conversion is very poor; maybe I made a mistake. I look forward to the official fix.

@p0p4k
Contributor

p0p4k commented Jun 22, 2022

  • Note: For anyone joining this discussion now and encountering the same bug, ignore the discussion above. Many things I said above seem to be highly inaccurate and misleading.
  • I was not well informed about the internals, and I am actively trying to understand the bug in order to fix it.
  • Before voice conversion was officially patched, I was just using d-vectors computed from wavs of target speakers stored in a separate folder, using my own custom function. The code that I used was more or less the same as the current voice_conversion function, with no way of using embeddings.
  • So, I might make a new post with a better understanding of the current official code soon, unless some higher-ups decide to chime in. Thanks.

@p0p4k
Contributor

p0p4k commented Jun 22, 2022

Hi @vinson-zhang, can you check/test my PR? Thanks.
I tried it on my trained VITS model for another language and it works fine.
However, my speaker encoder is 512-dimensional and the VCTK-based tts-model/VITS uses 256, so you can help me test it.
Below are some of the combinations you can try, for your convenience:
Thanks!

# format -> tgt - ref 
# idx - wav 
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_idx $'<speaker name>'

# wav - wav
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

# wav - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav> --text "random testing text."

# idx - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --text "random testing text."  --speaker_idx $'<speaker name>'

@vinson-zhang
Author

Ok, I'll try to test it

@vinson-zhang
Author

vinson-zhang commented Jun 22, 2022

Hi @vinson-zhang, can you check/test my PR? Thanks. I tried it on my trained VITS model for another language and it works fine. However, my speaker encoder is 512-dimensional and the VCTK-based tts-model/VITS uses 256, so you can help me test it. Below are some of the combinations you can try, for your convenience: Thanks!


tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_idx $'<speaker name>'

tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

I have tried the commands. The first and second execute successfully, but the last one failed. However, the result of voice conversion is still very poor, while the result generated by --text is relatively good.
Is there any way to optimize it?

@p0p4k
Contributor

p0p4k commented Jun 22, 2022

Try training a model with a multi-speaker dataset; then voice cloning among those speakers works really well. It is a very new area of research and I do not know how to improve it right now. If I find something, I will let you know.

@p0p4k
Contributor

p0p4k commented Jun 22, 2022

Hi @vinson-zhang, can you check/test my PR? Thanks. I tried it on my trained VITS model for another language and it works fine. However, my speaker encoder is 512-dimensional and the VCTK-based tts-model/VITS uses 256, so you can help me test it. Below are some of the combinations you can try, for your convenience: Thanks!

# format -> tgt - ref 
# idx - wav 
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_idx $'<speaker name>'

# wav - wav
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

# wav - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav> --text "random testing text."

# idx - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --text "random testing text."  --speaker_idx $'<speaker name>'

Hello, I fixed the tests again (now total 4). Can you do a final check? Thanks a lot. 😃

@vinson-zhang
Author

Try training a model with a multi-speaker dataset; then voice cloning among those speakers works really well. It is a very new area of research and I do not know how to improve it right now. If I find something, I will let you know.

OK Thanks

@vinson-zhang
Author

Hi @vinson-zhang, can you check/test my PR? Thanks. I tried it on my trained VITS model for another language and it works fine. However, my speaker encoder is 512-dimensional and the VCTK-based tts-model/VITS uses 256, so you can help me test it. Below are some of the combinations you can try, for your convenience: Thanks!

# format -> tgt - ref 
# idx - wav 
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_idx $'<speaker name>'

# wav - wav
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

# wav - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav> --text "random testing text."

# idx - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --text "random testing text."  --speaker_idx $'<speaker name>'

Hello, I fixed the tests again (now total 4). Can you do a final check? Thanks a lot. 😃

All four commands can be executed successfully! 👍

@lexkoro
Collaborator

lexkoro commented Jun 22, 2022

@vinson-zhang Are you trying to use voice conversion with a model trained on only one speaker?
If yes, then you won't get good results from it.
You will have to train on many more speakers so that the model learns a certain variety of speech features.

@vinson-zhang
Author

Let me try the multi-speaker dataset and see what happens.

@Edresson
Contributor

My dataset has one speaker, and the following is the query result from tts --list_speaker_idxs:

 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Using Griffin-Lim as no vocoder model defined
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{'baker': 0}

@vinson-zhang Are you trying to use voice conversion with a model trained on only one speaker? If yes, then you won't get good results from it. You will have to train on many more speakers so that the model learns a certain variety of speech features.

Thanks @lexkoro and @p0p4k. Agreed with @lexkoro that to be able to do voice conversion you must train a model with a minimum of 2 speakers. I do not see any application where you need to do voice conversion using just one speaker (you can have just one speaker in a target language, but you will need more speakers in other languages to be able to get useful results). Voice conversion inference is not supposed to be compatible with a model trained with just one speaker; it is not even supported (because I didn't see any application for it). If you plan to do voice conversion, you need a minimum of 2 speakers for all possible applications.

@Edresson
Contributor

Edresson commented Jun 27, 2022

(quoting the original issue report in full; see the top of this thread)

In this case, the expected result is not voice conversion success. You are trying to use the speaker encoder with a model trained with internal speaker embeddings (use_speaker_embedding=True). You need to provide --reference_speaker_idx; otherwise, the model will try to extract the speaker embedding using the speaker encoder.

tts  --model_path ./vits_baker_temp-June-20-2022_02+48PM-0000000/best_model.pth --config_path ./vits_baker_temp-June-20-2022_02+48PM-0000000/config.json --speaker_idx "baker" --out_path output.wav --reference_wav  006637.wav --reference_speaker_idx  "baker" 

In addition, you must train your model with more than one speaker; otherwise, your model will only be able to generate the voice of one speaker (and then it is useless for voice conversion). Please try the command above using a multi-speaker model.

@erogol
Member

erogol commented Jul 11, 2022

@Edresson looks like it is not a bug, right?

@Edresson
Contributor

@Edresson looks like it is not a bug, right?

Yeah, it is not a bug.

@vinson-zhang
Author

vinson-zhang commented Jul 15, 2022

@Edresson I'm trying to convert a speaker outside the training set to a speaker inside the training set. What should I do?

@Edresson
Contributor

@Edresson I'm trying to convert a speaker outside the training set to a speaker inside the training set. What should I do?

It is not currently supported. However, you can write your own code, as in the YourTTS Colab demos, where it is possible to do what you want.
