vits voice conversion fail [Bug] #1672

Closed
vinson-zhang opened this issue Jun 20, 2022 · 37 comments
Labels
bug Something isn't working

Comments

@vinson-zhang

vinson-zhang commented Jun 20, 2022

Describe the bug

The following error occurs when I use VITS for voice conversion:

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)

To Reproduce

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseAudioConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, CharactersConfig, VitsArgs
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.speakers import SpeakerManager

output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(
    name="baker_old_2", path="/datasets/temp-bznsyp", language="zh-cn"
)
audio_config = BaseAudioConfig(
    sample_rate=48000,
    win_length=1024,
    hop_length=256,
    num_mels=80,
    preemphasis=0.0,
    ref_level_db=20,
    log_func="np.log",
    do_trim_silence=True,
    trim_db=45,
    mel_fmin=0,
    mel_fmax=None,
    spec_gain=1.0,
    signal_norm=False,
    do_amp_to_db_linear=False,
)

vitsArgs = VitsArgs(
    use_speaker_embedding=True,
    use_sdp=False,
    use_speaker_encoder_as_loss=True,
    speaker_encoder_config_path="/TTS/models/tts_models--multilingual--multi-dataset--your_tts/config_se.json",
    speaker_encoder_model_path="/TTS/models/tts_models--multilingual--multi-dataset--your_tts/model_se.pth",
)

config = VitsConfig(
    model_args=vitsArgs,
    audio=audio_config,
    run_name="vits_baker_temp",
    batch_size=48,
    eval_batch_size=24,
    batch_group_size=5,
    num_loader_workers=0,
    num_eval_loader_workers=8,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="chinese_mandarin_cleaners",
    use_phonemes=True,
    phoneme_language="zh-cn",
    phonemizer="zh_cn_phonemizer",
    add_blank=False,
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=False,
    print_step=25,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    characters=CharactersConfig(
        characters_class=None,
        vocab_dict=None,
        pad="_",
        eos="~",
        bos="^",
        blank=None,
        characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),.:;? ",
        punctuations="\uff0c\u3002\uff1f\uff01\uff5e\uff1a\uff1b*\u2014\u2014-\uff08\uff09\u3010\u3011!'(),-.:;? “”",
        phonemes="12345giy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u025a\u02de\u026b",
        is_unique=False,
        is_sorted=True
    ),
    test_sentences=[
        ["你在做什么?", "baker", None, "zh-cn"],
        ["篮球场上没有人", "baker", None, "zh-cn"],
    ],
)

# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# config is updated with the default characters if not defined in the config.
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

speaker_manager = SpeakerManager()
speaker_manager.use_cuda = True
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

# init model
model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)

# init the trainer and run the training
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()

voice conversion command:

tts  --model_path ./vits_baker_temp-June-20-2022_02+48PM-0000000/best_model.pth --config_path ./vits_baker_temp-June-20-2022_02+48PM-0000000/config.json --speaker_idx "baker" --out_path output.wav --reference_wav  006637.wav

Expected behavior

voice conversion success!

Logs

/opt/conda/lib/python3.8/site-packages/torch/functional.py:695: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  ../aten/src/ATen/native/SpectralOps.cpp:798.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 33, in <module>
    sys.exit(load_entry_point('TTS', 'console_scripts', 'tts')())
  File "/TTS/TTS/bin/synthesize.py", line 309, in main
    wav = synthesizer.tts(
  File "/TTS/TTS/utils/synthesizer.py", line 339, in tts
    outputs = transfer_voice(
  File "/TTS/TTS/tts/utils/synthesis.py", line 304, in transfer_voice
    model_outputs = _func(reference_wav, speaker_id, d_vector, reference_speaker_id, reference_d_vector)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/TTS/TTS/tts/models/vits.py", line 1140, in inference_voice_conversion
    wav, _, _ = self.voice_conversion(y, y_lengths, speaker_cond_src, speaker_cond_tgt)
  File "/TTS/TTS/tts/models/vits.py", line 1157, in voice_conversion
    g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu113",
        "TTS": "0.6.2",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.8.12",
        "version": "#91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021"
    }
}

Additional context

No response

vinson-zhang added the bug (Something isn't working) label on Jun 20, 2022
@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Hello, can you try the latest 🐸 TTS version for generating the output wav? You can use the existing trained model as it is. Thanks.
If it does not work, then we might have to debug why the embedding layer is getting float input indices instead of long/int.
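
To illustrate the error being debugged here, a minimal standalone sketch (not code from the issue or the library): torch.nn.Embedding only accepts integer index tensors, so feeding it a float tensor reproduces exactly this RuntimeError, while a long tensor works.

import torch
import torch.nn as nn

# Toy speaker embedding table: 1 speaker, 256-dim vectors (sizes are arbitrary here).
emb_g = nn.Embedding(num_embeddings=1, embedding_dim=256)

speaker_idx_long = torch.tensor([0])     # integer index tensor, what the embedding expects
speaker_idx_float = torch.tensor([0.0])  # float tensor, like the one reaching the failing call

g_src = emb_g(speaker_idx_long).unsqueeze(-1)  # works, shape [1, 256, 1]
print(g_src.shape)

# emb_g(speaker_idx_float)  # raises: Expected tensor for argument #1 'indices' to have
#                           # one of the following scalar types: Long, Int; but got
#                           # torch.FloatTensor instead ...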

@vinson-zhang
Author

vinson-zhang commented Jun 21, 2022

I have tried the latest version [71281ff] and still have the same problem! @p0p4k

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Can you confirm the number of speakers in your dataset?

@vinson-zhang
Author

vinson-zhang commented Jun 21, 2022

My dataset has one speaker, and the following is the query result from tts --list_speaker_idxs:

 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Using Griffin-Lim as no vocoder model defined
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{'baker': 0}

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Can you try synthesizing a text?
tts --text "some text" --speaker_idx ... --output..

@vinson-zhang
Author

I use the params --speaker_idx "baker" --reference_wav 006637.wav.
I think self.emb_g should not be used in TTS/tts/models/vits.py:1159: g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)?

@vinson-zhang
Author

tts --text "some text" --speaker_idx works fine, and get a correct wav result!

@vinson-zhang
Author

I guess self.emb_g can only be used when the --reference_speaker_idx param is passed.

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

I will need to look into it when I get home (on my phone right now). I think there might be a problem because of having just a single speaker in training. Not sure though, I could be wrong.

@vinson-zhang
Author

OK, thanks, I look forward to your answer.

@vinson-zhang
Author

I tried to modify the code; it runs successfully now, but the result of the voice conversion is very poor! Maybe I made a mistake! The following are my changes:

  1. First change (TTS/tts/models/vits.py:1156):
    if self.args.use_speaker_embedding and not self.args.use_d_vector_file:
          g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)
          g_tgt = self.emb_g(speaker_cond_tgt).unsqueeze(-1)

Change To:

      if self.args.use_speaker_embedding and not self.args.use_d_vector_file:
          g_src = F.normalize(speaker_cond_src).unsqueeze(-1)
          g_tgt = self.emb_g(speaker_cond_tgt).unsqueeze(-1)
  2. Second change (TTS.tts.utils.synthesis.id_to_torch):
def id_to_torch(aux_id, cuda=False):
    if aux_id is not None:
        aux_id = np.asarray(aux_id)
        aux_id = torch.from_numpy(aux_id)
    if cuda:
        return aux_id.cuda()
    return aux_id

Change To:

def id_to_torch(aux_id, cuda=False):
    if aux_id is not None:
        aux_id = np.asarray([aux_id])
        aux_id = torch.from_numpy(aux_id)
    if cuda:
        return aux_id.cuda()
    return aux_id
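
To make the effect of the second change concrete, here is a small standalone sketch (not part of the issue itself): wrapping aux_id in a list turns the 0-dimensional tensor produced by the original code into a 1-d index tensor, which is the shape nn.Embedding expects for a batch of one speaker id.

import numpy as np
import torch

aux_id = 0  # an integer speaker id, e.g. from {'baker': 0}

before = torch.from_numpy(np.asarray(aux_id))   # tensor(0)   -> 0-d tensor, shape ()
after = torch.from_numpy(np.asarray([aux_id]))  # tensor([0]) -> 1-d tensor, shape (1,)

print(before.shape, after.shape)  # torch.Size([]) torch.Size([1])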

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Good catch!
Try changing only the 2nd file. If you revert the first file's change, does it work? Thanks.

@vinson-zhang
Author

After reverting the first file's change, I still get the exception:

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

I see. I would also try to print "g_src" (both with your change reverted and with it applied) and check its shape/type to debug further. I think it must be a case of training a single speaker on a multi-speaker model (because I just ran my multi-speaker VITS and it worked fine).

@vinson-zhang
Author

vinson-zhang commented Jun 21, 2022

I have used the VCTK dataset for a test and still get the error:


/opt/conda/lib/python3.8/site-packages/torch/functional.py:695: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  ../aten/src/ATen/native/SpectralOps.cpp:798.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/opt/conda/bin/tts", line 33, in <module>
    sys.exit(load_entry_point('TTS', 'console_scripts', 'tts')())
  File "/TTS/TTS/bin/synthesize.py", line 309, in main
    wav = synthesizer.tts(
  File "/TTS/TTS/utils/synthesizer.py", line 339, in tts
    outputs = transfer_voice(
  File "/TTS/TTS/tts/utils/synthesis.py", line 304, in transfer_voice
    model_outputs = _func(reference_wav, speaker_id, d_vector, reference_speaker_id, reference_d_vector)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/TTS/TTS/tts/models/vits.py", line 1140, in inference_voice_conversion
    wav, _, _ = self.voice_conversion(y, y_lengths, speaker_cond_src, speaker_cond_tgt)
  File "/TTS/TTS/tts/models/vits.py", line 1157, in voice_conversion
    g_src = self.emb_g(speaker_cond_src).unsqueeze(-1)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)


@vinson-zhang
Author

This is the train.py code:

import os

from trainer import Trainer, TrainerArgs

from TTS.config.shared_configs import BaseAudioConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, CharactersConfig, VitsArgs
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.speakers import SpeakerManager

output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(
    name="vctk_old", path="/datasets/temp_vctk", language="en-us"
)
audio_config = BaseAudioConfig(
    sample_rate=48000,
    win_length=1024,
    hop_length=256,
    num_mels=80,
    preemphasis=0.0,
    ref_level_db=20,
    log_func="np.log",
    do_trim_silence=True,
    trim_db=45,
    mel_fmin=0,
    mel_fmax=None,
    spec_gain=1.0,
    signal_norm=False,
    do_amp_to_db_linear=False,
)


vitsArgs = VitsArgs(
    use_speaker_embedding=True,
    use_sdp=False,
    use_speaker_encoder_as_loss=True,
    speaker_encoder_config_path="/TTS/models/tts_models--multilingual--multi-dataset--your_tts/config_se.json",
    speaker_encoder_model_path="/TTS/models/tts_models--multilingual--multi-dataset--your_tts/model_se.pth",
    speaker_embedding_channels=512,
)


config = VitsConfig(
    model_args=vitsArgs,
    audio=audio_config,
    run_name="vits_vctk",
    batch_size=32,
    eval_batch_size=16,
    batch_group_size=5,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=False,
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    max_text_len=325,  # change this if you have a larger VRAM than 16GB
    output_path=output_path,
    datasets=[dataset_config],
    test_sentences=[
        ["What are you doing?", "VCTK_old_p225", None, "en-us"],
        ["My name is mike. I'm not fine!", "VCTK_old_p226", None, "en-us"],
    ],
)

# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# config is updated with the default characters if not defined in the config.
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of ```[text, audio_file_path, speaker_name]```
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

speaker_manager = SpeakerManager()
speaker_manager.use_cuda = True
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

# init model
model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager)

# init the trainer and run the training
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()


@p0p4k
Contributor

p0p4k commented Jun 21, 2022

I see. I would also try to print "g_src" (both with your change reverted and with it applied) and check its shape/type to debug further. I think it must be a case of training a single speaker on a multi-speaker model (because I just ran my multi-speaker VITS and it worked fine).

Did you try this?

Btw, I used the pre-trained model (vctk/vits); it works fine on my side.

@vinson-zhang
Author

When using --text, my env works fine too!

When using --reference_wav 006637.wav, I get the above exception!

Have you tried using --reference_wav?

@vinson-zhang
Author

I have tried the VCTK dataset and still got the exception!

@p0p4k
Contributor

p0p4k commented Jun 21, 2022

Yes, I finally got a chance to try it on my PC. I can confirm this bug as well. I believe that instead of a "reference_speaker_idx", we happen to pass the "reference_embedding" when calculating g_src.
Your temporary fix seems to work for now. 👍
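
For readers following along, here is a simplified standalone sketch of the two conditioning paths involved, modeled on the voice_conversion code quoted in this thread rather than the exact library source: an integer speaker index has to go through the emb_g lookup table, while a float d-vector from an external speaker encoder is used directly, so routing the embedding down the index path produces the FloatTensor error above.

import torch
import torch.nn as nn
import torch.nn.functional as F

def speaker_conditioning(emb_g, speaker_cond, use_speaker_embedding, use_d_vector_file):
    """Simplified sketch of how source/target speaker conditioning is built."""
    if use_speaker_embedding and not use_d_vector_file:
        # speaker_cond must be an integer index tensor here, e.g. torch.tensor([0])
        return emb_g(speaker_cond).unsqueeze(-1)
    if use_d_vector_file:
        # speaker_cond is already a float embedding, e.g. shape [1, 256]
        return F.normalize(speaker_cond).unsqueeze(-1)
    raise ValueError("no speaker conditioning configured")

emb_g = nn.Embedding(2, 256)
g_from_idx = speaker_conditioning(emb_g, torch.tensor([0]), True, False)     # OK
g_from_dvec = speaker_conditioning(emb_g, torch.randn(1, 256), False, True)  # OK
# speaker_conditioning(emb_g, torch.randn(1, 256), True, False)  # -> the reported RuntimeError
print(g_from_idx.shape, g_from_dvec.shape)  # torch.Size([1, 256, 1]) torch.Size([1, 256, 1])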

@vinson-zhang
Author

But the result of voice conversion is very poor; maybe I made a mistake. I look forward to the official fix.

@p0p4k
Contributor

p0p4k commented Jun 22, 2022

  • Note: For anyone joining this discussion now and encountering the same bug, ignore the discussion above. Many things I said above seem to be highly inaccurate and misleading.
  • I was not well informed about the internals, and I am actively trying to understand the bug in order to fix it.
  • Before voice conversion was officially patched, I was just using d-vectors computed from wavs of target speakers stored in a separate folder, using my own custom function. The code that I used was more or less the same as the current voice_conversion function, with no way of using embeddings.
  • So, I might make a new post with a better understanding of the current official code soon, unless some higher-ups decide to chime in. Thanks.

@p0p4k
Contributor

p0p4k commented Jun 22, 2022

Hi @vinson-zhang, can you check/test my PR? Thanks.
I tried it on my trained VITS model for another language and it works fine.
However, my speaker encoder is 512-dimensional and the VCTK-based tts-model/VITS uses 256, so you can help me test it.
Below are some of the combinations you can try, for your convenience:
Thanks!

# format -> tgt - ref 
# idx - wav 
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_idx $'<speaker name>'

# wav - wav
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

# wav - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav> --text "random testing text."

# idx - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --text "random testing text."  --speaker_idx $'<speaker name>'

@vinson-zhang
Author

Ok, I'll try to test it

@vinson-zhang
Author

vinson-zhang commented Jun 22, 2022

Hi @vinson-zhang, can you check/test my PR? Thanks. I tried it on my trained VITS model for another language and it works fine. However, my speaker encoder is 512-dimensional and the VCTK-based tts-model/VITS uses 256, so you can help me test it. Below are some of the combinations you can try, for your convenience: Thanks!


tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_idx $'<speaker name>'

tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

I have tried the commands. The first and second execute successfully, but the last one failed. However, the result of voice conversion is still very poor, while the result generated by --text is relatively good.
Is there any way to optimize it?

@p0p4k
Contributor

p0p4k commented Jun 22, 2022

Try training a model with a multi-speaker dataset; then voice cloning among those speakers works really well. It is a very new area of research and I do not know how to improve it right now. If I find something, I will let you know.

@p0p4k
Contributor

p0p4k commented Jun 22, 2022

Hi @vinson-zhang, can you check/test my PR? Thanks. I tried it on my trained VITS model for another language and it works fine. However, my speaker encoder is 512-dimensional and the VCTK-based tts-model/VITS uses 256, so you can help me test it. Below are some of the combinations you can try, for your convenience: Thanks!

# format -> tgt - ref 
# idx - wav 
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_idx $'<speaker name>'

# wav - wav
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

# wav - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav> --text "random testing text."

# idx - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --text "random testing text."  --speaker_idx $'<speaker name>'

Hello, I fixed the tests again (now total 4). Can you do a final check? Thanks a lot. 😃

@vinson-zhang
Author

Try training a model with a multi-speaker dataset; then voice cloning among those speakers works really well. It is a very new area of research and I do not know how to improve it right now. If I find something, I will let you know.

OK Thanks

@vinson-zhang
Author

Hi @vinson-zhang, can you check/test my PR? Thanks. I tried it on my trained VITS model for another language and it works fine. However, my speaker encoder is 512-dimensional and the VCTK-based tts-model/VITS uses 256, so you can help me test it. Below are some of the combinations you can try, for your convenience: Thanks!

# format -> tgt - ref 
# idx - wav 
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_idx $'<speaker name>'

# wav - wav
tts --model_path <model_path> --config_path <model/config.json> --reference_wav <ref.wav> --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav>

# wav - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --speaker_wav <spk.wav> --text "random testing text."

# idx - idx + text
tts --model_path <model_path> --config_path <model/config.json> --reference_speaker_idx $'<ref_speaker name>'` --out_path <tts_output.wav> --encoder_path <speaker_encoder_model> --encoder_config_path <speaker_encoder/config_se.json> --text "random testing text."  --speaker_idx $'<speaker name>'

Hello, I fixed the tests again (now total 4). Can you do a final check? Thanks a lot. 😃

All four commands can be executed successfully! 👍

@lexkoro
Collaborator

lexkoro commented Jun 22, 2022

@vinson-zhang Are you trying to use voice conversion with a model trained on only one speaker?
If yes, then you won't get good results from it.
You will have to train on many more speakers so that the model learns a certain variety of speech features.

@vinson-zhang
Author

Let me try the multi-speaker dataset and see what happens.

@Edresson
Contributor

My dataset has one speaker, and the following is the query result from tts --list_speaker_idxs:

 | > do_rms_norm:True
 | > db_level:-27.0
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > Using Griffin-Lim as no vocoder model defined
 > Available speaker ids: (Set --speaker_idx flag to one of these values to use the multi-speaker model.
{'baker': 0}

@vinson-zhang Are you trying to use voice conversion with a model trained on only one speaker? If yes, then you won't get good results from it. You will have to train on many more speakers so that the model learns a certain variety of speech features.

Thanks @lexkoro and @p0p4k. Agreed with @lexkoro that to be able to do voice conversion you must train a model with a minimum of 2 speakers. I do not see any application where you need to do voice conversion using just one speaker (you can have just one speaker in a target language, but you will need more speakers in other languages to be able to get useful results). Voice conversion inference is not supposed to be compatible with a model trained with just one speaker; it is not even supported (because I didn't see any application for it). If you plan to do voice conversion, you need a minimum of 2 speakers for all possible applications.

@Edresson
Contributor

Edresson commented Jun 27, 2022

(quoting the original issue report in full; see the top of this thread)

In this case, the expected result is not voice conversion success. You are trying to use the speaker encoder with a model trained with internal speaker embeddings (use_speaker_embedding=True). You need to provide --reference_speaker_idx; otherwise, the model will try to extract the speaker embedding using the speaker encoder.

tts  --model_path ./vits_baker_temp-June-20-2022_02+48PM-0000000/best_model.pth --config_path ./vits_baker_temp-June-20-2022_02+48PM-0000000/config.json --speaker_idx "baker" --out_path output.wav --reference_wav  006637.wav --reference_speaker_idx  "baker" 

In addition, you must train your model with more than one speaker; otherwise, your model will only be able to generate the voice of one speaker (and then it is useless for voice conversion). Please try the command above using a multi-speaker model.

@erogol
Member

erogol commented Jul 11, 2022

@Edresson looks like it is not a bug, right?

@Edresson
Contributor

@Edresson looks like it is not a bug, right?

Yeah, it is not a bug.

@vinson-zhang
Author

vinson-zhang commented Jul 15, 2022

@Edresson I'm trying to convert a speaker outside the training set to a speaker inside the training set. What should I do?

@Edresson
Contributor

@Edresson I'm trying to convert a speaker outside the training set to a speaker inside the training set. What should I do?

It is not currently supported. However, you can write your own code, as in the YourTTS Colab demos, where it is possible to do what you want.
