
Questions related to MeloTTS #1193

Open
eehoeskrap opened this issue Jul 31, 2024 · 65 comments

@eehoeskrap

Thank you for creating a great repository.
I wonder why there is no bert input when converting a PyTorch MeloTTS model to an ONNX model.
https://github.com/k2-fsa/sherpa-onnx/blob/963aaba82b01a425ae8dcf0fdcff6b073a45686f/scripts/melo-tts/export-onnx.py#L206C1-L235C6

    torch.onnx.export(
        torch_model,
        (
            x,
            x_lengths,
            tones,
            sid,
            noise_scale,
            length_scale,
            noise_scale_w,
        ),
        filename,
        opset_version=opset_version,
        input_names=[
            "x",
            "x_lengths",
            "tones",
            "sid",
            "noise_scale",
            "length_scale",
            "noise_scale_w",
        ],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )
@csukuangfj (Collaborator) commented Jul 31, 2024

Could you tell us how to get the input for bert from texts?

Is there any C++ implementation for that?

@eehoeskrap (Author)

In this code, you can get the bert value through the get_bert function.
get_bert uses a different torch model for each language, and there is only a Python implementation.
https://github.com/myshell-ai/MeloTTS/blob/144a0980fac43411153209cf08a1998e3c161e10/melo/utils.py#L22
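For context, a rough sketch of what a get_bert-style function typically does: run the text through a BERT encoder and repeat each token's hidden state for the phones it maps to. This is only an illustration, not the actual MeloTTS implementation; the model name and the word2ph token-to-phone mapping below are assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

def get_bert_features(text, word2ph, model_name="bert-base-multilingual-cased"):
    # word2ph[i] = number of phones produced by the i-th BERT token (an assumed
    # alignment; MeloTTS computes this per language in its text frontend).
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    assert len(word2ph) == hidden.shape[0]
    # Repeat each token feature for its phones so the result aligns with the phone
    # sequence fed to the acoustic model; transpose to (hidden_dim, num_phones),
    # matching the (D, L) layout of the model's bert input.
    phone_level = torch.cat(
        [hidden[i].unsqueeze(0).expand(n, -1) for i, n in enumerate(word2ph)], dim=0
    )
    return phone_level.T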

@eehoeskrap (Author)

In your code, bert and ja_bert are passed as model inputs in ModelWrapper.

However, even though I specified input_names as below when exporting to the ONNX model, the resulting ONNX file has no bert input.

    torch.onnx.export(
        torch_model,
        (
            x,
            x_lengths,
            sid,
            tones,
            lang_id,
            bert,
            ja_bert,
            sdp_ratio,
            noise_scale,
            noise_scale_w,
            length_scale,
        ),
        filename,
        opset_version=opset_version,
        input_names=[
            "x",
            "x_lengths",
            "sid",
            "tones",
            "lang_id",
            "bert",
            "ja_bert",
            "sdp_ratio",
            "noise_scale",
            "noise_scale_w",
            "length_scale",
        ],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "lang_id": {0: "N", 1: "L"},
            "bert": {0: "N", 1: "L", 2: "D"},
            "ja_bert": {0: "N", 1: "L", 2: "D"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )

@csukuangfj (Collaborator)

Could you tell us how to get the input for bert from texts?

Is there any C++ implementation for that?

Please have a look at this comment. That is the main obstacle. If you can fix it, then we can support bert.

@csukuangfj (Collaborator)

In this code, you can get the bert value through the get_bert function.

Yes, I know that. I am asking whether you know of a C++ implementation for that, or whether it is possible to implement it in C++?

@eehoeskrap (Author) commented Jul 31, 2024

In this code, you can get the bert value through the get_bert function.

Yes, I know that. I am asking whether you know of a C++ implementation for that, or whether it is possible to implement it in C++?

As far as I know, there is currently no C++ implementation of a Korean BERT. I will try it and let you know.

@csukuangfj (Collaborator)

By the way, the main issue is about the tokenizer.

@eehoeskrap (Author)

By the way, the main issue is about the tokenizer.

Yes, I know that.
If you run the ONNX model with the bert value set to 0, as in the code below, the Korean voice sounds awkward.

bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)

@csukuangfj (Collaborator)

If you run the ONNX model with the bert value set to 0, the Korean voice sounds awkward.

In that case, supporting Korean models from MeloTTS in sherpa-onnx may be hard.

Could you try
https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-mimic3-ko_KO-kss_low.tar.bz2

We already have a Korean TTS model in sherpa-onnx.

@eehoeskrap (Author)

If you run the ONNX model with the bert value set to 0, the Korean voice sounds awkward.

In that case, supporting Korean models from MeloTTS in sherpa-onnx may be hard.

Could you try https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-mimic3-ko_KO-kss_low.tar.bz2

We already have a Korean TTS model in sherpa-onnx.

I found this repo while trying to export MeloTTS models to ONNX.
When exporting to ONNX with this code, I was wondering why bert was not included.
Thanks to your answer, I found out that it is because there is no C++ implementation.

I already have a Korean TTS model trained with custom data.
I just succeeded in exporting it to ONNX, including the bert values.
However, the preprocessing (tokenizer, etc.) was run in Python.

The Korean MeloTTS torch model exported to ONNX is quite fast for inference.
However, I need to try a C++ implementation of the preprocessing like you did. I will try this.
That said, Korean phoneme processing is quite difficult.

As you mentioned earlier, the biggest question is indeed "How do we implement the bert torch model in C++?"
First, let's try exporting the bert model to ONNX.

Thank you for the reply.
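As a rough starting point for that export, something like the sketch below might work for a Hugging Face BERT encoder; the model name is a placeholder for whichever Korean BERT is actually used, and the real model may need extra inputs such as token_type_ids.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-multilingual-cased"  # placeholder; use the actual Korean BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()
model.config.return_dict = False  # export a plain tuple instead of a ModelOutput

inputs = tokenizer("안녕하세요", return_tensors="pt")

torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "bert.onnx",
    opset_version=13,
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "N", 1: "T"},
        "attention_mask": {0: "N", 1: "T"},
        "last_hidden_state": {0: "N", 1: "T"},
    },
)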

@csukuangfj (Collaborator)

Currently the iOS version has to process the entire text before synthesizing the audio.

I just added support for passing a callback from Swift to C. Please see #1218

Please play the samples received in the callback yourself, possibly in a separate thread. We don't have time to add that.


Finally, I also noticed the iOS version can't be published to the App Store due to a framework issue.

Please have a look at #1172


By the way, contributions to sherpa-onnx are highly appreciated.

Hope that you can fix the issues by yourself.

@nanaghartey

@csukuangfj No problem. I actually made some contributions, but noticed the latest version fixes most of the issues I found. For example, in sherpa-onnx/jni/jni.cc some reserved words in Java were used, preventing porting of the sample TTS Kotlin code to Java, e.g. Java_com_k2fsa_sherpa_onnx_SpeakerEmbeddingExtractor_new. Now all is good!

By the way, I just checked out MeloTTS, fine-tuned a model and exported it to sherpa-onnx for Android. It's great. How can I help bring this to iOS? I'm not sure the SwiftUI TTS example accepts MeloTTS models.

@csukuangfj (Collaborator)

How can I help bring this to iOS? I'm not sure the SwiftUI TTS example accepts MeloTTS models.

Yes, it is already supported. In case you don't know how to do it, I just added an example for you.
Please see
#1223

@nanaghartey

@csukuangfj I have a single-speaker fine-tuned model (melo). It works great, but when I convert it to sherpa-onnx and then use the provided zh_en *.fst and *.dict on Android, I get wrong synthesis. I assumed it would work since my model is English. How can I generate the *.fst and *.dict files for my custom model? Or can we make it work by changing the configuration?

@csukuangfj (Collaborator)

You don't need *.fst for English-only models.

Could you post the code about how you add the metadata?


I get wrong synthesis.

Could you be more specific? What does wrong mean?

@nanaghartey

@csukuangfj Thanks for the prompt response.

"Wrong" here means unexpected output: wrong pronunciations.

Sorry, but this is how I export (the default export script only exports Chinese+English):

import torch
from melo.api import TTS
from melo.text import language_id_map, language_tone_start_map
from melo.text.chinese import pinyin_to_symbol_map
from melo.text.english import eng_dict, refine_syllables
from pypinyin import Style, lazy_pinyin, phrases_dict, pinyin_dict
from typing import Any, Dict
import json

# Prepare the pinyin to symbol map
for k, v in pinyin_to_symbol_map.items():
    if isinstance(v, list):
        break
    pinyin_to_symbol_map[k] = v.split()

# Function to get initial, final, and tone from pinyin
def get_initial_final_tone(word: str):
    initials = lazy_pinyin(word, neutral_tone_with_five=True, style=Style.INITIALS)
    finals = lazy_pinyin(word, neutral_tone_with_five=True, style=Style.FINALS_TONE3)

    ans_phone = []
    ans_tone = []

    for c, v in zip(initials, finals):
        raw_pinyin = c + v
        v_without_tone = v[:-1]
        try:
            tone = v[-1]
        except:
            return [], []

        pinyin = c + v_without_tone
        if c:
            v_rep_map = {
                "uei": "ui",
                "iou": "iu",
                "uen": "un",
            }
            if v_without_tone in v_rep_map.keys():
                pinyin = c + v_rep_map[v_without_tone]
        else:
            pinyin_rep_map = {
                "ing": "ying",
                "i": "yi",
                "in": "yin",
                "u": "wu",
            }
            if pinyin in pinyin_rep_map.keys():
                pinyin = pinyin_rep_map[pinyin]
            else:
                single_rep_map = {
                    "v": "yu",
                    "e": "e",
                    "i": "y",
                    "u": "w",
                }
                if pinyin[0] in single_rep_map.keys():
                    pinyin = single_rep_map[pinyin[0]] + pinyin[1:]

        if pinyin not in pinyin_to_symbol_map:
            continue
        phone = pinyin_to_symbol_map[pinyin]
        ans_phone += phone
        ans_tone += [tone] * len(phone)

    return ans_phone, ans_tone

# Function to generate tokens file
def generate_tokens(symbol_list):
    with open("tokens.txt", "w", encoding="utf-8") as f:
        for i, s in enumerate(symbol_list):
            f.write(f"{s} {i}\n")

# Function to add new English words to the lexicon
def add_new_english_words(lexicon):
    lexicon["kaldi"] = [["K", "AH0"], ["L", "D", "IH0"]]
    lexicon["SF"] = [["EH1", "S"], ["EH1", "F"]]

# Function to generate lexicon file
def generate_lexicon():
    word_dict = pinyin_dict.pinyin_dict
    phrases = phrases_dict.phrases_dict
    add_new_english_words(eng_dict)
    with open("lexicon.txt", "w", encoding="utf-8") as f:
        for word in eng_dict:
            phones, tones = refine_syllables(eng_dict[word])
            tones = [t + language_tone_start_map["EN"] for t in tones]
            tones = [str(t) for t in tones]

            phones = " ".join(phones)
            tones = " ".join(tones)

            f.write(f"{word.lower()} {phones} {tones}\n")

        for key in word_dict:
            if not (0x4E00 <= key <= 0x9FA5):
                continue
            w = chr(key)
            phone, tone = get_initial_final_tone(w)
            if not phone:
                continue
            phone = " ".join(phone)
            tone = " ".join(tone)
            f.write(f"{w} {phone} {tone}\n")

        for w in phrases:
            phone, tone = get_initial_final_tone(w)
            if not phone:
                continue
            phone = " ".join(phone)
            tone = " ".join(tone)
            f.write(f"{w} {phone} {tone}\n")

# Function to add metadata to ONNX model
def add_meta_data(filename: str, meta_data: Dict[str, Any]):
    import onnx
    model = onnx.load(filename)
    while len(model.metadata_props):
        model.metadata_props.pop()

    for key, value in meta_data.items():
        meta = model.metadata_props.add()
        meta.key = key
        meta.value = str(value)

    onnx.save(model, filename)

# ModelWrapper class definition
class ModelWrapper(torch.nn.Module):
    def __init__(self, model: "SynthesizerTrn"):
        super().__init__()
        self.model = model
        self.lang_id = language_id_map[model.language]

    def forward(
        self,
        x,
        x_lengths,
        tones,
        sid,
        noise_scale,
        length_scale,
        noise_scale_w,
        max_len=None,
    ):
        bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)
        ja_bert = torch.zeros(x.shape[0], 768, x.shape[1], dtype=torch.float32)
        lang_id = torch.zeros_like(x)
        lang_id[:, 1::2] = self.lang_id
        return self.model.model.infer(
            x=x,
            x_lengths=x_lengths,
            sid=sid,
            tone=tones,
            language=lang_id,
            bert=bert,
            ja_bert=ja_bert,
            noise_scale=noise_scale,
            noise_scale_w=noise_scale_w,
            length_scale=length_scale,
        )[0]

# Main function to handle model loading and ONNX export
def main():
    generate_lexicon()  # Generate the lexicon.txt file

    model_path = "model.pth"  # Path to your custom model
    config_path = "config.json"  # Path to your config.json file
    with open(config_path, 'r') as f:
        config = json.load(f)

    model = TTS(language="EN", device="cpu", config_path=config_path, ckpt_path=model_path)
    model.load_state_dict(torch.load(model_path, map_location="cpu"), strict=False)

    generate_tokens(config["symbols"])  # Generate tokens.txt file

    torch_model = ModelWrapper(model)

    x = torch.randint(low=0, high=10, size=(60,), dtype=torch.int64)
    x_lengths = torch.tensor([x.size(0)], dtype=torch.int64)
    sid = torch.tensor([0], dtype=torch.int64)
    tones = torch.zeros_like(x)

    noise_scale = torch.tensor([1.0], dtype=torch.float32)
    length_scale = torch.tensor([1.0], dtype=torch.float32)
    noise_scale_w = torch.tensor([1.0], dtype=torch.float32)

    x = x.unsqueeze(0)
    tones = tones.unsqueeze(0)

    filename = "model.onnx"
    torch.onnx.export(
        torch_model,
        (x, x_lengths, tones, sid, noise_scale, length_scale, noise_scale_w),
        filename,
        opset_version=13,
        input_names=["x", "x_lengths", "tones", "sid", "noise_scale", "length_scale", "noise_scale_w"],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )

    meta_data = {
        "model_type": "melo-vits",
        "comment": "melo",
        "version": 2,
        "language": "English",
        "add_blank": int(config["data"]["add_blank"]),
        "n_speakers": config["data"]["n_speakers"],
        "jieba": 1,
        "sample_rate": config["data"]["sampling_rate"],
        "bert_dim": 1024,
        "ja_bert_dim": 768,
        "speaker_id": list(config["data"]["spk2id"].values())[0],
        "lang_id": language_id_map["EN"],
        "tone_start": language_tone_start_map["EN"],
        "url": "https://github.com/myshell-ai/MeloTTS",
        "license": "MIT license",
        "description": "MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai",
    }
    add_meta_data(filename, meta_data)

if __name__ == "__main__":
    main()

Then in api.py I do:

class TTS(nn.Module):
    def __init__(self, 
                 language,
                 device='auto',
                 use_hf=True,
                 config_path=None,
                 ckpt_path=None):
        super().__init__()
        if device == 'auto':
            device = 'cpu'
            if torch.cuda.is_available():
                device = 'cuda'
            if torch.backends.mps.is_available():
                device = 'mps'
        if 'cuda' in device:
            assert torch.cuda.is_available()

        # Load configuration from your custom config_path
        if config_path:
            hps = utils.get_hparams_from_file(config_path)
        else:
            hps = load_or_download_config(language, use_hf=use_hf)

        num_languages = hps.num_languages
        num_tones = hps.num_tones
        symbols = hps.symbols

        model = SynthesizerTrn(
            len(symbols),
            hps.data.filter_length // 2 + 1,
            hps.train.segment_size // hps.data.hop_length,
            n_speakers=hps.data.n_speakers,
            num_tones=num_tones,
            num_languages=num_languages,
            **hps.model,
        ).to(device)

        model.eval()
        self.model = model
        self.symbol_to_id = {s: i for i, s in enumerate(symbols)}
        self.hps = hps
        self.device = device

        # load state_dict
        checkpoint_dict = load_or_download_model(language, device, use_hf=use_hf, ckpt_path=ckpt_path)
        self.model.load_state_dict(checkpoint_dict['model'], strict=True)

        language = language.split('_')[0]
        self.language = 'ZH_MIX_EN' if language == 'ZH' else language

@csukuangfj (Collaborator) commented Aug 7, 2024

"wrong" here means unexpected output. wrong pronunciations.

Could you post some text and the corresponding generated wav?


please also post the logs if you use sherpa-onnx to generate the wav with your model.

@csukuangfj (Collaborator)

https://github.com/csukuangfj/onnxruntime-build/actions/runs/9184634501

You can see from the above link that we can successfully build a debug version of the static lib.

@nanaghartey

"wrong" here means unexpected output. wrong pronunciations.

Could you post some text and the corresponding generated wav?

please also post the logs if you use sherpa-onnx to generate the wav with your model.

custom model 1 : Eng, news (african accent)

text - "things to look out for in the year 2020"

.pth generated wav -

output.mov

onnx generated wav -

generated.mov

Custom model 2: English, singing (US accent)

text - "next time won't you sing with me"

.pth generated wav -

output.mov

onnx generated wav -

generated.mov

I use sherpa-onnx but don't get logs. I was only trying out MeloTTS on sherpa-onnx, so the models were not trained for long (training is not the issue, though).

I hope you're able to spot the issue. Thanks

@nanaghartey

@csukuangfj I can also share my model.pth and config.json files if that'd help.

@csukuangfj (Collaborator)

When you use .pth to test your model, can you zero out the bert part and try again?

@nanaghartey

When you use .pth to test your model, can you zero out the bert part and try again?

The result is still better than the ONNX output when I zero out the bert part.

@csukuangfj (Collaborator)

Could you show the code about how you did that?

@nanaghartey

In api.py, in def tts_to_file(), I did:

    bert = torch.zeros_like(bert).to(device)

Please share your solution if that is wrong.

@csukuangfj (Collaborator)

Could you please post the complete code?

@nanaghartey

Could you please post the complete code?

def tts_to_file(self, text, speaker_id, output_path=None, sdp_ratio=0.2, noise_scale=0.6, noise_scale_w=0.8, speed=1.0, pbar=None, format=None, position=None, quiet=False,):
       language = self.language
       texts = self.split_sentences_into_pieces(text, language, quiet)
       audio_list = []
       if pbar:
           tx = pbar(texts)
       else:
           if position:
               tx = tqdm(texts, position=position)
           elif quiet:
               tx = texts
           else:
               tx = tqdm(texts)
       for t in tx:
           if language in ['EN', 'ZH_MIX_EN']:
               t = re.sub(r'([a-z])([A-Z])', r'\1 \2', t)
           device = self.device
           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           #bert = torch.zeros_like(bert).to(device)
           #ja_bert = torch.zeros_like(ja_bert).to(device)
           with torch.no_grad():
               x_tst = phones.to(device).unsqueeze(0)
               tones = tones.to(device).unsqueeze(0)
               lang_ids = lang_ids.to(device).unsqueeze(0)
               bert = bert.to(device).unsqueeze(0)
               ja_bert = ja_bert.to(device).unsqueeze(0)
               x_tst_lengths = torch.LongTensor([phones.size(0)]).to(device)
               del phones
               speakers = torch.LongTensor([speaker_id]).to(device)
               audio = self.model.infer(
                       x_tst,
                       x_tst_lengths,
                       speakers,
                       tones,
                       lang_ids,
                       bert,
                       ja_bert,
                       sdp_ratio=sdp_ratio,
                       noise_scale=noise_scale,
                       noise_scale_w=noise_scale_w,
                       length_scale=1. / speed,
                   )[0][0, 0].data.cpu().float().numpy()
               del x_tst, tones, lang_ids, bert, ja_bert, x_tst_lengths, speakers
               # 
           audio_list.append(audio)
       torch.cuda.empty_cache()
       audio = self.audio_numpy_concat(audio_list, sr=self.hps.data.sampling_rate, speed=speed)

       if output_path is None:
           return audio
       else:
           if format:
               soundfile.write(output_path, audio, self.hps.data.sampling_rate, format=format)
           else:
               soundfile.write(output_path, audio, self.hps.data.sampling_rate)

@csukuangfj (Collaborator)

In api.py, in def tts_to_file(), I did:

    bert = torch.zeros_like(bert).to(device)

Please share your solution if that is wrong.

Could you change

           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           #bert = torch.zeros_like(bert).to(device)
           #ja_bert = torch.zeros_like(ja_bert).to(device)

to

           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           bert.zero_()
           ja_bert.zero_()

@nanaghartey commented Aug 8, 2024

@csukuangfj

    bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
    bert.zero_()
    ja_bert.zero_()

The result is a generated wav that sounds almost the same as the original .pth inference (without zeroing out), except for a few pronunciations that sound off. However, it's way better than the wavs from ONNX above. Here is the output with bert zeroed out:

eng.mov
output_sing.mov

I then tried:

    bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)
    ja_bert = torch.zeros(x.shape[0], 768, x.shape[1], dtype=torch.float32)
    bert.zero_()
    ja_bert.zero_()

in export-onnx.py for the ONNX conversion, but I got the same "wrong" results shared earlier.

@csukuangfj (Collaborator)

Please compare the inputs to the model manually and see if they are the same.
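One way to approach this, assuming the tensors from the export script above are still in scope, is to feed exactly the same inputs to both the PyTorch wrapper and the exported ONNX file and compare the results. Note that the flow-based decoder samples noise internally, so small differences (and different output lengths) are expected; this only catches gross mismatches such as wrong shapes or wrong token ids. A minimal sketch:

import numpy as np
import onnxruntime as ort
import torch

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
y_onnx = sess.run(
    ["y"],
    {
        "x": x.numpy(),
        "x_lengths": x_lengths.numpy(),
        "tones": tones.numpy(),
        "sid": sid.numpy(),
        "noise_scale": noise_scale.numpy(),
        "length_scale": length_scale.numpy(),
        "noise_scale_w": noise_scale_w.numpy(),
    },
)[0]

with torch.no_grad():
    y_torch = torch_model(x, x_lengths, tones, sid, noise_scale, length_scale, noise_scale_w).numpy()

print("onnx:", y_onnx.shape, "torch:", y_torch.shape)
n = min(y_onnx.shape[-1], y_torch.shape[-1])
print("mean abs diff over overlap:", np.mean(np.abs(y_onnx[..., :n] - y_torch[..., :n])))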

@dhc45010 commented Aug 15, 2024

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error; what could be the reason? (The CPU version works.)
python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx
(error screenshot attached)

@csukuangfj (Collaborator)

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

We cannot solve this problem at the moment.

@dhc45010

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

We cannot solve this problem at the moment.

OK, thanks.

@studionexus-lk

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

@csukuangfj (Collaborator)

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

Please see
https://colab.research.google.com/drive/1XsKyAXti1e6_qYiJ3Fiyt8E7d1lPch75?usp=sharing

It is for the Chinese+English MeloTTS model.

@nanaghartey commented Aug 25, 2024

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

Please see https://colab.research.google.com/drive/1XsKyAXti1e6_qYiJ3Fiyt8E7d1lPch75?usp=sharing

It is for the Chinese+English MeloTTS model.

Is there one for English only? In the future, if there is a way to convert a standard English model from the official training script, can you share it here? Thanks

@csukuangfj (Collaborator)

Sorry, I only have this one.

@csukuangfj (Collaborator)

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

Please use onnxruntime 1.12.0.

Someone in the WeChat group reported that running MeloTTS on GPU with onnxruntime 1.12.0 works without problems.
@dhc45010

@csukuangfj (Collaborator)

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

Please see

#1379

@dhc45010

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

Please see

#1379

@dhc45010

OK, thanks a lot for the reply. I will try it later.

@nanaghartey

@csukuangfj any updates on getting the default MeloTTS models to work?

@csukuangfj (Collaborator)

@csukuangfj any updates on getting the default MeloTTS models to work?

Could you describe the issue you have?
@nanaghartey

@csukuangfj any updates on getting the default MeloTTS models to work?

Could you describe the issue you have?

@nanaghartey

There is support for the Chinese+English MeloTTS model only. If one wants to use MeloTTS, they have to stick to the Chinese+English model.
I'm asking if there are any updates/documentation on converting, e.g., standard English MeloTTS models.

@csukuangfj (Collaborator)

Please adapt our current script. If you have any trouble, please post error logs.

@nanaghartey

Please adapt our current script. If you have any trouble, please post error logs.

I already tried that above and it didn't work.

@csukuangfj (Collaborator)

Please adapt our current script. If you have any trouble, please post error logs.

I already tried that above and it didn't work.

Please see
#1509

and please find out why it didn't work for you. @nanaghartey

@nanaghartey

Please adapt our current script. If you have any trouble, please post error logs.

I already tried that above and it didn't work.

Please see #1509

and please find out why it didn't work for you. @nanaghartey

Thanks for this. I tried it out with the model you shared: https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-melo-tts-en.tar.bz2

I have this:

modelName = "model.onnx";
        dictDir = "model/dict";
        lexicon  = "lexicon.txt";
        dataDir = null;

I use the same dict as for the Chinese+English model since I don't have any other. I get this when I run the app:

Current model is not using jieba but you provided --vits-dict-dir

The app hangs during startup with the logs below:


2024-11-05 18:48:37.139 13801-13801 sherpa-onnx com.k2fsa.sherpa.onnx W  ---vits model---
description=MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai
license=MIT license
url=https://github.com/myshell-ai/MeloTTS
tone_start=7
speaker_id=0
ja_bert_dim=768
version=2
bert_dim=1024
add_blank=1
sample_rate=44100
n_speakers=4
comment=melo
lang_id=2
language=English
jieba=0
model_type=melo-vits
----------input names----------
0 x
1 x_lengths
2 tones
3 sid
4 noise_scale
5 length_scale
6 noise_scale_w
----------output names----------
0 y
2024-11-05 18:48:37.194 13801-13801 sherpa-onnx com.k2fsa.sherpa.onnx W  Current model is not using jieba but you provided --vits-dict-dir
---------------------------- PROCESS ENDED (13801) for package com.k2fsa.sherpa.onnx ----------------------------

The Chinese+English model runs fine with these logs:

---vits model---
description=MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai
license=MIT license
url=https://github.com/myshell-ai/MeloTTS
tone_start=0
speaker_id=1
ja_bert_dim=768
version=2
bert_dim=1024
add_blank=1
sample_rate=44100
n_speakers=1
comment=melo
lang_id=3
language=Chinese + English
jieba=1
model_type=melo-vits
----------input names----------
0 x
1 x_lengths
2 tones
3 sid
4 noise_scale
5 length_scale
6 noise_scale_w
----------output names----------
0 y

@csukuangfj (Collaborator)

Please don't use files not included in the model directory you downloaded.

That is, do not use the dict dir.

@csukuangfj (Collaborator)

Everything you need is included in the model tar.bz2 file.

Please see the comment in #1509 for usage
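For a quick check outside Android, a Python configuration along the lines of python-api-examples/offline-tts.py should work with only the files shipped in vits-melo-tts-en.tar.bz2; treat the exact field names below as assumptions, and note that dict_dir and data_dir are simply left unset.

import sherpa_onnx
import soundfile as sf

config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="vits-melo-tts-en/model.onnx",
            lexicon="vits-melo-tts-en/lexicon.txt",
            tokens="vits-melo-tts-en/tokens.txt",
            # no dict_dir / data_dir for the English MeloTTS model
        ),
        num_threads=2,
    ),
)
tts = sherpa_onnx.OfflineTts(config)
audio = tts.generate("things to look out for in the year 2020", sid=0, speed=1.0)
sf.write("generated.wav", audio.samples, samplerate=audio.sample_rate)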

@nanaghartey

@csukuangfj
I forgot to mention I already tried that:

modelName = "model.onnx";
dictDir = null;
lexicon = "lexicon.txt";
dataDir = null;
String meloDir = copyDataDir(modelDir);
modelDir = meloDir + "/" + modelDir;
assets = null;

The app loads all right, but when I enter text and tap generate I get this:

2024-11-05 21:57:56.546 12497-12615 sherpa-onnx             com.k2fsa.sherpa.onnx                W  string is: hello
2024-11-05 21:57:56.546 12497-12615 sherpa-onnx             com.k2fsa.sherpa.onnx                W  Raw text: hello
2024-11-05 21:57:56.547 12497-12615 libc++abi               com.k2fsa.sherpa.onnx                E  terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found
2024-11-05 21:57:56.547 12497-12615 libc                    com.k2fsa.sherpa.onnx                A  Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 12615 (Thread-2), pid 12497 (fsa.sherpa.onnx)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A  Cmdline: com.k2fsa.sherpa.onnx
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A  pid: 12497, tid: 12615, name: Thread-2  >>> com.k2fsa.sherpa.onnx <<<
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #01 pc 00000000001a8c90  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #02 pc 00000000001a8588  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #03 pc 00000000001a8448  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #04 pc 00000000001c3328  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #05 pc 00000000001c329c  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (__cxa_throw+128) (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #06 pc 00000000001ef8d4  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #07 pc 00000000002e2dcc  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #08 pc 00000000002c16a0  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #09 pc 00000000001d64f8  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (Java_com_k2fsa_sherpa_onnx_OfflineTts_generateWithCallbackImpl+552) (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.831 12619-12619 DEBUG                   pid-12619                            A        #16 pc 000000000000390c  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/base.apk] (com.k2fsa.sherpa.onnx.OfflineTts.generateWithCallback+0)
2024-11-05 21:57:56.831 12619-12619 DEBUG                   pid-12619                            A        #21 pc 0000000000003b64  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/base.apk] (com.k2fsa.sherpa.onnx.Tts.generateWithCallback+0)
2024-11-05 21:57:56.831 12619-12619 DEBUG                   pid-12619                            A        #26 pc 00000000000025dc  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/base.apk] (com.k2fsa.sherpa.onnx.MainActivityBatches.lambda$onClickGenerate$6$com-k2fsa-sherpa-onnx-MainActivityBatches+0)
2024-11-05 21:57:56.831 12619-12619 DEBUG                   pid-12619                            A        #31 pc 0000000000001ce4  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/base.apk] (com.k2fsa.sherpa.onnx.MainActivityBatches$$ExternalSyntheticLambda3.run+0)
---------------------------- PROCESS ENDED (12497) for package com.k2fsa.sherpa.onnx ----------------------------

@csukuangfj (Collaborator)

Are you using the latest master to build the libraries?

How did you get the '.so' files?

@nanaghartey

For quick testing, I used the .so files in sherpa-onnx-1.10.30-arm64-v8a-zh_en-tts-vits-melo-tts-zh_en.apk from https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

I just built the .so files and tested. It works now! Thanks a lot. By the way, in the export-onnx-en script I only changed:

def main():
    generate_lexicon()

    language = "EN"
    model = TTS(language=language, device="cpu")

To

def main():
    generate_lexicon()

    model_path = "model.pth"  # Path to your custom model
    config_path = "config.json"  # Path to your config.json file
    with open(config_path, 'r') as f:
        config = json.load(f)

    model = TTS(language="EN", device="cpu", config_path=config_path, ckpt_path=model_path)
    model.load_state_dict(torch.load(model_path, map_location="cpu"), strict=False)

That should be enough, right? It works, but I'm wondering if I need to change something else to improve pronunciation.

@csukuangfj (Collaborator)

if I need to change something else to improve pronunciation

You can try to enable bert support.

@csukuangfj (Collaborator)

for quick testing, I used the .so files in sherpa-onnx-1.10.30-arm64-v8a-zh_en-tts-vits-melo-tts-zh_en.apk from https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

I hope you understand that support for the MeloTTS English model was added after 1.10.30; you need to use the latest master to test it, not the code or library from 1.10.30.

@nanaghartey

if I need to change something else to improve pronunciation

You can try to enable bert support.

Sure, I'll try that. Thanks.

@nanaghartey

Please adapt our current script. If you have any trouble, please post error logs.

I already tried that above and it didn't work.

Please see

#1509

and please find out why it didn't work for you. @nanaghartey

I noticed some pronunciation differences:

For example, "Google" is pronounced correctly using the original Melo TTS model. However, on the Sherpa ONNX-converted Melo TTS model, each letter is pronounced individually as G-O-O-G-L-E.

@csukuangfj (Collaborator)

I noticed some pronunciation differences:

@nanaghartey

First, we don't use G2P.

Second, all words that can be pronounced are enumerated in lexicon.txt.

Third, in case you have a word that is not in lexicon.txt, please follow the link in our doc
https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#vits-melo-tts-zh-en-chinese-english-1-speaker
to add it by yourself.
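For reference, an entry can be appended to lexicon.txt in the same word / phones / tones layout that generate_lexicon() writes in the script above; the pronunciation used here is an illustrative assumption.

from melo.text import language_tone_start_map
from melo.text.english import refine_syllables

word = "google"
syllables = [["G", "UW1"], ["G", "AH0", "L"]]  # hypothetical ARPAbet pronunciation
phones, tones = refine_syllables(syllables)
tones = [str(t + language_tone_start_map["EN"]) for t in tones]

with open("lexicon.txt", "a", encoding="utf-8") as f:
    f.write(f"{word.lower()} {' '.join(phones)} {' '.join(tones)}\n")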

@nanaghartey

@csukuangfj I'm well aware of this. The challenge is that I don't know which words users will be generating, so I can't add them manually.

@nanaghartey

I noticed some pronunciation differences:

@nanaghartey

First, we don't use G2P.

Second, all words that can be pronounced are enumerated in lexicon.txt.

Third, in case you have a word that is not in lexicon.txt, please follow the link in our doc https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#vits-melo-tts-zh-en-chinese-english-1-speaker to add it by yourself.

So is it possible to add words dynamically at runtime? Or is there a different approach, like Piper, that doesn't require manually adding words to the lexicon?

@csukuangfj (Collaborator)

So is it possible to add words dynamically at runtime?

If you know the pronunciation of the words, then it is possible.
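If the goal is to cover words users type without editing lexicon.txt by hand, one option is to generate pronunciations offline with a G2P tool and append them before loading the model. The sketch below uses g2p_en, which is an external assumption (sherpa-onnx itself does not run G2P, as noted above), and reuses refine_syllables so the phones and tones stay consistent with generate_lexicon().

from g2p_en import G2p
from melo.text import language_tone_start_map
from melo.text.english import refine_syllables

g2p = G2p()

def lexicon_entry(word):
    # Treat the whole G2P output as a single syllable; this is a simplification,
    # but it keeps the phones/tones format identical to generate_lexicon() above.
    raw = [p for p in g2p(word) if p.strip()]
    phones, tones = refine_syllables([raw])
    tones = [str(t + language_tone_start_map["EN"]) for t in tones]
    return f"{word.lower()} {' '.join(phones)} {' '.join(tones)}"

with open("lexicon.txt", "a", encoding="utf-8") as f:
    f.write(lexicon_entry("google") + "\n")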
