
Questions related to MeloTTS #1193

Open
eehoeskrap opened this issue Jul 31, 2024 · 65 comments

@eehoeskrap

Thank you for creating a great repository.
I wonder why there is no bert input when converting a PyTorch MeloTTS model to an ONNX model.
https://github.com/k2-fsa/sherpa-onnx/blob/963aaba82b01a425ae8dcf0fdcff6b073a45686f/scripts/melo-tts/export-onnx.py#L206C1-L235C6

    torch.onnx.export(
        torch_model,
        (
            x,
            x_lengths,
            tones,
            sid,
            noise_scale,
            length_scale,
            noise_scale_w,
        ),
        filename,
        opset_version=opset_version,
        input_names=[
            "x",
            "x_lengths",
            "tones",
            "sid",
            "noise_scale",
            "length_scale",
            "noise_scale_w",
        ],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )
@csukuangfj (Collaborator) commented Jul 31, 2024

Could you tell us how to get the input for bert from texts?

Is there any C++ implementation for that?

@eehoeskrap (Author)

In this code, you can get the bert value through the get_bert function.
get_bert uses a different torch model for each language, and there is only a Python implementation.
https://github.com/myshell-ai/MeloTTS/blob/144a0980fac43411153209cf08a1998e3c161e10/melo/utils.py#L22
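For context, a rough sketch of what a get_bert-style function typically does: run the text through a BERT encoder and repeat each token's hidden state for the phones it maps to. This is only an illustration, not the actual MeloTTS implementation; the model name and the word2ph token-to-phone mapping below are assumptions.

import torch
from transformers import AutoModel, AutoTokenizer

def get_bert_features(text, word2ph, model_name="bert-base-multilingual-cased"):
    # word2ph[i] = number of phones produced by the i-th BERT token (an assumed
    # alignment; MeloTTS computes this per language in its text frontend).
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    assert len(word2ph) == hidden.shape[0]
    # Repeat each token feature for its phones so the result aligns with the phone
    # sequence fed to the acoustic model; transpose to (hidden_dim, num_phones),
    # matching the (D, L) layout of the model's bert input.
    phone_level = torch.cat(
        [hidden[i].unsqueeze(0).expand(n, -1) for i, n in enumerate(word2ph)], dim=0
    )
    return phone_level.T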

@eehoeskrap (Author)

In your code, bert and ja_bert are passed as model inputs in ModelWrapper.

However, even though I specified input_names as below when exporting to the ONNX model, the resulting ONNX file has no bert input.

    torch.onnx.export(
        torch_model,
        (
            x,
            x_lengths,
            sid,
            tones,
            lang_id,
            bert,
            ja_bert,
            sdp_ratio,
            noise_scale,
            noise_scale_w,
            length_scale,
        ),
        filename,
        opset_version=opset_version,
        input_names=[
            "x",
            "x_lengths",
            "sid",
            "tones",
            "lang_id",
            "bert",
            "ja_bert",
            "sdp_ratio",
            "noise_scale",
            "noise_scale_w",
            "length_scale",
        ],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "lang_id": {0: "N", 1: "L"},
            "bert": {0: "N", 1: "L", 2: "D"},
            "ja_bert": {0: "N", 1: "L", 2: "D"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )

@csukuangfj (Collaborator)

Could you tell us how to get the input for bert from texts?

Is there any C++ implementation for that?

Please have a look at this comment. That is the main obstacle. If you can fix it, then we can support bert.

@csukuangfj (Collaborator)

In this code, you can get the bert value through the get_bert function.

Yes, I know that. I am asking whether you know of a C++ implementation for that, or whether it is possible to implement it in C++?

@eehoeskrap (Author) commented Jul 31, 2024

In this code, you can get the bert value through the get_bert function.

Yes, I know that. I am asking whether you know of a C++ implementation for that, or whether it is possible to implement it in C++?

As far as I know, there is currently no C++ implementation of a Korean BERT. I will try it and let you know.

@csukuangfj (Collaborator)

By the way, the main issue is about the tokenizer.

@eehoeskrap (Author)

By the way, the main issue is about the tokenizer.

Yes, I know that.
If you run the ONNX model with the bert value set to 0, as in the code below, the Korean voice sounds awkward.

bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)

@csukuangfj (Collaborator)

If you run the ONNX model with the bert value set to 0, the Korean voice sounds awkward.

In that case, supporting Korean models from MeloTTS in sherpa-onnx may be hard.

Could you try
https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-mimic3-ko_KO-kss_low.tar.bz2

We already have a Korean TTS model in sherpa-onnx.

@eehoeskrap (Author)

If you run the ONNX model with the bert value set to 0, the Korean voice sounds awkward.

In that case, supporting Korean models from MeloTTS in sherpa-onnx may be hard.

Could you try https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-mimic3-ko_KO-kss_low.tar.bz2

We already have a Korean TTS model in sherpa-onnx.

I found this repo while trying to export MeloTTS models to ONNX.
When exporting to ONNX with this code, I was wondering why bert was not included.
Thanks to your answer, I found out that it is because there is no C++ implementation.

I already have a Korean TTS model trained with custom data.
I just succeeded in exporting it to ONNX, including the bert values.
However, the preprocessing (tokenizer, etc.) was run in Python.

The Korean MeloTTS torch model exported to ONNX is quite fast for inference.
However, I need to try a C++ implementation of the preprocessing like you did. I will try this.
That said, Korean phoneme processing is quite difficult.

As you mentioned earlier, the biggest question is indeed "How do we implement the bert torch model in C++?"
First, let's try exporting the bert model to ONNX.

Thank you for the reply.
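As a rough starting point for that export, something like the sketch below might work for a Hugging Face BERT encoder; the model name is a placeholder for whichever Korean BERT is actually used, and the real model may need extra inputs such as token_type_ids.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-multilingual-cased"  # placeholder; use the actual Korean BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()
model.config.return_dict = False  # export a plain tuple instead of a ModelOutput

inputs = tokenizer("안녕하세요", return_tensors="pt")

torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "bert.onnx",
    opset_version=13,
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "N", 1: "T"},
        "attention_mask": {0: "N", 1: "T"},
        "last_hidden_state": {0: "N", 1: "T"},
    },
)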

@csukuangfj (Collaborator)

Currently the iOS version has to process the entire text before synthesizing the audio.

I just added support for passing a callback from Swift to C. Please see #1218

Please play the samples received in the callback yourself, possibly in a separate thread. We don't have time to add that.


Finally, I also noticed the iOS version can't be published to the App Store due to a framework issue.

Please have a look at #1172


By the way, contributions to sherpa-onnx are highly appreciated.

Hope that you can fix the issues by yourself.

@nanaghartey

@csukuangfj No problem. I actually made some contributions, but noticed the latest version fixes most of the issues I found. For example, in sherpa-onnx/jni/jni.cc some reserved words in Java were used, preventing porting of the sample TTS Kotlin code to Java, e.g. Java_com_k2fsa_sherpa_onnx_SpeakerEmbeddingExtractor_new. Now all is good!

By the way, I just checked out MeloTTS, fine-tuned a model and exported it to sherpa-onnx for Android. It's great. How can I help bring this to iOS? I'm not sure the SwiftUI TTS example accepts MeloTTS models.

@csukuangfj (Collaborator)

How can I help bring this to iOS? I'm not sure the SwiftUI TTS example accepts MeloTTS models.

Yes, it is already supported. In case you don't know how to do it, I just added an example for you.
Please see
#1223

@nanaghartey

@csukuangfj I have a single-speaker fine-tuned model (melo). It works great, but when I convert it to sherpa-onnx and then use the provided zh_en *.fst and *.dict on Android, I get wrong synthesis. I assumed it would work since my model is English. How can I generate the *.fst and *.dict files for my custom model? Or can we make it work by changing the configuration?

@csukuangfj (Collaborator)

You don't need *.fst for English-only models.

Could you post the code about how you add the metadata?


I get wrong synthesis.

Could you be more specific? What does wrong mean?

@nanaghartey

@csukuangfj Thanks for the prompt response.

"Wrong" here means unexpected output: wrong pronunciations.

Sorry, but this is how I export (the default export script only exports Chinese+English):

import torch
from melo.api import TTS
from melo.text import language_id_map, language_tone_start_map
from melo.text.chinese import pinyin_to_symbol_map
from melo.text.english import eng_dict, refine_syllables
from pypinyin import Style, lazy_pinyin, phrases_dict, pinyin_dict
from typing import Any, Dict
import json

# Prepare the pinyin to symbol map
for k, v in pinyin_to_symbol_map.items():
    if isinstance(v, list):
        break
    pinyin_to_symbol_map[k] = v.split()

# Function to get initial, final, and tone from pinyin
def get_initial_final_tone(word: str):
    initials = lazy_pinyin(word, neutral_tone_with_five=True, style=Style.INITIALS)
    finals = lazy_pinyin(word, neutral_tone_with_five=True, style=Style.FINALS_TONE3)

    ans_phone = []
    ans_tone = []

    for c, v in zip(initials, finals):
        raw_pinyin = c + v
        v_without_tone = v[:-1]
        try:
            tone = v[-1]
        except:
            return [], []

        pinyin = c + v_without_tone
        if c:
            v_rep_map = {
                "uei": "ui",
                "iou": "iu",
                "uen": "un",
            }
            if v_without_tone in v_rep_map.keys():
                pinyin = c + v_rep_map[v_without_tone]
        else:
            pinyin_rep_map = {
                "ing": "ying",
                "i": "yi",
                "in": "yin",
                "u": "wu",
            }
            if pinyin in pinyin_rep_map.keys():
                pinyin = pinyin_rep_map[pinyin]
            else:
                single_rep_map = {
                    "v": "yu",
                    "e": "e",
                    "i": "y",
                    "u": "w",
                }
                if pinyin[0] in single_rep_map.keys():
                    pinyin = single_rep_map[pinyin[0]] + pinyin[1:]

        if pinyin not in pinyin_to_symbol_map:
            continue
        phone = pinyin_to_symbol_map[pinyin]
        ans_phone += phone
        ans_tone += [tone] * len(phone)

    return ans_phone, ans_tone

# Function to generate tokens file
def generate_tokens(symbol_list):
    with open("tokens.txt", "w", encoding="utf-8") as f:
        for i, s in enumerate(symbol_list):
            f.write(f"{s} {i}\n")

# Function to add new English words to the lexicon
def add_new_english_words(lexicon):
    lexicon["kaldi"] = [["K", "AH0"], ["L", "D", "IH0"]]
    lexicon["SF"] = [["EH1", "S"], ["EH1", "F"]]

# Function to generate lexicon file
def generate_lexicon():
    word_dict = pinyin_dict.pinyin_dict
    phrases = phrases_dict.phrases_dict
    add_new_english_words(eng_dict)
    with open("lexicon.txt", "w", encoding="utf-8") as f:
        for word in eng_dict:
            phones, tones = refine_syllables(eng_dict[word])
            tones = [t + language_tone_start_map["EN"] for t in tones]
            tones = [str(t) for t in tones]

            phones = " ".join(phones)
            tones = " ".join(tones)

            f.write(f"{word.lower()} {phones} {tones}\n")

        for key in word_dict:
            if not (0x4E00 <= key <= 0x9FA5):
                continue
            w = chr(key)
            phone, tone = get_initial_final_tone(w)
            if not phone:
                continue
            phone = " ".join(phone)
            tone = " ".join(tone)
            f.write(f"{w} {phone} {tone}\n")

        for w in phrases:
            phone, tone = get_initial_final_tone(w)
            if not phone:
                continue
            phone = " ".join(phone)
            tone = " ".join(tone)
            f.write(f"{w} {phone} {tone}\n")

# Function to add metadata to ONNX model
def add_meta_data(filename: str, meta_data: Dict[str, Any]):
    import onnx
    model = onnx.load(filename)
    while len(model.metadata_props):
        model.metadata_props.pop()

    for key, value in meta_data.items():
        meta = model.metadata_props.add()
        meta.key = key
        meta.value = str(value)

    onnx.save(model, filename)

# ModelWrapper class definition
class ModelWrapper(torch.nn.Module):
    def __init__(self, model: "SynthesizerTrn"):
        super().__init__()
        self.model = model
        self.lang_id = language_id_map[model.language]

    def forward(
        self,
        x,
        x_lengths,
        tones,
        sid,
        noise_scale,
        length_scale,
        noise_scale_w,
        max_len=None,
    ):
        bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)
        ja_bert = torch.zeros(x.shape[0], 768, x.shape[1], dtype=torch.float32)
        lang_id = torch.zeros_like(x)
        lang_id[:, 1::2] = self.lang_id
        return self.model.model.infer(
            x=x,
            x_lengths=x_lengths,
            sid=sid,
            tone=tones,
            language=lang_id,
            bert=bert,
            ja_bert=ja_bert,
            noise_scale=noise_scale,
            noise_scale_w=noise_scale_w,
            length_scale=length_scale,
        )[0]

# Main function to handle model loading and ONNX export
def main():
    generate_lexicon()  # Generate the lexicon.txt file

    model_path = "model.pth"  # Path to your custom model
    config_path = "config.json"  # Path to your config.json file
    with open(config_path, 'r') as f:
        config = json.load(f)

    model = TTS(language="EN", device="cpu", config_path=config_path, ckpt_path=model_path)
    model.load_state_dict(torch.load(model_path, map_location="cpu"), strict=False)

    generate_tokens(config["symbols"])  # Generate tokens.txt file

    torch_model = ModelWrapper(model)

    x = torch.randint(low=0, high=10, size=(60,), dtype=torch.int64)
    x_lengths = torch.tensor([x.size(0)], dtype=torch.int64)
    sid = torch.tensor([0], dtype=torch.int64)
    tones = torch.zeros_like(x)

    noise_scale = torch.tensor([1.0], dtype=torch.float32)
    length_scale = torch.tensor([1.0], dtype=torch.float32)
    noise_scale_w = torch.tensor([1.0], dtype=torch.float32)

    x = x.unsqueeze(0)
    tones = tones.unsqueeze(0)

    filename = "model.onnx"
    torch.onnx.export(
        torch_model,
        (x, x_lengths, tones, sid, noise_scale, length_scale, noise_scale_w),
        filename,
        opset_version=13,
        input_names=["x", "x_lengths", "tones", "sid", "noise_scale", "length_scale", "noise_scale_w"],
        output_names=["y"],
        dynamic_axes={
            "x": {0: "N", 1: "L"},
            "x_lengths": {0: "N"},
            "tones": {0: "N", 1: "L"},
            "y": {0: "N", 1: "S", 2: "T"},
        },
    )

    meta_data = {
        "model_type": "melo-vits",
        "comment": "melo",
        "version": 2,
        "language": "English",
        "add_blank": int(config["data"]["add_blank"]),
        "n_speakers": config["data"]["n_speakers"],
        "jieba": 1,
        "sample_rate": config["data"]["sampling_rate"],
        "bert_dim": 1024,
        "ja_bert_dim": 768,
        "speaker_id": list(config["data"]["spk2id"].values())[0],
        "lang_id": language_id_map["EN"],
        "tone_start": language_tone_start_map["EN"],
        "url": "https://github.com/myshell-ai/MeloTTS",
        "license": "MIT license",
        "description": "MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai",
    }
    add_meta_data(filename, meta_data)

if __name__ == "__main__":
    main()

Then in api.py I do:

class TTS(nn.Module):
    def __init__(self, 
                 language,
                 device='auto',
                 use_hf=True,
                 config_path=None,
                 ckpt_path=None):
        super().__init__()
        if device == 'auto':
            device = 'cpu'
            if torch.cuda.is_available():
                device = 'cuda'
            if torch.backends.mps.is_available():
                device = 'mps'
        if 'cuda' in device:
            assert torch.cuda.is_available()

        # Load configuration from your custom config_path
        if config_path:
            hps = utils.get_hparams_from_file(config_path)
        else:
            hps = load_or_download_config(language, use_hf=use_hf)

        num_languages = hps.num_languages
        num_tones = hps.num_tones
        symbols = hps.symbols

        model = SynthesizerTrn(
            len(symbols),
            hps.data.filter_length // 2 + 1,
            hps.train.segment_size // hps.data.hop_length,
            n_speakers=hps.data.n_speakers,
            num_tones=num_tones,
            num_languages=num_languages,
            **hps.model,
        ).to(device)

        model.eval()
        self.model = model
        self.symbol_to_id = {s: i for i, s in enumerate(symbols)}
        self.hps = hps
        self.device = device

        # load state_dict
        checkpoint_dict = load_or_download_model(language, device, use_hf=use_hf, ckpt_path=ckpt_path)
        self.model.load_state_dict(checkpoint_dict['model'], strict=True)

        language = language.split('_')[0]
        self.language = 'ZH_MIX_EN' if language == 'ZH' else language

@csukuangfj (Collaborator) commented Aug 7, 2024

"wrong" here means unexpected output. wrong pronunciations.

Could you post some text and the corresponding generated wav?


please also post the logs if you use sherpa-onnx to generate the wav with your model.

@csukuangfj (Collaborator)

https://github.com/csukuangfj/onnxruntime-build/actions/runs/9184634501

You can see from the above link that we can successfully build a debug version of the static lib.

@nanaghartey

"wrong" here means unexpected output. wrong pronunciations.

Could you post some text and the corresponding generated wav?

please also post the logs if you use sherpa-onnx to generate the wav with your model.

custom model 1 : Eng, news (african accent)

text - "things to look out for in the year 2020"

.pth generated wav -

output.mov

onnx generated wav -

generated.mov

Custom model 2: English, singing (US accent)

text - "next time won't you sing with me"

.pth generated wav -

output.mov

onnx generated wav -

generated.mov

I use sherpa-onnx but don't get logs. I was only trying out MeloTTS on sherpa-onnx, so the models were not trained for long (training is not the issue, though).

I hope you're able to spot the issue. Thanks

@nanaghartey

@csukuangfj I can also share my model.pth and config.json files if that'd help.

@csukuangfj (Collaborator)

When you use .pth to test your model, can you zero out the bert part and try again?

@nanaghartey

When you use .pth to test your model, can you zero out the bert part and try again?

The result is still better than the ONNX output when I zero out the bert part.

@csukuangfj (Collaborator)

Could you show the code about how you did that?

@nanaghartey

In api.py, in def tts_to_file(), I did:

    bert = torch.zeros_like(bert).to(device)

Please share your solution if that is wrong.

@csukuangfj (Collaborator)

Could you please post the complete code?

@nanaghartey

Could you please post the complete code?

def tts_to_file(self, text, speaker_id, output_path=None, sdp_ratio=0.2, noise_scale=0.6, noise_scale_w=0.8, speed=1.0, pbar=None, format=None, position=None, quiet=False,):
       language = self.language
       texts = self.split_sentences_into_pieces(text, language, quiet)
       audio_list = []
       if pbar:
           tx = pbar(texts)
       else:
           if position:
               tx = tqdm(texts, position=position)
           elif quiet:
               tx = texts
           else:
               tx = tqdm(texts)
       for t in tx:
           if language in ['EN', 'ZH_MIX_EN']:
               t = re.sub(r'([a-z])([A-Z])', r'\1 \2', t)
           device = self.device
           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           #bert = torch.zeros_like(bert).to(device)
           #ja_bert = torch.zeros_like(ja_bert).to(device)
           with torch.no_grad():
               x_tst = phones.to(device).unsqueeze(0)
               tones = tones.to(device).unsqueeze(0)
               lang_ids = lang_ids.to(device).unsqueeze(0)
               bert = bert.to(device).unsqueeze(0)
               ja_bert = ja_bert.to(device).unsqueeze(0)
               x_tst_lengths = torch.LongTensor([phones.size(0)]).to(device)
               del phones
               speakers = torch.LongTensor([speaker_id]).to(device)
               audio = self.model.infer(
                       x_tst,
                       x_tst_lengths,
                       speakers,
                       tones,
                       lang_ids,
                       bert,
                       ja_bert,
                       sdp_ratio=sdp_ratio,
                       noise_scale=noise_scale,
                       noise_scale_w=noise_scale_w,
                       length_scale=1. / speed,
                   )[0][0, 0].data.cpu().float().numpy()
               del x_tst, tones, lang_ids, bert, ja_bert, x_tst_lengths, speakers
               # 
           audio_list.append(audio)
       torch.cuda.empty_cache()
       audio = self.audio_numpy_concat(audio_list, sr=self.hps.data.sampling_rate, speed=speed)

       if output_path is None:
           return audio
       else:
           if format:
               soundfile.write(output_path, audio, self.hps.data.sampling_rate, format=format)
           else:
               soundfile.write(output_path, audio, self.hps.data.sampling_rate)

@csukuangfj (Collaborator)

In api.py, in def tts_to_file(), I did:

    bert = torch.zeros_like(bert).to(device)

Please share your solution if that is wrong.

Could you change

           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           #bert = torch.zeros_like(bert).to(device)
           #ja_bert = torch.zeros_like(ja_bert).to(device)

to

           bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
           bert.zero_()
           ja_bert.zero_()

@nanaghartey commented Aug 8, 2024

@csukuangfj

    bert, ja_bert, phones, tones, lang_ids = utils.get_text_for_tts_infer(t, language, self.hps, device, self.symbol_to_id)
    bert.zero_()
    ja_bert.zero_()

The result is a generated wav that sounds almost the same as the original .pth inference (without zeroing out), except for a few pronunciations that sound off. However, it's way better than the wavs from ONNX above. Here is the output with bert zeroed out:

eng.mov
output_sing.mov

I then tried:

    bert = torch.zeros(x.shape[0], 1024, x.shape[1], dtype=torch.float32)
    ja_bert = torch.zeros(x.shape[0], 768, x.shape[1], dtype=torch.float32)
    bert.zero_()
    ja_bert.zero_()

in export-onnx.py for the ONNX conversion, but I got the same "wrong" results shared earlier.

@csukuangfj (Collaborator)

Please compare the inputs to the model manually and see if they are the same.
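One way to approach this, assuming the tensors from the export script above are still in scope, is to feed exactly the same inputs to both the PyTorch wrapper and the exported ONNX file and compare the results. Note that the flow-based decoder samples noise internally, so small differences (and different output lengths) are expected; this only catches gross mismatches such as wrong shapes or wrong token ids. A minimal sketch:

import numpy as np
import onnxruntime as ort
import torch

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
y_onnx = sess.run(
    ["y"],
    {
        "x": x.numpy(),
        "x_lengths": x_lengths.numpy(),
        "tones": tones.numpy(),
        "sid": sid.numpy(),
        "noise_scale": noise_scale.numpy(),
        "length_scale": length_scale.numpy(),
        "noise_scale_w": noise_scale_w.numpy(),
    },
)[0]

with torch.no_grad():
    y_torch = torch_model(x, x_lengths, tones, sid, noise_scale, length_scale, noise_scale_w).numpy()

print("onnx:", y_onnx.shape, "torch:", y_torch.shape)
n = min(y_onnx.shape[-1], y_torch.shape[-1])
print("mean abs diff over overlap:", np.mean(np.abs(y_onnx[..., :n] - y_torch[..., :n])))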

@dhc45010 commented Aug 15, 2024

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error; what could be the reason? (The CPU version works.)
python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx
(error screenshot attached)

@csukuangfj (Collaborator)

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

We cannot solve this problem at the moment.

@dhc45010

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

We cannot solve this problem at the moment.

OK, thanks.

@studionexus-lk

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

@csukuangfj (Collaborator)

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

Please see
https://colab.research.google.com/drive/1XsKyAXti1e6_qYiJ3Fiyt8E7d1lPch75?usp=sharing

It is for the Chinese+English MeloTTS model.

@nanaghartey commented Aug 25, 2024

Does anyone have a Google Colab notebook for this, to convert models? I need Japanese TTS voices.

Please see https://colab.research.google.com/drive/1XsKyAXti1e6_qYiJ3Fiyt8E7d1lPch75?usp=sharing

It is for the Chinese+English MeloTTS model.

Is there one for English only? In the future, if there is a way to convert a standard English model from the official training script, can you share it here? Thanks

@csukuangfj (Collaborator)

Sorry, I only have this one.

@csukuangfj (Collaborator)

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

Please use onnxruntime 1.12.0.

Someone in the WeChat group reported that running MeloTTS on GPU with onnxruntime 1.12.0 works without problems.
@dhc45010

@csukuangfj (Collaborator)

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

Please see

#1379

@dhc45010

When I call vits-melo-tts-zh_en with the sherpa-onnx GPU package (1.10.17+cuda) I get an error (the CPU version works): python3 ./python-api-examples/offline-tts-play.py --vits-model=./vits-melo-tts-zh_en/model.onnx

Please see

#1379

@dhc45010

OK, thanks a lot for the reply. I will try it later.

@nanaghartey

@csukuangfj any updates on getting the default MeloTTS models to work?

@csukuangfj (Collaborator)

@csukuangfj any updates on getting the default MeloTTS models to work?

Could you describe the issue you have?
@nanaghartey

@csukuangfj any updates on getting the default MeloTTS models to work?

Could you describe the issue you have?

@nanaghartey

There is support for the Chinese+English MeloTTS model only. If one wants to use MeloTTS, they have to stick to the Chinese+English model.
I'm asking if there are any updates/documentation on converting, e.g., standard English MeloTTS models.

@csukuangfj (Collaborator)

Please adapt our current script. If you have any trouble, please post error logs.

@nanaghartey

Please adapt our current script. If you have any trouble, please post error logs.

I already tried that above and it didn't work.

@csukuangfj (Collaborator)

Please adapt our current script. If you have any trouble, please post error logs.

I already tried that above and it didn't work.

Please see
#1509

and please find out why it didn't work for you. @nanaghartey

@nanaghartey

Please adapt our current script. If you have any trouble, please post error logs.

I already tried that above and it didn't work.

Please see #1509

and please find out why it didn't work for you. @nanaghartey

Thanks for this. I tried it out with the model you shared: https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-melo-tts-en.tar.bz2

I have this:

modelName = "model.onnx";
        dictDir = "model/dict";
        lexicon  = "lexicon.txt";
        dataDir = null;

I use the same dict as for the Chinese+English model since I don't have any other. I get this when I run the app:

Current model is not using jieba but you provided --vits-dict-dir

The app hangs during startup with the logs below:


2024-11-05 18:48:37.139 13801-13801 sherpa-onnx com.k2fsa.sherpa.onnx W  ---vits model---
description=MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai
license=MIT license
url=https://github.com/myshell-ai/MeloTTS
tone_start=7
speaker_id=0
ja_bert_dim=768
version=2
bert_dim=1024
add_blank=1
sample_rate=44100
n_speakers=4
comment=melo
lang_id=2
language=English
jieba=0
model_type=melo-vits
----------input names----------
0 x
1 x_lengths
2 tones
3 sid
4 noise_scale
5 length_scale
6 noise_scale_w
----------output names----------
0 y
2024-11-05 18:48:37.194 13801-13801 sherpa-onnx com.k2fsa.sherpa.onnx W  Current model is not using jieba but you provided --vits-dict-dir
---------------------------- PROCESS ENDED (13801) for package com.k2fsa.sherpa.onnx ----------------------------

The Chinese+English model runs fine with these logs:

---vits model---
description=MeloTTS is a high-quality multi-lingual text-to-speech library by MyShell.ai
license=MIT license
url=https://github.com/myshell-ai/MeloTTS
tone_start=0
speaker_id=1
ja_bert_dim=768
version=2
bert_dim=1024
add_blank=1
sample_rate=44100
n_speakers=1
comment=melo
lang_id=3
language=Chinese + English
jieba=1
model_type=melo-vits
----------input names----------
0 x
1 x_lengths
2 tones
3 sid
4 noise_scale
5 length_scale
6 noise_scale_w
----------output names----------
0 y

@csukuangfj (Collaborator)

Please don't use files not included in the model directory you downloaded.

That is, do not use the dict dir.

@csukuangfj (Collaborator)

Everything you need is included in the model tar.bz2 file.

Please see the comment in #1509 for usage
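For a quick check outside Android, a Python configuration along the lines of python-api-examples/offline-tts.py should work with only the files shipped in vits-melo-tts-en.tar.bz2; treat the exact field names below as assumptions, and note that dict_dir and data_dir are simply left unset.

import sherpa_onnx
import soundfile as sf

config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="vits-melo-tts-en/model.onnx",
            lexicon="vits-melo-tts-en/lexicon.txt",
            tokens="vits-melo-tts-en/tokens.txt",
            # no dict_dir / data_dir for the English MeloTTS model
        ),
        num_threads=2,
    ),
)
tts = sherpa_onnx.OfflineTts(config)
audio = tts.generate("things to look out for in the year 2020", sid=0, speed=1.0)
sf.write("generated.wav", audio.samples, samplerate=audio.sample_rate)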

@nanaghartey

@csukuangfj
I forgot to mention I already tried that:

modelName = "model.onnx";
dictDir = null;
lexicon = "lexicon.txt";
dataDir = null;
String meloDir = copyDataDir(modelDir);
modelDir = meloDir + "/" + modelDir;
assets = null;

The app loads all right, but when I enter text and tap generate I get this:

2024-11-05 21:57:56.546 12497-12615 sherpa-onnx             com.k2fsa.sherpa.onnx                W  string is: hello
2024-11-05 21:57:56.546 12497-12615 sherpa-onnx             com.k2fsa.sherpa.onnx                W  Raw text: hello
2024-11-05 21:57:56.547 12497-12615 libc++abi               com.k2fsa.sherpa.onnx                E  terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found
2024-11-05 21:57:56.547 12497-12615 libc                    com.k2fsa.sherpa.onnx                A  Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 12615 (Thread-2), pid 12497 (fsa.sherpa.onnx)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A  Cmdline: com.k2fsa.sherpa.onnx
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A  pid: 12497, tid: 12615, name: Thread-2  >>> com.k2fsa.sherpa.onnx <<<
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #01 pc 00000000001a8c90  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #02 pc 00000000001a8588  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #03 pc 00000000001a8448  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #04 pc 00000000001c3328  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #05 pc 00000000001c329c  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (__cxa_throw+128) (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #06 pc 00000000001ef8d4  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #07 pc 00000000002e2dcc  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #08 pc 00000000002c16a0  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.830 12619-12619 DEBUG                   pid-12619                            A        #09 pc 00000000001d64f8  /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/lib/arm64/libsherpa-onnx-jni.so (Java_com_k2fsa_sherpa_onnx_OfflineTts_generateWithCallbackImpl+552) (BuildId: 73eb9682daf1bd7954ab5281d845c6771d228f77)
2024-11-05 21:57:56.831 12619-12619 DEBUG                   pid-12619                            A        #16 pc 000000000000390c  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/base.apk] (com.k2fsa.sherpa.onnx.OfflineTts.generateWithCallback+0)
2024-11-05 21:57:56.831 12619-12619 DEBUG                   pid-12619                            A        #21 pc 0000000000003b64  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/base.apk] (com.k2fsa.sherpa.onnx.Tts.generateWithCallback+0)
2024-11-05 21:57:56.831 12619-12619 DEBUG                   pid-12619                            A        #26 pc 00000000000025dc  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/base.apk] (com.k2fsa.sherpa.onnx.MainActivityBatches.lambda$onClickGenerate$6$com-k2fsa-sherpa-onnx-MainActivityBatches+0)
2024-11-05 21:57:56.831 12619-12619 DEBUG                   pid-12619                            A        #31 pc 0000000000001ce4  [anon:dalvik-classes3.dex extracted in memory from /data/app/~~0iYprevsJ7urxYbxDG34mA==/com.k2fsa.sherpa.onnx-qDiQQovqO4De8UthMivj6w==/base.apk] (com.k2fsa.sherpa.onnx.MainActivityBatches$$ExternalSyntheticLambda3.run+0)
---------------------------- PROCESS ENDED (12497) for package com.k2fsa.sherpa.onnx ----------------------------

@csukuangfj (Collaborator)

Are you using the latest master to build the libraries?

How did you get the '.so' files?

@nanaghartey

For quick testing, I used the .so files in sherpa-onnx-1.10.30-arm64-v8a-zh_en-tts-vits-melo-tts-zh_en.apk from https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

I just built the .so files and tested. It works now! Thanks a lot. By the way, in the export-onnx-en script I only changed:

def main():
    generate_lexicon()

    language = "EN"
    model = TTS(language=language, device="cpu")

To

def main():
    generate_lexicon()

    model_path = "model.pth"  # Path to your custom model
    config_path = "config.json"  # Path to your config.json file
    with open(config_path, 'r') as f:
        config = json.load(f)

    model = TTS(language="EN", device="cpu", config_path=config_path, ckpt_path=model_path)
    model.load_state_dict(torch.load(model_path, map_location="cpu"), strict=False)

That should be enough, right? It works, but I'm wondering if I need to change something else to improve pronunciation.

@csukuangfj (Collaborator)

if I need to change something else to improve pronunciation

You can try to enable bert support.

@csukuangfj (Collaborator)

for quick testing, I used the .so files in sherpa-onnx-1.10.30-arm64-v8a-zh_en-tts-vits-melo-tts-zh_en.apk from https://k2-fsa.github.io/sherpa/onnx/tts/apk.html

I hope you understand that support for the MeloTTS English model was added after 1.10.30; you need to use the latest master to test it, not the code or library from 1.10.30.

@nanaghartey

if I need to change something else to improve pronunciation

You can try to enable bert support.

Sure, I'll try that. Thanks.

@nanaghartey

Please adapt our current script. If you have any trouble, please post error logs.

I already tried that above and it didn't work.

Please see

#1509

and please find out why it didn't work for you. @nanaghartey

I noticed some pronunciation differences:

For example, "Google" is pronounced correctly using the original Melo TTS model. However, on the Sherpa ONNX-converted Melo TTS model, each letter is pronounced individually as G-O-O-G-L-E.

@csukuangfj (Collaborator)

I noticed some pronunciation differences:

@nanaghartey

First, we don't use G2P.

Second, all words that can be pronounced are enumerated in lexicon.txt.

Third, in case you have a word that is not in lexicon.txt, please follow the link in our doc
https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#vits-melo-tts-zh-en-chinese-english-1-speaker
to add it by yourself.
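For reference, an entry can be appended to lexicon.txt in the same word / phones / tones layout that generate_lexicon() writes in the script above; the pronunciation used here is an illustrative assumption.

from melo.text import language_tone_start_map
from melo.text.english import refine_syllables

word = "google"
syllables = [["G", "UW1"], ["G", "AH0", "L"]]  # hypothetical ARPAbet pronunciation
phones, tones = refine_syllables(syllables)
tones = [str(t + language_tone_start_map["EN"]) for t in tones]

with open("lexicon.txt", "a", encoding="utf-8") as f:
    f.write(f"{word.lower()} {' '.join(phones)} {' '.join(tones)}\n")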

@nanaghartey

@csukuangfj I'm well aware of this. The challenge is that I don't know which words users will be generating, so I can't add them manually.

@nanaghartey

I noticed some pronunciation differences:

@nanaghartey

First, we don't use G2P.

Second, all words that can be pronounced are enumerated in lexicon.txt.

Third, in case you have a word that is not in lexicon.txt, please follow the link in our doc https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html#vits-melo-tts-zh-en-chinese-english-1-speaker to add it by yourself.

So is it possible to add words dynamically at runtime? Or is there a different approach, like Piper, that doesn't require manually adding words to the lexicon?

@csukuangfj (Collaborator)

So is it possible to add words dynamically at runtime?

If you know the pronunciation of the words, then it is possible.
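If the goal is to cover words users type without editing lexicon.txt by hand, one option is to generate pronunciations offline with a G2P tool and append them before loading the model. The sketch below uses g2p_en, which is an external assumption (sherpa-onnx itself does not run G2P, as noted above), and reuses refine_syllables so the phones and tones stay consistent with generate_lexicon().

from g2p_en import G2p
from melo.text import language_tone_start_map
from melo.text.english import refine_syllables

g2p = G2p()

def lexicon_entry(word):
    # Treat the whole G2P output as a single syllable; this is a simplification,
    # but it keeps the phones/tones format identical to generate_lexicon() above.
    raw = [p for p in g2p(word) if p.strip()]
    phones, tones = refine_syllables([raw])
    tones = [str(t + language_tone_start_map["EN"]) for t in tones]
    return f"{word.lower()} {' '.join(phones)} {' '.join(tones)}"

with open("lexicon.txt", "a", encoding="utf-8") as f:
    f.write(lexicon_entry("google") + "\n")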
