Getting audio with contents unrelated to the text input #17

jry-king · 2023-02-10T03:18:13Z

I tried to train a model a bit larger than the nano setting with the config
--decoder-dim 192 --nhead 6 --num-decoder-layers 6 --max-duration 60
and then synthesized audio with the command
python3 bin/infer.py --decoder-dim 192 --nhead 6 --num-decoder-layers 6 --model-name valle \
--text-prompts "Go to her." --audio-prompts ./prompts/61_70970_000007_000001.wav \
--text "To get up and running quickly just follow the steps below." \
--output-dir infer/demo_valle_step220000 --checkpoint exp/valle_nano_full/checkpoint-220000.pt
I've tried many times, but the output is either silent audio or speech that has nothing to do with the text, like:
0.webm
Training already occupies almost all the VRAM of a single V100. Is it because the model is still too small, or are there other possible reasons? Thanks!

Answered by MisakaMikoto96

Feb 14, 2023

MisakaMikoto96 · 2023-02-14T09:33:16Z

Yes, the model is too small to train an AR model with good performance: the AR output does not have a stable duration, and it is difficult to get intelligible speech.
I tried it on LibriTTS (small) and got similar results: some intelligibility on short sentences and on the first few words of a long sentence, and even that was stochastic from run to run.
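
To put "too small" in rough numbers, here is a back-of-the-envelope sketch, not taken from the repository's code. It estimates the weight count of a transformer stack from `--decoder-dim` and `--num-decoder-layers`, assuming a standard layer with a feed-forward width of 4x the model dimension (an assumption; the actual implementation may differ) and ignoring biases, norms, and embeddings. The function name and the 1024-dimensional, 12-layer comparison point are hypothetical, chosen only for illustration.

```python
def transformer_param_estimate(d_model: int, num_layers: int, ffn_mult: int = 4) -> int:
    """Rough weight count for a transformer stack.

    Assumes standard layers; biases, layer norms, and embeddings are ignored.
    ffn_mult=4 is an assumption, not a value read from the repository.
    """
    attention = 4 * d_model * d_model                   # Q, K, V, and output projections
    feed_forward = 2 * d_model * (ffn_mult * d_model)   # two linear layers
    return num_layers * (attention + feed_forward)


# Configuration from the question: --decoder-dim 192 --num-decoder-layers 6
small = transformer_param_estimate(192, 6)

# Hypothetical larger configuration, for comparison only
larger = transformer_param_estimate(1024, 12)

print(f"small:  {small:,} weights")
print(f"larger: {larger:,} weights")
print(f"ratio:  {larger / small:.1f}x")
```

Note that the number of heads does not appear in the estimate: in standard multi-head attention the heads split `d_model` between them, so changing `--nhead` alone leaves the weight count unchanged under these assumptions.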

1 reply

agupta54 Feb 23, 2023

How do we fix this? By using more heads?
