Phonetic recognition #45

DanielSWolf · 2018-10-08T19:01:35Z

Rhubarb Lip Sync uses word-based speech recognition. That works well for English dialog. For non-English dialog, however, phonetic recognition might work better. Rather than trying to extract English words from non-English speech, phonetic recognition extracts phonemes directly.

I'm planning to add a CLI option to switch to phonetic recognition.

This is only a temporary solution. In the long run, I still plan to implement full (word-based) recognition for languages other than English (see #5).

DanielSWolf · 2018-10-08T19:05:49Z

I created a feature branch (feature/phonetic-recognition) to get some feedback. Here are the instructions I wrote on the Thimbleweed Park forum:

The basic idea of my hack is this: Normally, Rhubarb tries to recognize whole words (and phrases). Since Rhubarb only knows English, it has a hard time finding English words and phrases that resemble the Italian dialog. That's why the results are often rather inaccurate.

My hack simply lowers the granularity. Instead of looking for whole English words and phrases, it now looks for English phonemes and syllables. So the underlying language model is still English, but the chances that a given Italian phone is similar to an existing English phone are pretty good. And the chances that a given Italian syllable has a matching English syllable are still not bad.

There is, however, still some fine-tuning to be done. If Rhubarb only worked at the syllable level, there would still be many Italian syllables it couldn't match. As a result, the animation would look wrong in those places. Worse, the mouth could even stop moving for a moment if Rhubarb really couldn't find a suitable match.

The obvious solution would be to work not at the syllable level, but only at the phone level, since most Italian phones also exist in English. The problem here is fluttering. If the voice actor holds a long phone that lies exactly between two known English phones, Rhubarb might first recognize phone A, then phone B, then A again, and so on, even though the speaker is still producing the same sound. As a result, the animated mouth might flutter between several shapes during a single phone, which looks quite bad.

The solution, then, is to blend the two approaches. I've temporarily added an additional (mandatory) command-line argument modelWeight. If you specify a high value (such as 2.0), Rhubarb will try to recognize whole syllables, leading to imprecise or freezing animation. If you specify a low value (such as 0.1), Rhubarb will try to recognize individual phones, leading to fluttering. I found that the value 1.0 seems to work well, balancing the advantages of both approaches. But I didn't try any other values between the two extremes. So maybe something like 0.8 or 1.3 could work even better. Also, I only tried the new approach with a short one-minute dialog containing Italian, Spanish, French, and German. Trying it out on a larger body of recordings may give additional insights.
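The trade-off described above can be illustrated with a toy decoder. The sketch below is not Rhubarb's or PocketSphinx's actual scoring. It assumes a plain Viterbi search over three made-up phones (A, B, C), where a hypothetical `model_weight` scales the model's preference for staying in the same phone against the per-frame acoustic scores. The frame scores, `stay_prob`, and the weights are contrived so that the toy mirrors the behavior described: a low weight flutters on an ambiguous sound, while a high weight skips a short but genuine phone.

```python
import math

PHONES = ("A", "B", "C")  # made-up phone labels for illustration


def frame(favored, margin):
    """Acoustic log-likelihoods where `favored` beats the other phones by `margin`."""
    return {p: (-1.0 if p == favored else -1.0 - margin) for p in PHONES}


def decode(frames, model_weight, stay_prob=0.9):
    """Viterbi search: acoustic score plus model_weight times transition score."""
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (len(PHONES) - 1))

    # best[p] holds (score, path) for the best path ending in phone p.
    best = {p: (frames[0][p], [p]) for p in PHONES}
    for scores in frames[1:]:
        new_best = {}
        for p in PHONES:
            prev_score, prev_path = max(
                (
                    (score + model_weight * (log_stay if q == p else log_switch), path)
                    for q, (score, path) in best.items()
                ),
                key=lambda candidate: candidate[0],
            )
            new_best[p] = (prev_score + scores[p], prev_path + [p])
        best = new_best
    return max(best.values(), key=lambda candidate: candidate[0])[1]


def count_switches(path):
    """Number of times the recognized phone changes between adjacent frames."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


# A long sound that sits between A and B, a short genuine C, then a clear A.
frames = (
    [frame("A" if i % 2 == 0 else "B", 1.0) for i in range(7)]
    + [frame("C", 2.5) for _ in range(3)]
    + [frame("A", 3.0) for _ in range(4)]
)

for weight in (0.1, 1.0, 2.0):
    path = decode(frames, weight)
    print(weight, "".join(path), "switches:", count_switches(path))
```

In this toy, the middle weight holds one phone through the ambiguous stretch and still picks up the short phone C. That says nothing about which value works best on real recordings.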

My plan is to settle for a fixed model weight before the release. Then I'll add a new command-line option to switch between the original, word-based recognition (which looks best for English) and the phonetic recognition (which will hopefully work better for non-English dialog).

Let me know what you think! I'm grateful for any feedback. And if you found a modelWeight value that seems to work better than 1.0, let me know.

Note that this feature branch may change at any time.

morevnaproject · 2018-10-09T04:50:54Z

Thank you very much!

nshmyrev · 2018-10-23T09:49:25Z

You should try something like montreal-forced-aligner; it supports many languages out of the box.

DanielSWolf · 2018-10-26T07:43:58Z

@nshmyrev Thanks for the reference; that project looks interesting.

However, this issue is about phonetic recognition, while Montreal Forced Aligner is about forced alignment and G2P. Am I missing a connection?

DanielSWolf · 2019-01-02T10:35:38Z

I've extracted all the speech recognition logic into an interface called Recognizer, so that recognizers can be selected via CLI. To start, I implemented two recognizers: pocketSphinx is the old English recognizer; phonetic is the new, language-agnostic recognizer.
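As a rough illustration of that structure, here is a minimal Python sketch of an interface with two selectable implementations. Only the interface name Recognizer and the recognizer names pocketSphinx and phonetic come from this thread. The `--recognizer` flag, the class names, the `recognize_phones` method, and its return type are assumptions made for the example, and the sketch does not reflect Rhubarb's actual code.

```python
import argparse
from abc import ABC, abstractmethod
from typing import List, Tuple

# (start seconds, end seconds, phone label); a hypothetical result type.
TimedPhone = Tuple[float, float, str]


class Recognizer(ABC):
    """Common interface: turn an audio file into a timed phone sequence."""

    @abstractmethod
    def recognize_phones(self, audio_path: str) -> List[TimedPhone]:
        ...


class PocketSphinxRecognizer(Recognizer):
    """Stand-in for the word-based English recognizer."""

    def recognize_phones(self, audio_path: str) -> List[TimedPhone]:
        raise NotImplementedError("placeholder: word-based recognition")


class PhoneticRecognizer(Recognizer):
    """Stand-in for the language-agnostic phonetic recognizer."""

    def recognize_phones(self, audio_path: str) -> List[TimedPhone]:
        raise NotImplementedError("placeholder: phonetic recognition")


# Registry keyed by the names used on the command line.
RECOGNIZERS = {
    "pocketSphinx": PocketSphinxRecognizer,
    "phonetic": PhoneticRecognizer,
}


def create_recognizer(argv: List[str]) -> Tuple[Recognizer, str]:
    """Parse arguments and return the selected recognizer and the audio path."""
    parser = argparse.ArgumentParser(description="Recognizer selection sketch")
    # The flag name is an assumption for this example.
    parser.add_argument(
        "--recognizer", choices=sorted(RECOGNIZERS), default="pocketSphinx"
    )
    parser.add_argument("audio_path")
    args = parser.parse_args(argv)
    return RECOGNIZERS[args.recognizer](), args.audio_path


recognizer, audio_path = create_recognizer(["--recognizer", "phonetic", "dialog.wav"])
print(type(recognizer).__name__, audio_path)
```

Keeping the registry separate from the parser means a further recognizer only needs a new class and one dictionary entry.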

DanielSWolf mentioned this issue Oct 8, 2018: Languages #5 (Open)

DanielSWolf mentioned this issue Dec 30, 2018: Phonetic recognition #47 (Merged)

DanielSWolf closed this as completed in #47 Jan 2, 2019

DanielSWolf added the enhancement label Jun 15, 2019
