Zach Leigh edited this page May 8, 2016 · 4 revisions

Parsing Kana

MeCab is not good at detecting kana words that are usually written in kanji. For example, when the kanji word 年月日 is passed through MeCab, it is correctly parsed as a single word.
Raw MeCab output:

年月日
年月日  名詞,一般,*,*,*,*,年月日,ネンガッピ,ネンガッピ

However, if the same word is passed to MeCab in kana, MeCab incorrectly splits it into three separate tokens.
Raw MeCab output:

ねんがっぴ
ねん    助詞,終助詞,*,*,*,*,ねん,ネン,ネン
がっ    動詞,接尾,*,*,五段・ラ行,連用タ接続,がる,ガッ,ガッ
ぴ      名詞,一般,*,*,*,*,*
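
The split can also be checked programmatically. The following is a minimal sketch, assuming the raw output has the shape shown above: one token per line, with the surface form separated from the comma-separated features by whitespace. `mecabTokens()` is a hypothetical helper written for this illustration. It is not part of MeCab or Limelight.

```php
<?php
// Hypothetical helper (not part of MeCab or Limelight): turn raw MeCab
// output into a list of ['surface' => ..., 'features' => [...]] entries.
function mecabTokens(string $raw): array
{
    $tokens = [];
    foreach (preg_split('/\R/u', trim($raw)) as $line) {
        $line = trim($line);
        if ($line === '' || $line === 'EOS') {
            continue; // skip blank lines and MeCab's end-of-sentence marker
        }
        // Surface and features are separated by whitespace.
        $parts = preg_split('/\s+/u', $line, 2);
        if (count($parts) < 2) {
            continue; // assumption: a line with no features is the echoed input
        }
        $tokens[] = [
            'surface'  => $parts[0],
            'features' => explode(',', $parts[1]),
        ];
    }
    return $tokens;
}

$kanji = <<<'TXT'
年月日
年月日  名詞,一般,*,*,*,*,年月日,ネンガッピ,ネンガッピ
TXT;

$kana = <<<'TXT'
ねんがっぴ
ねん    助詞,終助詞,*,*,*,*,ねん,ネン,ネン
がっ    動詞,接尾,*,*,五段・ラ行,連用タ接続,がる,ガッ,ガッ
ぴ      名詞,一般,*,*,*,*,*
TXT;

echo count(mecabTokens($kanji)), "\n"; // 1
echo count(mecabTokens($kana)), "\n";  // 3
echo implode(' | ', array_column(mecabTokens($kana), 'surface')), "\n"; // ねん | がっ | ぴ
```

The kanji input yields one token, while the kana input yields three, none of which corresponds to the intended word.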

This mis-segmentation causes problems for the plugins, notably the romaji plugin. To get around it, the Limelight object provides a noParse() method. noParse() runs the raw input through the plugins without first sending it through MeCab.

$results = $limelight->noParse('かな');

The method returns an instance of LimelightResults, just like the parse() method.

Let's again use 年月日 to illustrate the differences between parse() and noParse(). Here we use the parse() method with the proper kanji word:

$results = $limelight->parse('年月日');

echo $results->string('romaji');

Output:
nengappi

No problems there.

Now, with the kana version of the word using the parse() method:

$results = $limelight->parse('ねんがっぴ');

echo $results->string('romaji');

Output:
nenga

Because of the way MeCab handles the unknown kana word, the complete romaji cannot be resolved properly. However, if we pass the kana to the noParse() method, this problem is avoided:

$results = $limelight->noParse('ねんがっぴ');

echo $results->string('romaji');

Output:
nengappi

If kanji is passed to noParse(), it will fail and throw an InvalidInputException.
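
One way to avoid that exception is to check the input for kanji before choosing a method. The following is a minimal sketch using only PHP's built-in PCRE support. `containsKanji()` and `limelightMethodFor()` are hypothetical helpers written for this illustration and are not part of Limelight. The sketch assumes that kanji is the only input noParse() rejects, since how it treats other non-kana input is not covered here.

```php
<?php
// Hypothetical helper (not part of Limelight): true if $text contains
// at least one Han (kanji) character.
function containsKanji(string $text): bool
{
    return preg_match('/\p{Han}/u', $text) === 1;
}

// Hypothetical helper: name of the Limelight method to call for $text.
// Kanji input goes to parse(); everything else goes to noParse().
function limelightMethodFor(string $text): string
{
    return containsKanji($text) ? 'parse' : 'noParse';
}

foreach (['年月日', 'ねんがっぴ'] as $word) {
    echo $word, ' => ', limelightMethodFor($word), "()\n";
}
```

With an existing $limelight object, the result could be used as `$results = $limelight->{limelightMethodFor($word)}($word);`. This routing suits single words. Longer kana text still needs MeCab to find word boundaries, so sending it to noParse() would skip that step.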
