The validation of Dictionary::assignUserDictionaryCosts() is inappropriate #76

Open
CookieBox26 opened this issue Mar 21, 2024 · 0 comments

Problem

When estimating the cost of a user dictionary against a UniDic dictionary, the following validation fails:

CHECK_DIE(cid.left_size() == matrix.left_size() &&
          cid.right_size() == matrix.right_size())
    << "Context ID files("
    << left_id_file
    << " or "
    << right_id_file << " may be broken: "
    << cid.left_size() << " " << matrix.left_size() << " "
    << cid.right_size() << " " << matrix.right_size();

dictionary.cpp(184) [cid.left_size() == matrix.left_size() && cid.right_size() == matrix.right_size()] Context ID files(C:/Program Files/MeCab/dic/unidic-csj-3.1.1-full\left-id.def or C:/Program Files/MeCab/dic/unidic-csj-3.1.1-full\right-id.def may be broken: 18552 15629 20859 15389

Causes and Solutions

The cause is that context_id values are not unique per line in left_id_file (or right_id_file). For instance, left_id_file in unidic-csj-3.1.1-full contains two lines that share the same ID:

7845 名詞,固有名詞,人名,姓,*,*,*,*,固,ツ促,促音形,*,1,*,*
7845 名詞,固有名詞,人名,姓,*,*,*,*,固,ツ促,基本形,*,1,*,*

The check above should therefore compare against the number of unique context_ids, not cid.left_size(), which is the number of lines in left_id_file.
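To illustrate the difference between the line count and the unique ID count, here is a minimal sketch. The helper name `count_unique_context_ids` is hypothetical and not part of MeCab; it assumes each line starts with an integer ID followed by whitespace and the feature string, as in the excerpt above.

```cpp
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not part of MeCab): counts the distinct context IDs
// in a left-id.def / right-id.def style stream. Assumes each line begins
// with an integer ID followed by whitespace and the feature string.
std::size_t count_unique_context_ids(std::istream& in) {
  std::set<int> ids;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    int id = 0;
    if (fields >> id) {  // lines without a leading integer are skipped
      ids.insert(id);
    }
  }
  return ids.size();
}
```

Given a stream holding the two lines quoted above, this would return 1, whereas a line count would report 2.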

The left and right sides also appear to be reversed. I believe the check should ideally read:

  CHECK_DIE(cid.right_context_id_unique_size() == matrix.left_size() &&
            cid.left_context_id_unique_size()  == matrix.right_size())

As a workaround for estimating user dictionary costs, you can rewrite only the first line of matrix.def, run the cost estimation, and then rebuild the user dictionary (as pointed out in https://zenn.dev/zagvym/articles/28056236903369).
However, I believe that fixing the validation itself is the fundamental solution.
