Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

matrix right/left dimension checking is inconsistent (compiling user dictionary/assigning user dict costs) #42

Open
wareya opened this issue Mar 14, 2018 · 3 comments

Comments

@wareya
Copy link

wareya commented Mar 14, 2018

with unidic as downloaded:

$ mecab/bin/mecab-dict-index.exe -m mecab/dic/unidic-csj/model.bin -d mecab/dic/unidic-csj/ -f utf-8 -c utf-8 -a userdict.csv -u user2.csv
dictionary.cpp(183) [cid.left_size() == matrix.left_size() && cid.right_size() == matrix.right_size()] Context ID files(mecab/dic/unidic-csj/\left-id.def or mecab/dic/unidic-csj/\right-id.def may be broken: 7074 8407 8407 7074

after swapping right-id and left-id files:

$ mecab/bin/mecab-dict-index.exe -m mecab/dic/unidic-csj/model.bin -d mecab/dic/unidic-csj/ -f utf-8 -c utf-8 -a userdict.csv -u user2.csv
reading userdict.csv ... dictionary.cpp(212) [lid >= 0 && rid >= 0 && matrix.is_valid(lid, rid)] invalid ids are found lid=7151 rid=6170
@wareya wareya changed the title matrix right/left dimension checking is inconsistent (compiling user dictionary) matrix right/left dimension checking is inconsistent (compiling user dictionary/assigning user dict costs) Mar 14, 2018
@wareya
Copy link
Author

wareya commented Mar 15, 2018

swapping matrix.left_size() and matrix.right_size() in the assertions seems to fix it without breaking left/right id generation

however, building on my toolchain (mingw-w64 32-bit) was difficult, because the linker can't find wmain; I had to change winmain.h like so:

int main_wrapper(int argc, char **argv);

int main(int argc, char ** argv) {
    int argcount;
    LPWSTR commandline = GetCommandLineW();
    LPWSTR * arglist = CommandLineToArgvW(commandline, &argcount);
    
    CommandLine cmd(argcount, arglist);
    return main_wrapper(cmd.argc(), cmd.argv());
}
#define main(argc, argv) main_wrapper(argc, argv)

and remove -municode from configure

@wareya
Copy link
Author

wareya commented Apr 1, 2018

the diff I used to work around this bug is basically:

diff a/mecab/src/dictionary.cpp b/mecab/src/dictionary.cpp
--- a/mecab/src/dictionary.cpp
+++ b/mecab/src/dictionary.cpp
@@ -179,14 +179,14 @@ bool Dictionary::assignUserDictionaryCosts(

   cid.open(left_id_file.c_str(),
            right_id_file.c_str(), &config_iconv);
-  CHECK_DIE(cid.left_size()  == matrix.left_size() &&
-            cid.right_size() == matrix.right_size())
+  CHECK_DIE(cid.left_size()  == matrix.right_size() &&
+            cid.right_size() == matrix.left_size())
       << "Context ID files("
       << left_id_file
       << " or "
       << right_id_file << " may be broken: "
-      << cid.left_size() << " " << matrix.left_size() << " "
-      << cid.right_size() << " " << matrix.right_size();
+      << cid.left_size() << " " << matrix.right_size() << " "
+      << cid.right_size() << " " << matrix.left_size();

   std::ofstream ofs(output);
   CHECK_DIE(ofs) << "permission denied: " << output;
@@ -352,8 +352,8 @@ bool Dictionary::compile(const Param &param,
           cid.reset(new ContextID);
           cid->open(left_id_file.c_str(),
                     right_id_file.c_str(), &config_iconv);
-          CHECK_DIE(cid->left_size()  == matrix.left_size() &&
-                    cid->right_size() == matrix.right_size())
+          CHECK_DIE(cid->left_size()  == matrix.right_size() &&
+                    cid->right_size() == matrix.left_size())
               << "Context ID files("
               << left_id_file
               << " or "

@wareya
Copy link
Author

wareya commented Dec 18, 2018

Connector::open() checks for the "left" count at 0x0 into matrix.bin, and for the "right" count at 0x2, but the numbers and filesizes in unidic 2.3.0 are consistent with the "right" count coming first. (i.e. the first short in matrix.bin is larger than the second, but right-id.def is larger than and has more lines than left-id.def).

What's at fault here? The ContextID files do not appear to be named incorrectly (see the opening post of this issue - swapping their names results in an error in cost recalculation), and again, everything works just fine when a user dictionary is not being used.

lsize_ = static_cast<unsigned short>((*cmmap_)[0]);

explorer_2018-12-18_05-33-08

explorer_2018-12-18_05-33-25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants
@wareya and others