
Missing data in indexed model on large data set; yields much lower counts than unindexed model on the same data with the same parameters! #41

Closed
proycon opened this issue Nov 28, 2018 · 4 comments

proycon (Owner) commented Nov 28, 2018

As reported by Pavel Vondřička, something fishy is going on in the computation of an indexed model on a large dataset (8.5GB compressed):

Indexed:

$ colibri-patternmodeller -l 1 -t 1 -f gigacorpus.colibri.dat                                                        
Loading corpus data...
Training model on  gigacorpus.colibri.dat
Training patternmodel, occurrence threshold: 1
Counting *all* n-grams (occurrence threshold=1)
 Found 2562104 ngrams... computing total word types prior to pruning...2562104...pruned 0...total kept: 2562104
Sorting all indices...

Unindexed (these counts are the correct ones):

$ colibri-patternmodeller -u -l 1 -t 1 -f gigacorpus.colibri.dat
Training unindexed model on  gigacorpus.colibri.dat
Training patternmodel, occurrence threshold: 1
Counting *all* n-grams (occurrence threshold=1)
 Found 11459477 ngrams... computing total word types prior to pruning...11459477...pruned 0...total kept: 11459477

The encoded corpus file has been verified to be fine (i.e. it decodes properly):

yes, I tried decoding the corpus back and it had a different size, but the whole content was there; it seems that just some (white)spaces got lost, which is understandable. Anyway, the corpus wasn't clipped.

I did some tests and the problem does NOT reproduce on a small text (counts are equal there as expected), which also explains why it isn't caught by our automated tests. So the cause is not yet clear and further debugging is needed.

proycon (Owner) commented Dec 3, 2018

I enabled debug mode on the test data; the indexed model finishes prematurely (but without error) after processing line 18995401:

   (container ready)
        5 ngrams in line
                Adding @18995401:0 n=1 category=1
                Adding @18995401:1 n=1 category=1
                Adding @18995401:2 n=1 category=1
                Adding @18995401:3 n=1 category=1
                Adding @18995401:4 n=1 category=1
 Found 2562104 ngrams... computing total word types prior to pruning...2562104...pruned 0...total kept: 2562104
Sorting all indices...
Writing model to tst
Generating desired views..

The unindexed model continues to at least line 122777961 (and then the hard disk is full ;) )

proycon added a commit that referenced this issue Dec 3, 2018
proycon (Owner) commented Dec 3, 2018

The problem is in the IndexedCorpus model loaded as the reverse index; it seems to be clipped at exactly that sentence count. Investigating further...

proycon added a commit that referenced this issue Dec 3, 2018
proycon added a commit that referenced this issue Dec 3, 2018
proycon added a commit that referenced this issue Dec 5, 2018
proycon (Owner) commented Dec 5, 2018

Found the problem: unsigned int is too small to store the corpus size.

proycon added a commit that referenced this issue Dec 5, 2018
…inter can be one corpus), using size_t instead #41
proycon (Owner) commented Dec 5, 2018

Fixed and released in v2.4.10
