Missing data in indexed model on large data set; yields much lower counts than unindexed model on the same data with the same parameters! #41
I enabled debug mode on the test data; the indexed model finishes prematurely (but without error) after processing line 18995401:
(container ready)
5 ngrams in line
Adding @18995401:0 n=1 category=1
Adding @18995401:1 n=1 category=1
Adding @18995401:2 n=1 category=1
Adding @18995401:3 n=1 category=1
Adding @18995401:4 n=1 category=1
Found 2562104 ngrams... computing total word types prior to pruning...2562104...pruned 0...total kept: 2562104
Sorting all indices...
Writing model to tst
Generating desired views..
The unindexed model continues to at least line 122777961 (and then the hard disk is full ;) )
The problem is in the IndexedCorpus model loaded as a reverse index; it seems to be clipped at exactly that sentence count. Investigating further...
Found the problem,
…inter can be one corpus), using size_t instead #41
Fixed and released in v2.4.10
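For illustration only, here is a minimal sketch of the kind of 32-bit truncation that switching to size_t avoids on a corpus of this size; the variable names and the exact place in IndexedCorpus where the overflow occurred are assumptions, not taken from the actual fix:

```cpp
#include <cstddef>
#include <iostream>

int main() {
    // An encoded corpus of roughly 8.5 GB exceeds the 4 GiB range of a
    // 32-bit unsigned integer, so a byte offset or size stored in an
    // `unsigned int` silently wraps modulo 2^32 and points back into an
    // earlier part of the corpus -- which would make the reverse index
    // appear clipped at a fixed point in the data.
    const std::size_t corpus_bytes = 8500000000ULL; // assumed size, for illustration

    unsigned int narrow_offset = static_cast<unsigned int>(corpus_bytes); // wraps modulo 2^32
    std::size_t  wide_offset   = corpus_bytes;                            // keeps the full value
                                                                          // (64-bit on typical platforms)

    std::cout << "unsigned int offset: " << narrow_offset << "\n"; // 4205032704 (wrong)
    std::cout << "size_t offset:       " << wide_offset   << "\n"; // 8500000000 (correct)
    return 0;
}
```

Under this assumed mechanism, small corpora stay well below the 4 GiB boundary, which would also explain why the discrepancy never showed up in the automated tests.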
As reported by Pavel Vondřička, something fishy is going on in the computation of an indexed model on a large dataset (8.5GB compressed):
Indexed:
Unindexed (these are the correct counts):
The encoded corpus file has been verified to be fine (i.e. it decodes properly):
I did some tests and the problem does NOT reproduce on a small text (counts are equal there as expected), which also explains why it isn't caught by our automated tests. So the cause is not yet clear and further debugging is needed.