
Missing data in indexed model on large data set; yields much lower counts than unindexed model on the same data with the same parameters! #41

Closed
proycon opened this issue Nov 28, 2018 · 4 comments

proycon (Owner) commented Nov 28, 2018

As reported by Pavel Vondřička, something fishy is going on in the computation of an indexed model on a large dataset (8.5GB compressed):

Indexed:

$ colibri-patternmodeller -l 1 -t 1 -f gigacorpus.colibri.dat                                                        
Loading corpus data...
Training model on  gigacorpus.colibri.dat
Training patternmodel, occurrence threshold: 1
Counting *all* n-grams (occurrence threshold=1)
 Found 2562104 ngrams... computing total word types prior to pruning...2562104...pruned 0...total kept: 2562104
Sorting all indices...

Unindexed (these counts are the correct ones):

$ colibri-patternmodeller -u -l 1 -t 1 -f gigacorpus.colibri.dat
Training unindexed model on  gigacorpus.colibri.dat
Training patternmodel, occurrence threshold: 1
Counting *all* n-grams (occurrence threshold=1)
 Found 11459477 ngrams... computing total word types prior to pruning...11459477...pruned 0...total kept: 11459477

The encoded corpus file has been verified to be fine (i.e. it decodes properly):

yes, I tried decoding the corpus back and it had a different size, but the whole content was there; it seems that just some (white)spaces got lost, which is understandable. Anyway, the corpus wasn't clipped.

I did some tests and the problem does NOT reproduce on a small text (counts are equal there as expected), which also explains why it isn't caught by our automated tests. So the cause is not yet clear and further debugging is needed.

proycon (Owner) commented Dec 3, 2018

I enabled debug mode on the test data; the indexed model finishes prematurely (but without error) after processing line 18995401:

   (container ready)
        5 ngrams in line
                Adding @18995401:0 n=1 category=1
                Adding @18995401:1 n=1 category=1
                Adding @18995401:2 n=1 category=1
                Adding @18995401:3 n=1 category=1
                Adding @18995401:4 n=1 category=1
 Found 2562104 ngrams... computing total word types prior to pruning...2562104...pruned 0...total kept: 2562104
Sorting all indices...
Writing model to tst
Generating desired views..

The unindexed model continues to at least line 122777961 (and then the hard disk is full ;) )

proycon added a commit that referenced this issue Dec 3, 2018
proycon (Owner) commented Dec 3, 2018

The problem is in the IndexedCorpus model loaded as the reverse index; it seems to be clipped at exactly that sentence count. Investigating further...

proycon added a commit that referenced this issue Dec 3, 2018
proycon added a commit that referenced this issue Dec 3, 2018
proycon added a commit that referenced this issue Dec 5, 2018
proycon (Owner) commented Dec 5, 2018

Found the problem: unsigned int is too small to store the corpus size.

proycon added a commit that referenced this issue Dec 5, 2018
…inter can be one corpus), using size_t instead #41
proycon (Owner) commented Dec 5, 2018

Fixed and released in v2.4.10
