Duplicated revision pairs when bzip2 input is used #1

whym · 2011-08-16T17:45:19Z

Revisions around a page ending can be duplicated in the results when bzip2 input is used.

ottomata · 2014-10-09T18:29:44Z

Hiya! I'm talking to Aaron Halfaker right now! We are thinking about using this again. Is this still an issue? He seems to remember you guys resolving this.

whym · 2014-10-10T14:16:40Z

I believe it is, although there shouldn't be many duplicates. If you change "<=" to "==" in the last assertion in testSplitCompressed(), the test fails (ideally it should pass). Judging from the error I get there, the scale of duplicates looks like this: "expected: 93939, found: 93946".

The problem is in the way bzip2 files can be split: splits must be aligned to bzip2 blocks, and a block might end in the middle of a revision. To avoid losing any revision, I had to make the implementation cover some revisions twice.
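As a toy illustration of that overlap (not the project's actual reader), the sketch below assumes hypothetical byte ranges for four revisions and a single block boundary that falls inside one of them. Each split conservatively emits every revision it touches, so the revision straddling the boundary is emitted by both splits.

```python
def revisions_overlapping(split, revisions):
    """Conservative policy: emit every revision that touches the split at all."""
    start, end = split
    return [
        rev_id
        for rev_id, (rev_start, rev_end) in revisions.items()
        if rev_start < end and rev_end > start
    ]


# Hypothetical half-open byte ranges [start, end) of revisions in the stream.
revisions = {
    "r1": (0, 40),
    "r2": (40, 110),   # straddles the block boundary at offset 100
    "r3": (110, 150),
    "r4": (150, 200),
}

# Hypothetical splits, aligned to a block boundary at offset 100.
splits = [(0, 100), (100, 200)]

emitted = [
    rev_id
    for split in splits
    for rev_id in revisions_overlapping(split, revisions)
]

# Revisions that were emitted by more than one split.
duplicates = sorted({rev_id for rev_id in emitted if emitted.count(rev_id) > 1})
print(emitted)
print(duplicates)
```

No revision is lost under this policy, but the number of emitted records can exceed the true count by up to one per split boundary, which matches the small surplus reported above.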

It might make sense to solve this by adding another Hadoop job to the larger workflow to remove duplicates. (Looking back, I have a very vague memory of discussing a neater solution, but in any case it wasn't implemented in the end.)
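A minimal sketch of what such a de-duplication pass could look like, written in plain Python rather than as a Hadoop job. The record fields (`page_id`, `old_rev`, `new_rev`) are assumptions made for illustration and are not taken from the project; the sort stands in for the shuffle phase, and keeping one record per key stands in for the reducer.

```python
from itertools import groupby


def pair_key(pair):
    # Hypothetical identity of a revision pair: page plus both revision ids.
    return (pair["page_id"], pair["old_rev"], pair["new_rev"])


def dedup_pairs(pairs):
    """Keep exactly one record for each distinct revision pair."""
    keyed = sorted(pairs, key=pair_key)  # stands in for shuffle/sort by key
    return [next(group) for _, group in groupby(keyed, key=pair_key)]


# Hypothetical output of the splitting step, with one pair emitted twice.
pairs = [
    {"page_id": 1, "old_rev": 10, "new_rev": 11},
    {"page_id": 1, "old_rev": 11, "new_rev": 12},
    {"page_id": 1, "old_rev": 11, "new_rev": 12},  # duplicate across a split boundary
    {"page_id": 2, "old_rev": 20, "new_rev": 21},
]

unique_pairs = dedup_pairs(pairs)
print(len(pairs), len(unique_pairs))
```

This only works if a pair's identity can be derived from the record itself; if the output records carry no stable identifiers, the key would have to be built from their full content instead.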

GabrielF00 mentioned this issue May 28, 2013

download link broken #10

Closed
