Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parquet definition levels #511

Merged
merged 2 commits into from
Jul 3, 2021
Merged

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Jun 30, 2021

Which issue does this PR close?

Relates to #385, but does not close it.

Rationale for this change

While investigating #385, I noticed that there was a discrepancy between the max definitions calculated in parquet::arrow::levels. and what the parquet type system emits. So I kept digging, and found that my interpretation of the rules had errors.

  • non-null primitive should have def = 0, was misinterpreting the spec
  • list increments 1 if not null, or 2 if null

This fixes these issues, and updates the tests

What changes are included in this PR?

Described above

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 30, 2021
@nevi-me
Copy link
Contributor Author

nevi-me commented Jun 30, 2021

@hohav may you please point your test repro to my branch, write the file, then confirm if parquet tools sees the correct output.

I don't have the Java parquet tools, so it'll be quicker if you do that. Thanks

@codecov-commenter
Copy link

codecov-commenter commented Jun 30, 2021

Codecov Report

Merging #511 (9f7d2f9) into master (f1a831f) will decrease coverage by 0.03%.
The diff coverage is 97.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #511      +/-   ##
==========================================
- Coverage   82.74%   82.71%   -0.04%     
==========================================
  Files         165      166       +1     
  Lines       45686    45827     +141     
==========================================
+ Hits        37805    37905     +100     
- Misses       7881     7922      +41     
Impacted Files Coverage Δ
parquet/src/arrow/levels.rs 83.08% <94.44%> (+0.29%) ⬆️
parquet/src/arrow/arrow_writer.rs 97.98% <100.00%> (-0.07%) ⬇️
arrow/src/array/transform/boolean.rs 76.92% <0.00%> (-7.70%) ⬇️
parquet/src/util/bit_packing.rs 99.96% <0.00%> (-0.01%) ⬇️
arrow/src/datatypes/mod.rs 100.00% <0.00%> (ø)
arrow/src/array/equal/variable_size.rs 100.00% <0.00%> (ø)
arrow-pyarrow-integration-testing/src/lib.rs 0.00% <0.00%> (ø)
arrow/src/datatypes/ffi.rs 72.86% <0.00%> (ø)
parquet/src/schema/types.rs 88.18% <0.00%> (+0.11%) ⬆️
parquet/src/encodings/rle.rs 92.96% <0.00%> (+0.24%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f1a831f...9f7d2f9. Read the comment docs.

parquet/src/arrow/levels.rs Outdated Show resolved Hide resolved
@hohav
Copy link

hohav commented Jun 30, 2021

This branch does remove the assertion panic I initially ran into. parquet cat still crashes on the resulting file in all cases. Output from parquet meta:

  • v1: count 6 / nulls 1
  • v2: count 3 / nulls 1 (unchanged)
  • v3: count 2 / nulls 0 (previously count 2 / nulls 2)

@nevi-me
Copy link
Contributor Author

nevi-me commented Jun 30, 2021

Hi @hohav, I've looked at all 3 versions, and I see that while this PR fixes the definition issue, it doesn't address your specific issue because of the writer bug that I indicated in #385.

I see that you also tested against

arrow = { git = "https://github.com/apache/arrow-rs.git" }
parquet = { git = "https://github.com/apache/arrow-rs.git" }

I might have not been clear enough, I had meant for you to test against

arrow = { git = "https://github.com/nevi-me/arrow-rs.git", branch = "parquet-fix-levels" }
parquet = { git = "https://github.com/nevi-me/arrow-rs.git", branch = "parquet-fix-levels" }

It's not an issue though, as I've cloned your repro repo, and installed parquet-mr tools to check what you're seeing.

The solution is two-fold, this PR, and then computing stats in #512 so that we avoid the column writer computing these stats.

I don't have enough bandwidth to drive #512, would you be able to help with it? I think the main tasks would be:

  • checking that the stats for lists make sense (they should, but we probably need tests)
  • adding stats for fixed-len binary where it makes sense.

It would be a good exercise in digging into the arrow compute code.

CC @alamb @jorgecarleitao for a review.

- non-null primitive should have def = 0, was misinterpreting the spec
- list increments 1 if not null, or 2 if null

This fixes these issues, and updates the tests
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully grok the definition level details so I can't really say if this is correct or not, but I reviewed the tests and skimmed the code and it seems reasonable to me. 👍

Comment on lines 749 to 752
let file = get_temp_file("test_arrow_writer_list_non_null.parquet", &[]);
let mut writer = ArrowWriter::try_new(file, Arc::new(schema), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed other tests like arrow_writer_binary also read data back from parquet and validate the results (and thus confirm data survives the roundtrip).

Would it make sense to have the same test here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb it's because fn roundtrip was added after these tests existed (and I think the reader was lagging behind the writer), so we never expanded it to them. I've now done so, removing quite a bit of duplication.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much nicer 👍

@nevi-me nevi-me merged commit 83ad35c into apache:master Jul 3, 2021
@hohav
Copy link

hohav commented Jul 3, 2021

Hi @nevi-me, thanks again for taking a look at this. I did test against both your branch and master, though it's possible I messed up while juggling so many variants.

I can take a stab at #512 when I find time, but would that also address the crash in parquet cat? If not, do you know what's causing it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants