Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BitReader::get_batch zero extension (#1708) #1722

Merged
merged 2 commits into from
May 23, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented May 22, 2022

Which issue does this PR close?

Closes #1708.

Rationale for this change

unpack32 as used by BitReader::get_batch always decodes to an array of 32, 32-bit integers. If the type being read is a different size it will read to a temporary buffer and then either truncate or zero-extend the data.

An oversight meant that instead of zero-extending, it would just set up to the first 4 bytes, and leave any remaining bytes alone.

In practice this didn't matter as we were only using this for decoding 16-bit levels, however, with #1284 the DeltaBitPackDecoder was changed to also use this method. As this does get used with 64-bit types, buffer reuse could easily lead to garbage data on decode.

What changes are included in this PR?

Fixes BitReader::get_batch to correctly zero-extend, and reduces the use of unsafe in the process.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 22, 2022
@tustvold
Copy link
Contributor Author

Unfortunately this does represent a performance regression, but better slow and correct, than fast and incorrect. I'll have a think about how we might reduce this

arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs                                                                             
                        time:   [57.063 us 57.083 us 57.103 us]
                        change: [+96.098% +96.190% +96.272%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
arrow_array_reader/Int64Array/binary packed, optional, no NULLs                                                                            
                        time:   [69.239 us 69.262 us 69.285 us]
                        change: [+75.438% +75.559% +75.670%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
arrow_array_reader/Int64Array/binary packed, optional, half NULLs                                                                             
                        time:   [49.888 us 49.967 us 50.119 us]
                        change: [+32.282% +32.712% +33.038%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

@tustvold
Copy link
Contributor Author

We also have test failures 👀

@tustvold tustvold marked this pull request as draft May 22, 2022 15:26
@tustvold tustvold marked this pull request as ready for review May 22, 2022 15:38
@codecov-commenter
Copy link

Codecov Report

Merging #1722 (3e3a453) into master (7b164a0) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1722      +/-   ##
==========================================
- Coverage   83.32%   83.31%   -0.01%     
==========================================
  Files         195      195              
  Lines       56044    55986      -58     
==========================================
- Hits        46698    46646      -52     
+ Misses       9346     9340       -6     
Impacted Files Coverage Δ
parquet/src/util/bit_util.rs 94.02% <100.00%> (+0.82%) ⬆️
arrow/src/array/data.rs 83.65% <0.00%> (-0.80%) ⬇️
arrow/src/array/transform/mod.rs 86.74% <0.00%> (-0.06%) ⬇️
arrow/src/compute/util.rs 98.90% <0.00%> (-0.01%) ⬇️
parquet/src/encodings/encoding.rs 93.65% <0.00%> (+0.19%) ⬆️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b164a0...3e3a453. Read the comment docs.

@asayers
Copy link
Contributor

asayers commented May 23, 2022

I tested this branch on the repro from #1708 and it produces the correct results 👍

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for the fix!

@tustvold tustvold merged commit 590f506 into apache:master May 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Roundtrip failure when using DELTA_BINARY_PACKED
4 participants