-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support reading byte stream split encoded floats and doubles in parquet #17099
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Looks great. I have left some comments.
Thanks for the review, I've addressed those comments and also added a const for the max element size. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17099 +/- ##
==========================================
+ Coverage 80.86% 80.89% +0.03%
==========================================
Files 1456 1458 +2
Lines 191141 191484 +343
Branches 2728 2742 +14
==========================================
+ Hits 154562 154902 +340
Misses 36073 36073
- Partials 506 509 +3 ☔ View full report in Codecov by Sentry. |
Alright. Thank you @adamreeve |
Fixes #17042
This PR adds support for reading Parquet files with float or double values that are encoded with the byte stream split encoding. I started out using the changes in jorgecarleitao/parquet2#221, but refactored the way the decoder worked to avoid needing to change the State enums for the basic and nested PrimitiveDecoder structs too much (I think these would have at least needed extra type parameters for the ParqueNativeType and converter function).
Version 2.11 of the Parquet format extended the byte stream split encoding to be valid for integer types and fixed length byte arrays, motivated by the addition of the float16 logical type which uses a 2-byte fixed length byte array physical type. But it looks like Polars doesn't support reading float16 data so I think for now it's fine to only handle floats and doubles.
This code doesn't handle when
is_filtered
is true inbuild_state
, but I couldn't find any code path where that is used by Polars, andis_filtered == true
also seems to be unimplemented for dictionary encoded data, so I assume this isn't necessary?