feat: Support reading byte stream split encoded floats and doubles in parquet #17099

adamreeve · 2024-06-20T23:00:17Z

This PR adds support for reading Parquet files with float or double values that are encoded with the byte stream split encoding. I started out using the changes in jorgecarleitao/parquet2#221, but refactored how the decoder works to avoid changing the State enums for the basic and nested PrimitiveDecoder structs too much (I think these would otherwise have needed at least extra type parameters for the ParquetNativeType and converter function).
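For readers unfamiliar with the encoding, here is a minimal sketch of the byte stream split layout for `f32` values. It is not the code from this PR: `encode_f32` and `decode_f32` are hypothetical names used only for illustration. For N values of K bytes each, the encoded data holds K streams of N bytes, and stream j contains byte j (in little-endian order) of every value.

```rust
/// Scatter the little-endian bytes of each value into 4 streams of `n` bytes.
/// (Hypothetical helper, for illustration only.)
fn encode_f32(values: &[f32]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 4];
    for (i, v) in values.iter().enumerate() {
        for (j, b) in v.to_le_bytes().iter().enumerate() {
            // Byte `j` of value `i` goes to position `i` of stream `j`.
            out[j * n + i] = *b;
        }
    }
    out
}

/// Gather byte `j` of value `i` from position `j * n + i`.
/// Returns `None` if the input length is not a multiple of 4.
/// (Hypothetical helper, for illustration only.)
fn decode_f32(encoded: &[u8]) -> Option<Vec<f32>> {
    if encoded.len() % 4 != 0 {
        return None;
    }
    let n = encoded.len() / 4;
    let values = (0..n)
        .map(|i| {
            let mut bytes = [0u8; 4];
            for (j, b) in bytes.iter_mut().enumerate() {
                *b = encoded[j * n + i];
            }
            f32::from_le_bytes(bytes)
        })
        .collect();
    Some(values)
}
```

Round-tripping a slice through `encode_f32` and then `decode_f32` should give back the original values. The encoding does not compress anything by itself; it only groups similar bytes together so that a later compression step tends to work better on floating point data.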

Version 2.11 of the Parquet format extended the byte stream split encoding to be valid for integer types and fixed length byte arrays, motivated by the addition of the float16 logical type, which uses a 2-byte fixed length byte array physical type. However, it looks like Polars doesn't support reading float16 data, so I think it's fine to only handle floats and doubles for now.
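The extension to other types keeps the same layout and only changes the element width. As a rough sketch of what that generalisation could look like (this PR does not implement it, and `decode_fixed_width` is a hypothetical helper), the decoder below takes the width as a parameter and returns the values as contiguous little-endian bytes:

```rust
/// Decode byte-stream-split data whose elements are `width` bytes wide.
/// Returns the values as plain, contiguous bytes, or `None` if the input
/// length is not a multiple of `width`.
/// (Hypothetical helper, for illustration only.)
fn decode_fixed_width(encoded: &[u8], width: usize) -> Option<Vec<u8>> {
    if width == 0 || encoded.len() % width != 0 {
        return None;
    }
    let n = encoded.len() / width;
    let mut plain = vec![0u8; encoded.len()];
    for j in 0..width {
        // Stream `j` holds byte `j` of every value.
        let stream = &encoded[j * n..(j + 1) * n];
        for (i, b) in stream.iter().enumerate() {
            plain[i * width + j] = *b;
        }
    }
    Some(plain)
}
```

With `width = 2` this would cover the 2-byte fixed length byte arrays that back float16, and with `width = 4` or `width = 8` it covers the 32-bit and 64-bit physical types.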

This code doesn't handle the case where is_filtered is true in build_state. I couldn't find any code path in Polars where that is used, and is_filtered == true also seems to be unimplemented for dictionary encoded data, so I assume this isn't necessary?

crates/polars-parquet/src/parquet/encoding/byte_stream_split/decoder.rs

py-polars/tests/unit/io/test_parquet.py

ritchie46

Thanks. Looks great. I have left some comments.

adamreeve · 2024-06-23T09:48:12Z

Thanks for the review, I've addressed those comments and also added a const for the max element size.
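As an illustration of why a maximum element size constant can be useful, here is a sketch that is not taken from this PR's diff. It assumes a limit of 8 bytes (enough for doubles) and uses a hypothetical `SplitIter` type that gathers each value into a fixed-size buffer, so no allocation is needed per value:

```rust
/// Assumed upper bound on the element width: 8 bytes covers f64.
const MAX_ELEMENT_SIZE: usize = 8;

/// Hypothetical iterator-like decoder over byte-stream-split data.
struct SplitIter<'a> {
    encoded: &'a [u8],
    element_size: usize,
    num_elements: usize,
    position: usize,
    buffer: [u8; MAX_ELEMENT_SIZE],
}

impl<'a> SplitIter<'a> {
    /// Validate the element size and input length up front.
    fn try_new(encoded: &'a [u8], element_size: usize) -> Result<Self, String> {
        if element_size == 0 || element_size > MAX_ELEMENT_SIZE {
            return Err(format!("unsupported element size {}", element_size));
        }
        if encoded.len() % element_size != 0 {
            return Err(format!(
                "length {} is not a multiple of element size {}",
                encoded.len(),
                element_size
            ));
        }
        Ok(Self {
            encoded,
            element_size,
            num_elements: encoded.len() / element_size,
            position: 0,
            buffer: [0u8; MAX_ELEMENT_SIZE],
        })
    }

    /// Gather the next value into the internal buffer and return its bytes,
    /// or `None` once all values have been read.
    fn next_bytes(&mut self) -> Option<&[u8]> {
        if self.position >= self.num_elements {
            return None;
        }
        for j in 0..self.element_size {
            self.buffer[j] = self.encoded[j * self.num_elements + self.position];
        }
        self.position += 1;
        Some(&self.buffer[..self.element_size])
    }
}
```

Because the length checks happen once in `try_new`, every index computed in `next_bytes` is in bounds. That is the kind of invariant that would make unchecked indexing sound, although this sketch keeps the checked indexing.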

codecov · 2024-06-23T09:53:01Z

Codecov Report

Attention: Patch coverage is 98.67550% with 2 lines in your changes missing coverage. Please review.

Project coverage is 80.89%. Comparing base (8a6bf4b) to head (c95231c).
Report is 28 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...uet/src/arrow/read/deserialize/primitive/nested.rs | 90.47% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #17099      +/-   ##
==========================================
+ Coverage   80.86%   80.89%   +0.03%     
==========================================
  Files        1456     1458       +2     
  Lines      191141   191484     +343     
  Branches     2728     2742      +14     
==========================================
+ Hits       154562   154902     +340     
  Misses      36073    36073              
- Partials      506      509       +3


ritchie46 · 2024-06-24T13:29:39Z

Alright. Thank you @adamreeve

adamreeve added 6 commits June 21, 2024 10:16

Add test for reading byte stream split encoded floats

61543c5

Implement reading of non-filtered, non-nullable byte-stream-split data

5b81fa0

Support reading nullable data

a1e4aec

Add test for byte stream split encoding with list values

5e6832b

Support reading arrays of byte stream split floats

3a1419c

Add error check

c9339e0

adamreeve requested review from ritchie46, stinodego, c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners June 20, 2024 23:00

github-actions bot added the enhancement, python and rust labels Jun 20, 2024

Fix type checking

154dc79

ritchie46 reviewed Jun 22, 2024

crates/polars-parquet/src/parquet/encoding/byte_stream_split/decoder.rs

ritchie46 reviewed Jun 22, 2024

py-polars/tests/unit/io/test_parquet.py

ritchie46 reviewed Jun 22, 2024

py-polars/tests/unit/io/test_parquet.py

ritchie46 reviewed Jun 22, 2024

adamreeve added 2 commits June 23, 2024 19:41

Mark tests as slow

339c0ce

Use get_unchecked in unsafe block and add MAX_ELEMENT_SIZE const

644355c

Improve test coverage

c95231c

ritchie46 changed the title ~~feat(python,rust): Support reading byte stream split encoded floats and doubles~~ feat: Support reading byte stream split encoded floats and doubles in parquet Jun 24, 2024

ritchie46 merged commit 4731834 into pola-rs:main Jun 24, 2024
27 checks passed

adamreeve deleted the byte_stream_split branch June 24, 2024 21:37
