fix: Properly project unordered column in parquet prefiltered #20189

coastalwhite · 2024-12-06T10:30:08Z

Related to #20175.

nameexhaustion · 2024-12-06T11:00:04Z

py-polars/tests/unit/io/test_parquet.py

@@ -2588,4 +2588,28 @@ def test_utf8_verification_with_slice_20174() -> None:
    )

    f.seek(0)
-    pl.scan_parquet(f).head(1).collect()


coastalwhite · 2024-12-06T11:05:53Z

@nameexhaustion any chance you can take this PR from here? You are way more knowledgeable about the hive partitioning code than me.

nameexhaustion · 2024-12-06T11:06:27Z

Will check

nameexhaustion · 2024-12-06T11:28:13Z

crates/polars-plan/src/plans/optimizer/projection_pushdown/mod.rs

@@ -436,7 +436,7 @@ impl ProjectionPushDown {
                            &acc_projections,
                            expr_arena,
                            &file_info.schema,
-                            scan_type.sort_projection(&file_options) || hive_parts.is_some(),


revert sorting projections in the projection pushdown optimizer from 29373d1

Maybe we should just project unsorted columns in the optimizer. That makes IO code a lot simpler.

I have changed this for Parquet. I've left the other scans as they may rely on sorted projections.
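
To illustrate the difference between sorted and unsorted projections discussed above, here is a minimal sketch in plain Python. It is not the Polars optimizer code: `projected_columns` and its `sort_projection` parameter are hypothetical names chosen for this example, loosely mirroring the boolean argument visible in the diff.

```python
# Hypothetical sketch; not the Polars optimizer code.
def projected_columns(requested, file_schema, sort_projection):
    """Return the column order a scan would be asked to produce."""
    if not sort_projection:
        # Unsorted: keep the order in which the query asked for the columns.
        return list(requested)
    # Sorted: reorder the requested columns to follow the file schema.
    position = {name: i for i, name in enumerate(file_schema)}
    return sorted(requested, key=lambda name: position[name])


file_schema = ["a", "b", "c"]
requested = ["c", "a"]  # e.g. a query that selects "c" before "a"

sorted_order = projected_columns(requested, file_schema, sort_projection=True)
unsorted_order = projected_columns(requested, file_schema, sort_projection=False)
print(sorted_order)
print(unsorted_order)
```

With unsorted projections, the scan receives the columns in query order, so the reader does not need its own logic to map between the two orders.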

codecov · 2024-12-06T12:35:34Z

Codecov Report

Attention: Patch coverage is 34.95146% with 67 lines in your changes missing coverage. Please review.

Project coverage is 79.61%. Comparing base (36b1244) to head (43a6457).
Report is 23 commits behind head on main.

Files with missing lines	Patch %	Lines
...m/src/nodes/io_sources/parquet/row_group_decode.rs	0.00%	63 Missing ⚠️
crates/polars-schema/src/schema.rs	71.42%	2 Missing ⚠️
...ates/polars-stream/src/physical_plan/lower_expr.rs	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #20189      +/-   ##
==========================================
- Coverage   79.63%   79.61%   -0.02%     
==========================================
  Files        1564     1564              
  Lines      217472   217868     +396     
  Branches     2474     2477       +3     
==========================================
+ Hits       173188   173463     +275     
- Misses      43715    43836     +121     
  Partials      569      569


nameexhaustion · 2024-12-06T14:32:27Z

In summary, it was extremely tricky because we have an optimization where we skip loading hive columns from the actual file, but we still place them in the correct positions as if they had been loaded from the file.

nameexhaustion · 2024-12-06T14:37:22Z

crates/polars-io/src/parquet/read/read_impl.rs

+
+                materialize_hive_partitions(
+                    &mut df,
+                    schema.as_ref(),


For a full-projection scan().collect(), this materializes hive columns that exist in the file at the position given by the file schema. When projections were pushed down, however, the resulting columns are not properly ordered, because the columns in df no longer match schema. This is still fine, because projection pushdown adds a Select {} node on top of the scan to restore the correct order.
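
The behavior described above can be sketched in plain Python. This is a simplified model under stated assumptions, not the actual `materialize_hive_partitions` implementation: the helper names, the list-of-tuples frame, and the insertion rule (place a hive column before the first existing column that comes later in the file schema) are all hypothetical.

```python
# Hypothetical model; the real Polars function works on DataFrames in Rust.
def materialize_hive_columns(columns, hive_values, file_schema, n_rows):
    """Insert constant hive columns into `columns`, a list of (name, values),
    at the position implied by `file_schema`."""
    position = {name: i for i, name in enumerate(file_schema)}
    end = len(file_schema)
    out = list(columns)
    for name, value in hive_values.items():
        idx = position.get(name, end)
        # Insert before the first existing column that is later in the schema.
        insert_at = next(
            (i for i, (other, _) in enumerate(out) if position.get(other, end) > idx),
            len(out),
        )
        out.insert(insert_at, (name, [value] * n_rows))
    return out


def select(columns, names):
    """Reorder columns by name, like a Select node on top of the scan."""
    by_name = dict(columns)
    return [(name, by_name[name]) for name in names]


file_schema = ["a", "year", "b", "c"]

# Full projection: the reader skipped "year"; the hive value fills the gap.
full = materialize_hive_columns(
    [("a", [1, 2]), ("b", [3, 4]), ("c", [5, 6])], {"year": 2024}, file_schema, 2
)
full_names = [name for name, _ in full]

# Pushed-down projection in query order ("c", "a", "year"): the columns in
# the frame no longer follow the schema, so the insert position is off.
projected = materialize_hive_columns(
    [("c", [5, 6]), ("a", [1, 2])], {"year": 2024}, file_schema, 2
)
projected_names = [name for name, _ in projected]

# The Select on top of the scan restores the requested order.
fixed_names = [name for name, _ in select(projected, ["c", "a", "year"])]
print(full_names, projected_names, fixed_names)
```

In this model the projected frame ends up in an order that matches neither the query nor the file schema, which is why the final select is what guarantees a correct result.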

coastalwhite · 2024-12-06T14:52:23Z

I think, in general, no IO source should have to reorder its columns in a projection. It might be better to provide the IO sources with a schema and a bitmap of which columns to load. The complexity of reordering the columns can then be handled by an almost free SELECT immediately afterward.

This will only get more complicated as we expand the hive partitioning support. We should try to separate those problems as much as possible. The efficiency hit would be absolutely minimal. Ideally, I would like to remove materialize_hive_partitions from the polars-io/parquet directory entirely.

Doesn't have to be this PR though.
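
The schema-plus-bitmap idea suggested above could look roughly like the following sketch. It is an illustration only: `read_with_mask` and `select` are hypothetical helpers, and a plain dict stands in for the file, so nothing here reflects an existing Polars API.

```python
# Hypothetical sketch of "schema + bitmap, then a cheap SELECT".
def read_with_mask(file_columns, schema, mask):
    """IO source: load only the masked columns, always in file-schema order."""
    return [(name, file_columns[name]) for name, keep in zip(schema, mask) if keep]


def select(columns, names):
    """Reorder already-loaded columns; no data is copied, only references move."""
    by_name = dict(columns)
    return [(name, by_name[name]) for name in names]


schema = ["a", "b", "c"]
file_columns = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}

wanted = ["c", "a"]  # order requested by the query
mask = [name in wanted for name in schema]  # one flag per schema column

loaded = read_with_mask(file_columns, schema, mask)
result = select(loaded, wanted)
print(result)
```

The reader never needs to know the query order; it only decides which columns to load, and the reordering happens in one place after the scan.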

fix: Properly project unordered column in parquet prefiltered

55050ce

Related to pola-rs#20175.

coastalwhite requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners December 6, 2024 10:30

github-actions bot added the fix, python, and rust labels Dec 6, 2024

nameexhaustion reviewed Dec 6, 2024

fix hive projection

5dfd19a

nameexhaustion reviewed Dec 6, 2024

nameexhaustion marked this pull request as draft December 6, 2024 11:31

nameexhaustion added 5 commits December 7, 2024 00:35

c

8395a80

nit

82756c6

uncomment tests

35f6440

fix new-streaming more

41ae898

fix build

15124be

nameexhaustion reviewed Dec 6, 2024

nameexhaustion added 2 commits December 7, 2024 01:39

c

e2f029e

c

43a6457

nameexhaustion marked this pull request as ready for review December 6, 2024 15:15

ritchie46 approved these changes Dec 7, 2024

ritchie46 merged commit 430bb4d into pola-rs:main Dec 7, 2024
24 of 25 checks passed

coastalwhite deleted the fix/pq-project-unordered-columns branch December 7, 2024 08:15

cmdlineluser mentioned this pull request Dec 7, 2024

scan_parquet filter/select optimisation error #20175

Closed

2 tasks

c-peters added the accepted label Dec 8, 2024

c-peters assigned coastalwhite Dec 8, 2024
