Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Properly project unordered column in parquet prefiltered #20189

Merged

Conversation

coastalwhite
Copy link
Collaborator

Related to #20175.

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Dec 6, 2024
@@ -2588,4 +2588,28 @@ def test_utf8_verification_with_slice_20174() -> None:
)

f.seek(0)
pl.scan_parquet(f).head(1).collect()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄

@coastalwhite
Copy link
Collaborator Author

@nameexhaustion any chance you can take this PR from here? You are way more knowledgeable about the hive partitioning code than me

@nameexhaustion
Copy link
Collaborator

Will check

@@ -436,7 +436,7 @@ impl ProjectionPushDown {
&acc_projections,
expr_arena,
&file_info.schema,
scan_type.sort_projection(&file_options) || hive_parts.is_some(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert sorting projections in the projection pushdown optimizer from 29373d1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just project unsorted columns in the optimizer. That makes IO code a lot simpler.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed this for Parquet. I've left the other scans as they may rely on sorted projections.

@nameexhaustion nameexhaustion marked this pull request as draft December 6, 2024 11:31
Copy link

codecov bot commented Dec 6, 2024

Codecov Report

Attention: Patch coverage is 34.95146% with 67 lines in your changes missing coverage. Please review.

Project coverage is 79.61%. Comparing base (36b1244) to head (43a6457).
Report is 23 commits behind head on main.

Files with missing lines Patch % Lines
...m/src/nodes/io_sources/parquet/row_group_decode.rs 0.00% 63 Missing ⚠️
crates/polars-schema/src/schema.rs 71.42% 2 Missing ⚠️
...ates/polars-stream/src/physical_plan/lower_expr.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20189      +/-   ##
==========================================
- Coverage   79.63%   79.61%   -0.02%     
==========================================
  Files        1564     1564              
  Lines      217472   217868     +396     
  Branches     2474     2477       +3     
==========================================
+ Hits       173188   173463     +275     
- Misses      43715    43836     +121     
  Partials      569      569              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nameexhaustion
Copy link
Collaborator

In summary it was extremely tricky because we have an optimization where we skip loading hive columns from the actual file, but we still load them in the correct positions as if they were loaded from the file.


materialize_hive_partitions(
&mut df,
schema.as_ref(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a full-projection scan().collect() this materializes hive columns that exist in the file in the position based on the file schema.But in the case where projections were pushed down the resulting columns are actually not properly ordered because the columns in df no longer match schema - this is still fine because we add a Select {} node on top of the scan in projection pushdown to get the correct order.

@coastalwhite
Copy link
Collaborator Author

coastalwhite commented Dec 6, 2024

I think just in general. No IO source should have to reorder its columns in a projection. It might be better to provide the IO sources with a schema and a bitmap on which columns to load. The complexity of reordering the columns can then be handled by almost free SELECT immediately afterward.

This will only get more complicated as we expand the hive partitioning support. We should try to divide those problems as much as possible. The efficiency hit would be absolute minimal. Ideally, I would like to remove the materialize_hive_partitions from the polars-io/parquet directory entirely.

Doesn't have to be this PR though.

@nameexhaustion nameexhaustion marked this pull request as ready for review December 6, 2024 15:15
@ritchie46 ritchie46 merged commit 430bb4d into pola-rs:main Dec 7, 2024
24 of 25 checks passed
@coastalwhite coastalwhite deleted the fix/pq-project-unordered-columns branch December 7, 2024 08:15
@c-peters c-peters added the accepted Ready for implementation label Dec 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants