-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Properly project unordered column in parquet prefiltered #20189
fix: Properly project unordered column in parquet prefiltered #20189
Conversation
@@ -2588,4 +2588,28 @@ def test_utf8_verification_with_slice_20174() -> None: | |||
) | |||
|
|||
f.seek(0) | |||
pl.scan_parquet(f).head(1).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄
@nameexhaustion any chance you can take this PR from here? You are way more knowledgeable about the hive partitioning code than me |
Will check |
@@ -436,7 +436,7 @@ impl ProjectionPushDown { | |||
&acc_projections, | |||
expr_arena, | |||
&file_info.schema, | |||
scan_type.sort_projection(&file_options) || hive_parts.is_some(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert sorting projections in the projection pushdown optimizer from 29373d1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should just project unsorted columns in the optimizer. That makes IO code a lot simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed this for Parquet. I've left the other scans as they may rely on sorted projections.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20189 +/- ##
==========================================
- Coverage 79.63% 79.61% -0.02%
==========================================
Files 1564 1564
Lines 217472 217868 +396
Branches 2474 2477 +3
==========================================
+ Hits 173188 173463 +275
- Misses 43715 43836 +121
Partials 569 569 ☔ View full report in Codecov by Sentry. |
In summary it was extremely tricky because we have an optimization where we skip loading hive columns from the actual file, but we still load them in the correct positions as if they were loaded from the file. |
|
||
materialize_hive_partitions( | ||
&mut df, | ||
schema.as_ref(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a full-projection scan().collect()
this materializes hive columns that exist in the file in the position based on the file schema.But in the case where projections were pushed down the resulting columns are actually not properly ordered because the columns in df
no longer match schema
- this is still fine because we add a Select {}
node on top of the scan in projection pushdown to get the correct order.
I think just in general. No IO source should have to reorder its columns in a projection. It might be better to provide the IO sources with a schema and a bitmap on which columns to load. The complexity of reordering the columns can then be handled by almost free This will only get more complicated as we expand the hive partitioning support. We should try to divide those problems as much as possible. The efficiency hit would be absolute minimal. Ideally, I would like to remove the Doesn't have to be this PR though. |
Related to #20175.