Materialize predicate columns before projection columns #13608
Labels
enhancement
New feature or an improvement of an existing feature
Comments
bchalk101 added the enhancement label on Jan 10, 2024
3 tasks
bchalk101 added a commit to bchalk101/polars that referenced this issue on Sep 5, 2024
This is done with the
@coastalwhite as mentioned here, #13746 (comment), prefiltered doesn't exactly address this issue. Prefiltered will still "download" the data even if it isn't needed.
It shouldn't. If so, I would consider that a missing part of prefiltered.
bchalk101 added a commit to bchalk101/polars that referenced this issue on Nov 12, 2024
Description
Background
I am currently working with a Parquet dataset in which some columns are very large, i.e. large lists or JPEG images in string representation. The dataset has about 80_000 rows, and each row is approx. 3 MB. The dataset is also fully randomized/shuffled and cannot be sorted, so the statistics in each row group do not help, and almost all row groups need to be read to apply even simple filters.
I am trying to filter over this dataset, selecting only a few rows from the entire dataset.
An example filter is:
Issue
This filter takes quite a long time to execute, assuming it does not run out of memory first.
From a look through the function at polars/crates/polars-io/src/parquet/read_impl.rs (line 255 in a8bdc76), it appears that all the columns are materialized first, and only once they are materialized is the predicate applied.
Suggestion
I'm proposing that the function first load only the predicate columns and apply the predicate, and load the remaining projection columns only if the resulting data frame is not empty.
I have implemented this for myself and, if the idea is accepted, would be happy to open a PR.
(It's very rough currently, brute forced to prove out the idea.)
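To make the suggestion concrete, here is a minimal sketch of the read order being proposed. It is not the Polars implementation and not the code from my branch. It is a pure-Python illustration in which a row group is modelled as a dict of lazy column loaders. The names `RowGroup`, `make_loader` and `read_row_group` are hypothetical and exist only for this example, and the sketch assumes at least one predicate column is given.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for one Parquet row group: column name -> zero-arg loader.
# Calling a loader models decoding (and paying the I/O cost for) that column.
RowGroup = Dict[str, Callable[[], List[object]]]


def make_loader(name: str, values: List[object], log: List[str]) -> Callable[[], List[object]]:
    """Build a loader that records in `log` each time its column is materialized."""
    def load() -> List[object]:
        log.append(name)
        return list(values)
    return load


def read_row_group(
    row_group: RowGroup,
    predicate_cols: Sequence[str],
    projection_cols: Sequence[str],
    predicate: Callable[[Dict[str, object]], bool],
) -> Dict[str, List[object]]:
    if not predicate_cols:
        raise ValueError("this sketch assumes at least one predicate column")

    # Step 1: materialize only the columns the predicate needs.
    pred_data = {name: row_group[name]() for name in predicate_cols}
    n_rows = len(pred_data[predicate_cols[0]])

    # Step 2: evaluate the predicate row by row and keep the matching indices.
    keep = [
        i for i in range(n_rows)
        if predicate({name: col[i] for name, col in pred_data.items()})
    ]

    # Step 3: nothing matched, so the other (possibly huge) columns are never loaded.
    if not keep:
        return {name: [] for name in projection_cols}

    # Step 4: load the remaining projection columns and keep only the matching rows.
    out: Dict[str, List[object]] = {}
    for name in projection_cols:
        col = pred_data[name] if name in pred_data else row_group[name]()
        out[name] = [col[i] for i in keep]
    return out
```

With a small `label` column and a large `image` column, a filter on `label` that matches nothing only ever calls the `label` loader. The `image` loader runs only for row groups that contain at least one matching row.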