Materialize predicate columns before projection columns #13608

Open
bchalk101 opened this issue Jan 10, 2024 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@bchalk101
Contributor

Description

Background

I am currently working with a Parquet dataset in which some columns are very large, e.g. large lists or JPEG images in string representation. The dataset has about 80,000 rows, and each row is approximately 3 MB. The dataset is also fully randomized/shuffled and cannot be sorted, so the statistics in each row group do not help, and almost all row groups need to be read to apply even simple filters.

I am trying to filter over this dataset, selecting only a few rows from the entire dataset.

An example filter is:

import polars as pl

df = pl.scan_parquet("s3://some-dataset/*.parquet").filter(pl.col("error_1") > 0.9999).collect(streaming=True)

Issue

This filter takes a long time to execute, when it does not run out of memory first.

From taking a look through the function

fn rg_to_dfs_par_over_rg(

it appears that all columns are materialized first, and only then is the predicate applied.

Suggestion

I'm proposing that this function first load only the predicate columns and apply the predicate, and then load the remaining projection columns only if the resulting data frame is not empty.
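To illustrate the proposed order of operations, here is a minimal pure-Python sketch. It is not the Polars implementation: `RowGroup`, its `load` method, and `read_filtered` are hypothetical stand-ins, and the `loads` list exists only to show which columns would have been decoded for each row group.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Hypothetical in-memory stand-in for a Parquet row group."""
    columns: dict
    loads: list = field(default_factory=list)  # names of columns decoded so far

    def load(self, name):
        # Pretend this is the expensive step (download + decode).
        self.loads.append(name)
        return self.columns[name]


def read_filtered(row_groups, predicate_col, predicate, projection):
    """Load the predicate column first; load other columns only on a match."""
    out = {name: [] for name in projection}
    for rg in row_groups:
        # Phase 1: decode only the predicate column and find matching rows.
        pred_values = rg.load(predicate_col)
        keep = [i for i, v in enumerate(pred_values) if predicate(v)]
        if not keep:
            continue  # no row matched, so the large columns are never loaded
        # Phase 2: decode the remaining projected columns for this row group.
        for name in projection:
            values = pred_values if name == predicate_col else rg.load(name)
            out[name].extend(values[i] for i in keep)
    return out


row_groups = [
    RowGroup({"error_1": [0.1, 0.2], "image": ["a", "b"]}),
    RowGroup({"error_1": [0.99995, 0.3], "image": ["c", "d"]}),
]
result = read_filtered(
    row_groups, "error_1", lambda v: v > 0.9999, ["error_1", "image"]
)
print(result)
print([rg.loads for rg in row_groups])
```

In this sketch the first row group only ever has `error_1` loaded, because none of its rows pass the filter; the `image` column is touched only for the second row group.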

If this is accepted, I would be happy to open a PR, as I have already implemented it for my own use.
(The implementation is currently very rough, brute-forced to prove out the idea.)

@bchalk101 bchalk101 added the enhancement New feature or an improvement of an existing feature label Jan 10, 2024
@coastalwhite
Collaborator

This is done with the parallel=prefiltered setting.

@bchalk101
Contributor Author

@coastalwhite as mentioned here, #13746 (comment), prefiltered doesn't exactly address this issue.

Prefiltered will still "download" the data even if it isn't needed.

@coastalwhite
Collaborator

It shouldn't. If it does, I would consider that a missing part of prefiltered.
