Materialize predicate columns before projection columns #13608

bchalk101 · 2024-01-10T17:03:40Z

Description

Background

I am currently working with a Parquet dataset in which some columns are very large, i.e. large lists or JPEG images in string representation. The dataset has about 80_000 rows, each approximately 3 MB. The dataset is also fully randomized/shuffled and cannot be sorted, so the statistics in each row group do not help, and almost all row groups need to be read to apply even simple filters.

I am trying to filter over this dataset, selecting only a few rows from the entire dataset.

An example filter is:

import polars as pl

df = pl.scan_parquet("s3://some-dataset/*.parquet").filter(pl.col("error_1") > 0.9999).collect(streaming=True)

Issue

This filter takes quite a bit of time to execute, provided it doesn't run out of memory first.

Looking at the function

polars/crates/polars-io/src/parquet/read_impl.rs

Line 255 in a8bdc76

fn rg_to_dfs_par_over_rg(

it appears that all the columns are materialized first, and only then is the predicate applied.

Suggestion

I propose that this function first load only the predicate columns and apply the predicate, and load all the projection columns only if the resulting data frame is not empty.

If this is accepted, I have implemented this for myself and would be happy to open a PR.
(It's very rough currently, brute forced to prove out the idea.)
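
To make the proposed order of operations concrete, here is a minimal pure-Python sketch. It is not the Polars implementation and not the code from my branch: `RowGroup`, `load`, and `read_predicate_first` are hypothetical names, and plain lists stand in for Parquet column chunks. It only illustrates that the expensive columns are never touched for row groups in which no row passes the predicate.

```python
from typing import Callable, Dict, List


class RowGroup:
    """Hypothetical stand-in for a Parquet row group with lazily loaded columns."""

    def __init__(self, columns: Dict[str, List]):
        self._columns = columns
        self.loaded: List[str] = []  # records which columns were materialized

    def load(self, name: str) -> List:
        # In a real reader this is where download and decoding would happen.
        self.loaded.append(name)
        return self._columns[name]


def read_predicate_first(
    row_groups: List[RowGroup],
    predicate_col: str,
    predicate: Callable[[float], bool],
    projection: List[str],
) -> Dict[str, List]:
    """Load the predicate column first; load the rest only when a row matches."""
    out: Dict[str, List] = {name: [] for name in projection}
    for rg in row_groups:
        values = rg.load(predicate_col)
        mask = [predicate(v) for v in values]
        if not any(mask):
            continue  # no matching rows: skip the large columns entirely
        for name in projection:
            # Reuse the already loaded predicate column instead of reloading it.
            col = values if name == predicate_col else rg.load(name)
            out[name].extend(v for v, keep in zip(col, mask) if keep)
    return out


# Toy data: only the second row group contains a row with error_1 > 0.9999.
row_groups = [
    RowGroup({"error_1": [0.1, 0.2], "image": ["a", "b"]}),
    RowGroup({"error_1": [0.99995, 0.3], "image": ["c", "d"]}),
]
result = read_predicate_first(
    row_groups, "error_1", lambda v: v > 0.9999, ["error_1", "image"]
)
```

In this toy run the `image` column of the first row group is never loaded, which is the saving the suggestion is after when each row is several megabytes.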


coastalwhite · 2024-10-08T10:59:47Z

This is done with the parallel=prefiltered setting.

bchalk101 · 2024-10-09T10:43:00Z

@coastalwhite as mentioned here, #13746 (comment), prefiltered doesn't exactly address this issue.

Prefiltered will still "download" the data even if it isn't needed.

coastalwhite · 2024-10-09T11:54:14Z

It shouldn't. If it does, I would consider that a missing part of prefiltered.


bchalk101 added the enhancement label Jan 10, 2024

bchalk101 mentioned this issue Jan 15, 2024

feat(rust): Prune row groups before loading all columns #13746

Closed

3 tasks

bchalk101 added a commit to bchalk101/polars that referenced this issue Sep 5, 2024

feat(rust): optimize column load of row groups with predicate(pola-rs…

5823b25

…#13608)

coastalwhite closed this as completed Oct 8, 2024

coastalwhite reopened this Oct 9, 2024

bchalk101 added a commit to bchalk101/polars that referenced this issue Nov 12, 2024

feat(rust): optimize column load of row groups with predicate(pola-rs…

cd9f21b

…#13608)
