Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust): Prune row groups before loading all columns #13746

Closed
wants to merge 1 commit into from

Conversation

bchalk101
Copy link
Contributor

@bchalk101 bchalk101 commented Jan 15, 2024

This addresses #13608.

This addresses the situation when the row group statistics don't help with filtering out row groups. By loading just the columns required to apply the predicate and filtering out non-required row groups, followed by loading all the data from the leftover row groups.

The main downside of this implementation is that in the bad case, where the number of rows for the predicate and projection are equal the data is being downloaded twice. To get around this, perhaps a feature flag can be added, something like ROW_GROUP_PRUNING, and only if it is turned on will this filtering be applied.

I will note, that on the datasets I am currently working on, as specified in the issue, the filtering went from 25min and 32 GB memory consumption to 25 seconds and negligible memory.

  • Is it safe? Handles all situations when there is no predicate
  • Feature flagged
  • Tested

@bchalk101
Copy link
Contributor Author

Hey @ritchie46,
Would be great to get your take on this enhancement.
Is this something you will be ok merging in? Is this being implemented in the right direction?

Just want to check before putting more effort into this and implementing the feature toggle and testing.

@bchalk101 bchalk101 changed the title Prune row groups before loading all columns feat(rust): Prune row groups before loading all columns Jan 16, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature rust Related to Rust Polars labels Jan 16, 2024
@bchalk101 bchalk101 force-pushed the optimize_rg_read branch 6 times, most recently from 75e7a1a to 872956f Compare January 21, 2024 08:57
@bchalk101 bchalk101 marked this pull request as ready for review January 21, 2024 08:59
@bchalk101 bchalk101 force-pushed the optimize_rg_read branch 2 times, most recently from 057cdf0 to a8c771b Compare January 23, 2024 08:08
@bchalk101 bchalk101 closed this Jan 23, 2024
@ritchie46
Copy link
Member

Hi @bchalk101 why was this closed? Didn't it work?

@bchalk101
Copy link
Contributor Author

Hey @ritchie46,
No, it does work. I just wasn't sure if it would get merged in and wanted to test out some other optimizations, so I closed it. But happy to re-open it again.

@bchalk101 bchalk101 reopened this Jan 24, 2024
@bchalk101 bchalk101 force-pushed the optimize_rg_read branch 3 times, most recently from cace8b0 to 480a13f Compare January 29, 2024 13:03
@mkleinbort
Copy link

Hi. Very interested in this feature - it'd be amazing for some very wide tables.

@ritchie46
Copy link
Member

Yes, it is interesting, but I want to think about this a bit more. As the most common case would slow down. We loose an embarrassingly parallel load to a sequential one with a very low probability of being faster.

@bchalk101
Copy link
Contributor Author

I want to give some further insight into our use case, which perhaps can help with the decision. We are using Parquets as the format for saving data for training ML models, this means that each row can be quite large, even if some columns are small. For example, large compressed numpy arrays or even jpeg images in a column. There is still small metadata in each row, which is what we use with the filters before actually selecting the rows. While I think we would be a small percentage of users, there is definitely a large number of people using Parquets for ML training data.

I would be open to other ways to use this with Polars, potentially a secondary read library suited to such data. The issue is that it makes it very difficult to use Polars with such data without a bunch of changes. Other changes may include, applying limits (is .limit) on the pruned row groups before collecting more data (which again will slow down the general use case), implementing a generator for loading data or allowing setting parquet size when sinking data into files.

The other option, as I mentioned in the description is to use a feature flag, but of course, this can lead to "flag bloat".

@bchalk101 bchalk101 force-pushed the optimize_rg_read branch 3 times, most recently from f2d0215 to aeff35b Compare March 17, 2024 08:03
@bchalk101 bchalk101 force-pushed the optimize_rg_read branch 2 times, most recently from 2c357ce to 99c566e Compare March 26, 2024 10:36
@bchalk101 bchalk101 requested a review from reswqa as a code owner April 4, 2024 06:56
Copy link

codspeed-hq bot commented Apr 11, 2024

CodSpeed Performance Report

Merging #13746 will not alter performance

Comparing bchalk101:optimize_rg_read (c77cf72) with main (11fe9d8)

Summary

✅ 34 untouched benchmarks

@bchalk101 bchalk101 force-pushed the optimize_rg_read branch from b91ddda to c9b217e Compare May 5, 2024 07:04
@bchalk101 bchalk101 force-pushed the optimize_rg_read branch 3 times, most recently from c04ae0f to a746135 Compare June 3, 2024 07:34
@bchalk101 bchalk101 force-pushed the optimize_rg_read branch 2 times, most recently from 3b8398b to 55e3623 Compare July 4, 2024 11:59
@bchalk101 bchalk101 force-pushed the optimize_rg_read branch 2 times, most recently from e97142e to 2b960f6 Compare July 11, 2024 08:22
@kszlim
Copy link
Contributor

kszlim commented Sep 9, 2024

Is this the same optimization as Late Materialization?

@bchalk101
Copy link
Contributor Author

Not exactly - it's the same idea, but done at the IO level.
The process is as follows:

  1. Download predicate-only columns
  2. Apply predicate to decide if the rest of the required columns should be downloaded
  3. Mark the Row Group as required and use it to apply the slice.
  4. Download the entire Row Group and apply the predicate (At this point late materialisation would help)

This specifically helps when I/O and memory are the blockers, ie wide tables with columns that contain heavy data.

@ritchie46
Copy link
Member

I will close this one as this isn't getting merged anymore. I will see us dynamically adapting to such a strategy in the new streaming engine, but for now it's out of scope.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants