feat(rust): Prune row groups before loading all columns #13746
Conversation
Hey @ritchie46, just want to check before putting more effort into this and implementing the feature toggle and testing.
Hi @bchalk101, why was this closed? Didn't it work?
Hey @ritchie46,
Hi. Very interested in this feature - it'd be amazing for some very wide tables.
Yes, it is interesting, but I want to think about this a bit more, as the most common case would slow down: we lose an embarrassingly parallel load to a sequential one with a very low probability of being faster.
I want to give some further insight into our use case, which perhaps can help with the decision. We are using Parquet as the format for saving training data for ML models, which means that each row can be quite large even if some columns are small: for example, large compressed numpy arrays or even JPEG images in a column. There is still small metadata in each row, which is what we use in the filters before actually selecting the rows. While I think we would be a small percentage of users, there is definitely a large number of people using Parquet for ML training data. I would be open to other ways to use this with Polars, potentially a secondary read library suited to such data; the issue is that, without a bunch of changes, it is very difficult to use Polars with such data. Other changes may include applying limits (…). The other option, as I mentioned in the description, is to use a feature flag, but of course this can lead to "flag bloat".
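To make that access pattern concrete, here is a minimal sketch of the kind of query this use case produces, written against the polars Rust lazy API (the `lazy` and `parquet` crate features are assumed, the file and column names are made up, and the exact `scan_parquet` signature varies between polars versions):

```rust
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Lazily scan a wide Parquet file whose rows hold small metadata
    // columns next to heavy payload columns (compressed arrays, JPEG bytes).
    let lf = LazyFrame::scan_parquet("training_data.parquet", ScanArgsParquet::default())?;

    // The predicate only touches the cheap metadata column, while the
    // projection pulls the heavy column. With the proposed pruning, row
    // groups that fail the predicate would never have their heavy columns
    // downloaded at all.
    let df = lf
        .filter(col("label").eq(lit("cat")))
        .select([col("image_bytes")])
        .collect()?;

    println!("matched {} rows", df.height());
    Ok(())
}
```

With the change in this PR, a query like this would first fetch only `label` to decide which row groups survive, and only then fetch `image_bytes` for those groups.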
CodSpeed Performance Report: merging #13746 will not alter performance.
Is this the same optimization as Late Materialization?
Not exactly: it's the same idea, but done at the I/O level. This specifically helps when I/O and memory are the bottlenecks, i.e., wide tables with columns that contain heavy data.
I will close this one as it isn't getting merged anymore. I can see us dynamically adapting to such a strategy in the new streaming engine, but for now it's out of scope.
This addresses #13608.
It addresses the situation where the row-group statistics don't help with filtering out row groups: first, just the columns required to apply the predicate are loaded and non-matching row groups are filtered out; then all the data is loaded from the remaining row groups.
The main downside of this implementation is that in the bad case, where the number of rows for the predicate and the projection is equal, the data is downloaded twice. To get around this, a feature flag could be added, something like `ROW_GROUP_PRUNING`, so that this filtering is only applied when it is turned on. I will note that on the datasets I am currently working on, as specified in the issue, the filtering went from 25 minutes and 32 GB of memory consumption to 25 seconds and negligible memory.
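To clarify the mechanics, here is a minimal sketch of that two-phase read. It is not the PR's actual code: `RowGroup`, `read_columns`, and `matches_predicate` are hypothetical stand-ins for the reader's internals.

```rust
use std::collections::HashSet;

/// Hypothetical handle for one Parquet row group (not a polars type).
struct RowGroup {
    index: usize,
}

/// Hypothetical stand-in: download and decode only `columns` for `group`.
fn read_columns(group: &RowGroup, columns: &[&str]) -> Vec<u8> {
    let _ = (group, columns);
    Vec::new() // placeholder for the decoded bytes
}

/// Hypothetical stand-in: does any row in this decoded chunk pass the filter?
fn matches_predicate(chunk: &[u8]) -> bool {
    let _ = chunk;
    true // placeholder: pretend some row in the group matched
}

fn scan_with_pruning(
    row_groups: &[RowGroup],
    predicate_cols: &[&str],
    projection_cols: &[&str],
) -> Vec<Vec<u8>> {
    // Phase 1: load only the predicate columns and keep the row groups in
    // which at least one row satisfies the predicate.
    let keep: HashSet<usize> = row_groups
        .iter()
        .filter(|rg| matches_predicate(&read_columns(rg, predicate_cols)))
        .map(|rg| rg.index)
        .collect();

    // Phase 2: load the full projection, but only from surviving row groups.
    // In the bad case where nothing is pruned, the predicate columns are
    // effectively downloaded twice; that is the trade-off discussed above.
    row_groups
        .iter()
        .filter(|rg| keep.contains(&rg.index))
        .map(|rg| read_columns(rg, projection_cols))
        .collect()
}

fn main() {
    let groups = vec![RowGroup { index: 0 }, RowGroup { index: 1 }];
    let batches = scan_with_pruning(&groups, &["label"], &["label", "image_bytes"]);
    println!("kept {} row groups", batches.len());
}
```

The trade-off raised earlier in the thread is visible here: the two phases are sequential, and when no row group is pruned, phase 1 is pure overhead.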