Fix parquet predicate filtering with column projection #15113

karthikeyann · 2024-02-21T19:56:30Z

Description

Predicate filtering in the parquet reader did not work when column projection was used. This PR fixes that limitation.

With this change, users can use both column name references and column index references in the filter.

column name reference: the filter may refer to any column by name, even if that column is not present in the column projection.
column reference (index): the indices used should be the indices of the output columns in the requested order.
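
To make the index rule concrete, here is a minimal standalone sketch (not cudf code). `resolve_filter_index` is a hypothetical helper, and the exception type is an assumption chosen for illustration. It only shows that an index in the filter is interpreted against the requested output columns, not against the file schema.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: maps a filter's column index to a column name using the
// projected (requested) columns in the order the user asked for them.
inline std::string const& resolve_filter_index(
  std::vector<std::string> const& projected_columns, std::size_t index)
{
  if (index >= projected_columns.size()) {
    // Assumption: an index outside the projection is reported as out_of_range.
    throw std::out_of_range("filter column index is outside the projected columns");
  }
  return projected_columns[index];
}
```

With a projection of `{"col_double", "col_uint32"}`, index 0 refers to `col_double` regardless of where that column sits in the file.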

To support filter-only columns, the reader extracts the column names from the filter and adds them to the output buffers. After predicate filtering is done, these filter-only columns are removed and only the requested columns are returned.
The change also reads statistics data only for the output columns instead of for all root columns.
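
The flow described above can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not the reader's implementation: `table`, `named_filter`, `filter_only_columns`, and `read_filtered` are hypothetical names, columns hold `int` values, and the filter is evaluated row by row on the host.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using column = std::vector<int>;
using table  = std::map<std::string, column>;  // stand-in for the file's columns
using row    = std::map<std::string, int>;     // one row, keyed by column name
using result = std::vector<std::pair<std::string, column>>;  // keeps requested order

struct named_filter {
  std::vector<std::string> columns;           // names referenced by the filter
  std::function<bool(row const&)> predicate;  // evaluated once per row
};

// Names used by the filter that are not part of the projection.
inline std::vector<std::string> filter_only_columns(
  std::vector<std::string> const& projection, named_filter const& filter)
{
  std::vector<std::string> extra;
  for (auto const& name : filter.columns) {
    bool const requested =
      std::find(projection.begin(), projection.end(), name) != projection.end();
    bool const seen = std::find(extra.begin(), extra.end(), name) != extra.end();
    if (!requested && !seen) { extra.push_back(name); }
  }
  return extra;
}

inline result read_filtered(table const& file,
                            std::vector<std::string> const& projection,
                            named_filter const& filter)
{
  // 1. Read the requested columns plus any filter-only columns.
  auto to_read     = projection;
  auto const extra = filter_only_columns(projection, filter);
  to_read.insert(to_read.end(), extra.begin(), extra.end());

  // Assumes every column in the file has the same number of rows.
  std::size_t const num_rows = file.empty() ? 0 : file.begin()->second.size();

  result out;
  for (auto const& name : projection) { out.emplace_back(name, column{}); }

  for (std::size_t r = 0; r < num_rows; ++r) {
    row current;
    for (auto const& name : to_read) { current[name] = file.at(name)[r]; }
    // 2. Apply the predicate using all columns that were read.
    if (!filter.predicate(current)) { continue; }
    // 3. Emit only the requested columns; filter-only columns are dropped.
    for (auto& entry : out) { entry.second.push_back(current.at(entry.first)); }
  }
  return out;
}
```

Filtering on "A" and "B" while requesting only "C" and "D" then returns a two-column result in the requested order.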

Summary of changes:

get_column_names_in_expression extracts the column names used in the filter.
The extra columns in the filter are added to the output buffers during reader initialization.
- cpp/src/io/parquet/reader_impl_helpers.cpp, cpp/src/io/parquet/reader_impl.cpp
Instead of extracting statistics data for all root columns, the reader extracts it only for the output columns (including the columns in the filter).
- cpp/src/io/parquet/predicate_pushdown.cpp
- To do this, the output column schemas and their dtypes should be cached.
- The statistics data extraction code is updated to check for schema_idx in the row group metadata.
- There is no need to convert the filter again for all root columns; the passed output-column reference filter is reused.
- The rest of the code is the same.
After the output filter predicate is calculated, these filter-only columns are removed.
Moved the named_to_reference_converter constructor to the cpp file and removed the unused constructor.
Small include<> cleanup.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

GregoryKimball · 2024-02-22T04:43:56Z

This is great! Nice work @karthikeyann. Let's please add a step to drop any filter columns that weren't part of the requested column set before returning the table.

wence- · 2024-02-26T17:08:35Z

Thanks. Like @GregoryKimball, I think it would be nice if the column selection were disjoint from the filtering selection logic in terms of determining which columns are finally returned, so that I can filter on (say) "A" and "B", but return "C" and "D".

wence-

A few questions

cpp/src/io/parquet/reader_impl_helpers.hpp

cpp/src/io/parquet/predicate_pushdown.cpp

cpp/include/cudf/io/parquet.hpp

cpp/include/cudf/io/parquet.hpp

vuule

first pass - small suggestions

cpp/include/cudf/io/parquet.hpp

cpp/src/io/parquet/reader_impl_helpers.hpp

cpp/src/io/parquet/reader_impl.cpp

cpp/src/io/parquet/reader_impl_helpers.hpp

cpp/src/io/parquet/reader_impl.hpp

cpp/src/io/parquet/predicate_pushdown.cpp

copy-pr-bot · 2024-04-09T03:07:40Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

karthikeyann · 2024-04-09T03:08:29Z

/ok to test

GregoryKimball · 2024-04-24T19:06:50Z

@etseidl Does this change look like it will work for your application?

etseidl · 2024-04-24T20:00:50Z

> @etseidl Does this change look like it will work for your application?

Yes, I believe so.

karthikeyann · 2024-05-10T07:42:29Z

/ok to test

wence-

Thanks! LGTM

cpp/include/cudf/io/parquet.hpp

wence- · 2024-05-14T08:29:55Z

cpp/src/io/parquet/predicate_pushdown.cpp

+            min.set_index(stats_idx, thrust::nullopt, {});
+            max.set_index(stats_idx, thrust::nullopt, {});


nit (non-blocking): Possibly worthwhile migrating set_index to std::optional.

karthikeyann · 2024-05-14

min_value and min from the Statistics struct use thrust::optional, which is passed here.
https://github.com/rapidsai/cudf/blob/064dd7b02166cc67e882b708d66621bc3fafd70b/cpp/src/io/parquet/parquet.hpp uses thrust::optional everywhere (except in 2 places). Not sure why.
@vuule

etseidl · 2024-05-14

All of the thrift data structures use thrust::optional over std::optional because some are used on device. I assume these will migrate to cuda::std::optional eventually. #15091 (comment)
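
For readers unfamiliar with why missing statistics are stored as an empty optional, here is a host-only sketch of the general idea. It uses std::optional purely for illustration (the discussion above explains why the parquet structures use thrust::optional). `column_stats` and `may_contain_greater_than` are hypothetical names, and this is not how predicate_pushdown.cpp evaluates filters. It only shows that a row group without statistics cannot be skipped.

```cpp
#include <optional>

// Hypothetical per-row-group statistics for one integer column.
struct column_stats {
  std::optional<int> min;  // std::nullopt when the row group has no statistics
  std::optional<int> max;
};

// Returns true if the row group might contain rows with `col > threshold`.
inline bool may_contain_greater_than(column_stats const& stats, int threshold)
{
  // Without a max value nothing can be ruled out, so the row group is kept.
  if (!stats.max.has_value()) { return true; }
  return *stats.max > threshold;
}
```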

cpp/src/io/parquet/reader_impl_helpers.hpp

wence- · 2024-05-14T08:36:48Z

cpp/tests/io/parquet_reader_test.cpp

+      auto read_opts = cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath})
+                         .columns({"col_double", "col_uint32"})
+                         .filter(read_ref_expr);
+      EXPECT_THROW(cudf::io::read_parquet(read_opts), cudf::logic_error);


nit (non-blocking): should these throw (per #12885) cudf::data_type_error and std::out_of_range, respectively?

karthikeyann · 2024-05-14

Nice suggestion. These exceptions are thrown from the AST. I will create another PR to fix this in the AST code.

vuule · 2024-05-14

If there are two reasons to throw, we need two separate tests.

mhaseeb123 · 2024-05-15T07:33:40Z

/ok to test

mhaseeb123 · 2024-05-15T18:33:33Z

cpp/src/io/parquet/reader_impl_helpers.hpp

@@ -127,6 +123,7 @@ class aggregate_reader_metadata {
  int64_t num_rows;
  size_type num_row_groups;

+  std::vector<data_type> _output_types;


Do we need to keep _output_types cached here, or can we clear it after the output has been built? Does extracting output_dtypes, e.g. inside reader::impl::preprocess_file, incur a lot of repetitive overhead?

mhaseeb123

Thanks for the effort. The changes look good. Not a blocker but please confirm if we need to cache output_dtypes at this time or if just constructing it on the fly would be better.

karthikeyann · 2024-05-16T03:35:30Z

> Thanks for the effort. The changes look good. Not a blocker but please confirm if we need to cache output_dtypes at this time or if just constructing it on the fly would be better.

Updated to construct on the fly. This will work with read_chunk() as well.

karthikeyann · 2024-05-16T03:35:48Z

/ok to test

karthikeyann · 2024-05-16T04:31:43Z

/ok to test

karthikeyann · 2024-05-16T15:47:52Z

/merge

fix stats filter conversion dtypes and names

be089f3

karthikeyann added the bug, 2 - In Progress, libcudf, cuIO, and non-breaking labels Feb 21, 2024

wence- reviewed Feb 26, 2024

karthikeyann and others added 4 commits March 1, 2024 14:29

filter columns limitation fixed.

f458410

get_columns() need not specify all columns used in the filter. Columns in the filter are read and then discarded from the output after filtering.

address review comments, added docstring

b01b2d8

Merge branch 'branch-24.04' into fix-pq_filter_col_projection

b348db4

add docstring for filter

4a07e3d

karthikeyann added the 3 - Ready for Review label and removed the 2 - In Progress label Mar 1, 2024

karthikeyann marked this pull request as ready for review March 1, 2024 10:04

karthikeyann requested a review from a team as a code owner March 1, 2024 10:04

karthikeyann requested review from robertmaynard and vuule March 1, 2024 10:04

wence- reviewed Mar 1, 2024

cpp/include/cudf/io/parquet.hpp

karthikeyann and others added 4 commits March 6, 2024 11:34

Merge branch 'branch-24.04' into fix-pq_filter_col_projection

6ee2bcf

update docs with example

acb0723

Merge branch 'fix-pq_filter_col_projection' of github.com:karthikeyan…

bff38f5

…n/cudf into fix-pq_filter_col_projection

Merge branch 'branch-24.04' into fix-pq_filter_col_projection

d643ce1

wence- reviewed Mar 6, 2024

cpp/include/cudf/io/parquet.hpp

GregoryKimball assigned karthikeyann Mar 6, 2024

vuule requested changes Mar 7, 2024

karthikeyann changed the base branch from branch-24.04 to branch-24.06 April 9, 2024 03:07

Merge branch 'branch-24.06' into fix-pq_filter_col_projection

e79552c

karthikeyann and others added 2 commits April 24, 2024 02:35

address review comments, include cleanup, reorg code

e40cffc

Merge branch 'branch-24.06' into fix-pq_filter_col_projection

926a75a

vuule self-requested a review May 8, 2024 00:23

fix col index ref on projection

a220d7d

karthikeyann requested a review from wence- May 10, 2024 07:41

Merge branch 'branch-24.06' into fix-pq_filter_col_projection

c0e734c

mhaseeb123 self-requested a review May 10, 2024 20:02

wence- approved these changes May 14, 2024

Merge branch 'branch-24.06' into fix-pq_filter_col_projection

96ea0e8

vuule approved these changes May 14, 2024

Merge branch 'branch-24.06' into fix-pq_filter_col_projection

47c5413

mhaseeb123 reviewed May 15, 2024

mhaseeb123 approved these changes May 15, 2024

karthikeyann and others added 2 commits May 15, 2024 22:34

remove caching output dtypes

9e4008e

Merge branch 'branch-24.06' into fix-pq_filter_col_projection

cc3bd26

wMerge branch 'fix-pq_filter_col_projection' of github.com:karthikeya…

f64294e

…nn/cudf into fix-pq_filter_col_projection

rapids-bot bot merged commit 47ed345 into rapidsai:branch-24.06 May 16, 2024
70 checks passed

mhaseeb123 mentioned this pull request Jun 24, 2024

Row group selection support in Parquet chunked reader #13913

Closed
