Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information #15398

Merged: 16 commits into rapidsai:branch-24.06 on Apr 17, 2024

Conversation

mhaseeb123
Member

@mhaseeb123 mhaseeb123 commented Mar 27, 2024

Description

cudf.io.read_parquet_metadata is now bound to the corresponding libcudf API instead of relying on pyarrow. The libcudf API now also returns high-level RowGroup metadata to address #11214. This PR also adds tests and updates the docs.
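
For readers who want to try the row-group information, here is a minimal sketch of how a caller might consume it. It rests on assumptions that are not confirmed in this thread: that cudf.io.read_parquet_metadata returns five values (row count, row-group count, column names, column count, and a list of per-row-group dicts), and that each dict carries "num_rows" and "total_byte_size" keys. The helper names summarize_row_groups and read_and_summarize are hypothetical. Check the docs for your cuDF version before relying on any of this.

```python
from typing import Dict, List


def summarize_row_groups(row_group_metadata: List[Dict[str, int]]) -> Dict[str, int]:
    """Total up per-row-group metadata.

    Assumes each entry is a dict with "num_rows" and "total_byte_size"
    keys. That key layout is an assumption, not something stated in the PR.
    """
    return {
        "num_row_groups": len(row_group_metadata),
        "num_rows": sum(rg["num_rows"] for rg in row_group_metadata),
        "total_byte_size": sum(rg["total_byte_size"] for rg in row_group_metadata),
        # default=0 keeps the helper usable on a file with no row groups
        "largest_row_group_rows": max(
            (rg["num_rows"] for rg in row_group_metadata), default=0
        ),
    }


def read_and_summarize(path: str) -> Dict[str, int]:
    """Read Parquet metadata with cuDF and summarize its row groups."""
    # Imported lazily: cuDF needs a CUDA-capable GPU and a RAPIDS install.
    import cudf

    # Assumed five-value return shape; verify against your cuDF version.
    (num_rows, num_row_groups, col_names, num_columns,
     row_group_metadata) = cudf.io.read_parquet_metadata(path)

    summary = summarize_row_groups(row_group_metadata)
    summary["num_columns"] = num_columns
    return summary
```

The summarizing helper is kept separate from the cuDF call so it can be exercised without a GPU, for example with summarize_row_groups([{"num_rows": 100, "total_byte_size": 4096}]).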

More metadata, such as the min and max values for each column in each row group, can also be extracted and returned if needed. Thoughts?

Recommendation: close #15320 without merging, in favor of this PR.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mhaseeb123 mhaseeb123 requested review from a team as code owners March 27, 2024 00:16
@github-actions github-actions bot added the libcudf and Python labels Mar 27, 2024
@mhaseeb123 mhaseeb123 added the 3 - Ready for Review, cuIO, breaking, and improvement labels Mar 27, 2024
@mhaseeb123 mhaseeb123 changed the title from "Bind read_parquet_metadata API to libcudf instead of pyarrow and also extract row_group information" to "Bind read_parquet_metadata API to libcudf instead of pyarrow and extract row_group information" Mar 27, 2024
@mhaseeb123 mhaseeb123 changed the title from "Bind read_parquet_metadata API to libcudf instead of pyarrow and extract row_group information" to "Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information" Mar 27, 2024
@mhaseeb123 mhaseeb123 added the 2 - In Progress label and removed the 3 - Ready for Review label Apr 1, 2024
Contributor

@galipremsagar galipremsagar left a comment

Thanks for working on this @mhaseeb123! It looks like some missing columns are causing pytest failures: https://github.com/rapidsai/cudf/actions/runs/8461251922/job/23181884309?pr=15398#step:8:1758

@mhaseeb123 mhaseeb123 added the 3 - Ready for Review label and removed the 2 - In Progress label Apr 2, 2024
@mhaseeb123
Member Author

mhaseeb123 commented Apr 2, 2024

> Thanks for working on this @mhaseeb123! It looks like some missing columns are causing pytest failures: https://github.com/rapidsai/cudf/actions/runs/8461251922/job/23181884309?pr=15398#step:8:1758

Thanks for reviewing the code @galipremsagar. I was already on it and just pushed a fix. The test should pass now. 🙂

@mhaseeb123 mhaseeb123 added breaking Breaking change and removed breaking Breaking change labels Apr 2, 2024
Contributor

@vuule vuule left a comment

A few small questions/suggestions.

python/cudf/cudf/_lib/parquet.pyx (resolved)
python/cudf/cudf/_lib/parquet.pyx (resolved)
cpp/src/io/parquet/reader_impl.cpp (outdated, resolved)
cpp/include/cudf/io/parquet_metadata.hpp (resolved)
python/cudf/cudf/_lib/parquet.pyx (outdated, resolved)
python/cudf/cudf/_lib/parquet.pyx (outdated, resolved)
python/cudf/cudf/io/parquet.py (outdated, resolved)
@mhaseeb123
Member Author

/merge

@rapids-bot rapids-bot bot merged commit f222b4a into rapidsai:branch-24.06 Apr 17, 2024
75 checks passed
@mhaseeb123 mhaseeb123 deleted the read-parquet-metadata branch April 17, 2024 22:44
Labels
3 - Ready for Review, breaking, cuIO, improvement, libcudf, Python
5 participants