[FEA] Add more functionality to cudf.io.read_parquet_metadata API #11214

Closed
galipremsagar opened this issue Jul 7, 2022 · 7 comments
Labels
- cuIO: cuIO issue
- feature request: New feature or request
- libcudf: Affects libcudf (C++/CUDA) code.
- Python: Affects Python cuDF API.

Comments

@galipremsagar
Contributor

Is your feature request related to a problem? Please describe.
It would be nice to have row-group-wise metadata returned instead of just the number of row groups. That way, users can identify how many rows and columns are stored in each row group.

@galipremsagar galipremsagar added feature request New feature or request Python Affects Python cuDF API. labels Jul 7, 2022
@galipremsagar galipremsagar self-assigned this Jul 7, 2022
@galipremsagar
Contributor Author

galipremsagar commented Jul 7, 2022

@vuule Did you intend to follow the same pyarrow parquet schema? Like:

(Pdb) x = pq.ParquetFile(fname)
(Pdb) x.metadata
<pyarrow._parquet.FileMetaData object at 0x7fbe0a172040>
  created_by: parquet-cpp-arrow version 8.0.0
  num_columns: 15
  num_rows: 0
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 7481

(Pdb) x.metadata.row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x7fbe0a172270>
  num_columns: 15
  num_rows: 0
  total_byte_size: 196

or do we want to extract only the necessary bit of details and return those?
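
For the "only the necessary bits" option, here is a minimal sketch of what a per-row-group summary might look like. It assumes an object shaped like the pyarrow `FileMetaData` shown above and touches only the attributes visible in that session. The function name `summarize_row_groups` and the stand-in object are hypothetical, for illustration only, and are not part of cuDF or pyarrow.

```python
from types import SimpleNamespace


def summarize_row_groups(file_metadata):
    """Return one dict per row group from a pyarrow-style FileMetaData object.

    Only attributes shown in the session above are used:
    num_row_groups, row_group(i), num_rows, num_columns, total_byte_size.
    """
    summary = []
    for i in range(file_metadata.num_row_groups):
        rg = file_metadata.row_group(i)
        summary.append(
            {
                "row_group": i,
                "num_rows": rg.num_rows,
                "num_columns": rg.num_columns,
                "total_byte_size": rg.total_byte_size,
            }
        )
    return summary


# Stand-in object mirroring the values printed above, so the sketch runs
# without a Parquet file on disk (hypothetical, for illustration only).
_rg = SimpleNamespace(num_rows=0, num_columns=15, total_byte_size=196)
fake_metadata = SimpleNamespace(num_row_groups=1, row_group=lambda i: _rg)

print(summarize_row_groups(fake_metadata))
```

With a real file, the same function should accept `pq.ParquetFile(fname).metadata` directly, since it relies on nothing beyond the attributes printed in the session.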

@vuule
Contributor

vuule commented Jul 7, 2022

I like the option to follow the pyarrow metadata structure here, if it's not a huge overhead to gather.

@galipremsagar
Contributor Author

Yea, should not be an issue since we already tap into this API anyways: https://github.com/rapidsai/cudf/blob/branch-22.08/python/cudf/cudf/io/parquet.py#L199

@galipremsagar
Contributor Author

cc: @rjzamora for visibility

@github-actions

github-actions bot commented Aug 6, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed inactive-30d labels Apr 3, 2023
rapids-bot bot pushed a commit that referenced this issue Jul 27, 2023
Closes #11675
Adds `read_parquet_metadata` to libcudf.
The metadata has the following information:
- schema - (type, name, children)
- num_rows
- num_rowgroups
- key-value string metadata in file footer

To Reviewers: Request for adding more information in metadata. Refer #11214

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Divye Gala (https://github.com/divyegala)
  - Ray Douglass (https://github.com/raydouglass)

URL: #13663
@GregoryKimball
Contributor

GregoryKimball commented Feb 16, 2024

> That way users can identify how many rows & columns are stored in each row-group.

Now that we have read_parquet_metadata from #13663, could we re-scope this issue to specify the changes we would like to see in the parquet_metadata class?

As I understand, the number of columns will be identical for all row groups. We could add row counts for each row group as a new vector, or perhaps to metadata. Are the row group min/max statistics stored as key-value pairs in parquet_metadata.metadata?
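
For comparison on the last question: in pyarrow, min/max statistics are exposed per column chunk inside each row group, not through the file-level key-value metadata. Whether `parquet_metadata.metadata` in libcudf carries them is not settled in this thread. Below is a minimal sketch of gathering per-row-group row counts and min/max values from a pyarrow-style metadata object. The function name `row_group_min_max` and the stand-in objects are hypothetical and exist only so the sketch runs without a file.

```python
from types import SimpleNamespace


def row_group_min_max(file_metadata):
    """Collect row counts and (min, max) per column chunk, per row group."""
    out = []
    for i in range(file_metadata.num_row_groups):
        rg = file_metadata.row_group(i)
        stats_by_column = {}
        for j in range(rg.num_columns):
            col = rg.column(j)
            stats = col.statistics  # may be None when the writer stored none
            if stats is not None and stats.has_min_max:
                stats_by_column[col.path_in_schema] = (stats.min, stats.max)
            else:
                stats_by_column[col.path_in_schema] = None
        out.append({"num_rows": rg.num_rows, "min_max": stats_by_column})
    return out


def _chunk(name, lo, hi):
    # Hypothetical stand-in for a column chunk that has statistics.
    stats = SimpleNamespace(has_min_max=True, min=lo, max=hi)
    return SimpleNamespace(path_in_schema=name, statistics=stats)


# One row group with two columns: "a" has statistics, "b" has none.
_chunks = [_chunk("a", 1, 9), SimpleNamespace(path_in_schema="b", statistics=None)]
_rg = SimpleNamespace(num_rows=3, num_columns=2, column=lambda j: _chunks[j])
fake_metadata = SimpleNamespace(num_row_groups=1, row_group=lambda i: _rg)

print(row_group_min_max(fake_metadata))
```

Keeping the row count and the statistics in the same per-row-group record is only one possible layout. A separate vector of row counts, as suggested above, would carry the same information.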

rapids-bot bot pushed a commit that referenced this issue Apr 17, 2024
…tract `RowGroup` information (#15398)

The `cudf.io.read_parquet_metadata` is now bound to the corresponding libcudf API instead of relying on pyarrow. The libcudf API now also returns high-level `RowGroup` metadata to solve #11214. Added additional tests and doc updates as well.

More metadata, such as `min` and `max` values for each column in each row group, can also be extracted and returned if needed. Thoughts?

Recommend: Closing #15320 without merging in favor of this PR.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #15398
@mhaseeb123
Member

Closed by #15398
