Expose some Parquet per-column configuration options via the python API #15613
Conversation
@GregoryKimball does this do what you wanted?
@mhaseeb123 I added some parameterization to at least verify that all the valid Parquet encodings work on the write side. There is already testing elsewhere to verify encoding interoperability with pyarrow/pandas.[1] I also added a check that the column that should still be compressed actually is. The option set for

[1] With the exception of BYTE_STREAM_SPLIT, which was only recently fully implemented in Arrow 16. This too can be addressed in a follow-up PR.
Partial review. Just a minor comment. Looks good otherwise.
Looks good to me!
/ok to test
Overall looks great, thanks! One change to make, then this should be good to go.
/ok to test
@galipremsagar Would you please check how
The user command would be something like
By the way, what would
This will invoke the cudf parquet writer.
This will invoke pyarrow's parquet writer through cudf code, but ignore
/okay to test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/merge
Description
Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was suggested that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a' and a list<int32> column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses the "Add cuDF-python API support for specifying encodings" task in #13501.
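To illustrate the naming rule described above, here is a minimal sketch of how a fully qualified Parquet column name can be derived for scalar versus list columns. The `qualified_names` helper and the string-based schema representation are hypothetical, written only to demonstrate the 'b.list.element' convention mentioned in the description; they are not part of the cuDF API.

```python
def qualified_names(schema):
    """Map a simple {name: type-string} schema to fully qualified
    Parquet column names (illustrative helper, not a cuDF API).

    Scalar columns keep their plain name; list columns follow the
    standard Parquet three-level structure '<name>.list.element'.
    """
    names = []
    for name, typ in schema.items():
        if typ.startswith("list<"):
            # Nested list column: options target the innermost element.
            names.append(f"{name}.list.element")
        else:
            names.append(name)
    return names

print(qualified_names({"a": "int32", "b": "list<int32>"}))
# ['a', 'b.list.element']
```

These are the names a user would pass to the per-column options (encoding, compression skipping, binary output) when targeting values inside a nested column.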
Checklist