Expose some Parquet per-column configuration options via the python API #15613
Conversation
@GregoryKimball does this do what you wanted?
@mhaseeb123 I added some parameterization to at least verify that all the valid Parquet encodings work on the write side. There is already testing elsewhere to verify encoding interoperability with pyarrow/pandas.[1] I also added a check that the column that should still be compressed actually is. The option set for

[1] With the exception of BYTE_STREAM_SPLIT, which was only recently fully implemented in Arrow 16. This too can be addressed in a follow-up PR.
Partial review. Just a minor comment. Looks good otherwise.
Looks good to me!
/ok to test
Overall looks great, thanks! One change to make, then this should be good to go.
/ok to test
@galipremsagar Would you please check how
The user command would be something like
By the way, what would
This will invoke the cudf parquet writer.
This will invoke pyarrow's parquet writer through cudf code, but ignore
/okay to test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/merge
Description
Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was suggested that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a' and a list<int32> column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses the "Add cuDF-python API support for specifying encodings" task in #13501.
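To illustrate the naming rule described above, here is a minimal sketch of how a fully qualified Parquet column name can be derived for scalar versus list columns. The `qualified_names` helper and the string-based schema representation are hypothetical, written only to demonstrate the 'b.list.element' convention mentioned in the description; they are not part of the cuDF API.

```python
def qualified_names(schema):
    """Map a simple {name: type-string} schema to fully qualified
    Parquet column names (illustrative helper, not a cuDF API).

    Scalar columns keep their plain name; list columns follow the
    standard Parquet three-level structure '<name>.list.element'.
    """
    names = []
    for name, typ in schema.items():
        if typ.startswith("list<"):
            # Nested list column: options target the innermost element.
            names.append(f"{name}.list.element")
        else:
            names.append(name)
    return names

print(qualified_names({"a": "int32", "b": "list<int32>"}))
# ['a', 'b.list.element']
```

These are the names a user would pass to the per-column options (encoding, compression skipping, binary output) when targeting values inside a nested column.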
Checklist