GH-36284: [Python][Parquet] Support write page index in Python API #36290

mapleFU · 2023-06-25T11:36:02Z

Rationale for this change

Support write_page_index in the Parquet Python API

What changes are included in this PR?

support write_page_index in properties

Are these changes tested?

Currently not

Are there any user-facing changes?

Users can now write a page index from the Python API.

Closes: [Python][Parquet] Support write page index in Parquet #36284

github-actions · 2023-06-25T11:36:28Z

⚠️ GitHub issue #36284 has been automatically assigned in GitHub to PR creator.

mapleFU · 2023-06-25T13:59:17Z

@jorisvandenbossche @pitrou Would you mind taking a look? I'm not very familiar with the Python side, so I may have gotten something wrong.

mapleFU · 2023-07-03T06:30:11Z

@pitrou @westonpace Would you mind taking a look? This patch adds support for writing the page index from Python.

AlenkaF

I only have a minor suggestion about the write_page_index docstrings.

python/pyarrow/parquet/core.py

AlenkaF

Thanks!

mapleFU · 2023-07-05T09:17:36Z

Can this patch be merged? Or should I wait for review from other committers?

pitrou · 2023-07-05T09:19:11Z

@github-actions crossbow submit -g python

mapleFU · 2023-07-05T11:02:04Z

test-conda-python-3.10-spark-master
test-cuda-python
test-conda-python-3.8-spark-v3.1.2
test-conda-python-3.10-spark-master

These cases failed, how can I try to fix them?

pitrou · 2023-07-05T11:41:16Z

@mapleFU Those are unrelated to this PR. Can you try to rebase?

AlenkaF · 2023-07-05T11:41:34Z

test-conda-python-3.10-spark-master
test-conda-python-3.8-spark-v3.1.2
test-conda-python-3.9-spark-v3.2.0

Spark failures are known and have an issue opened.

test-conda-python-3.11-hypothesis

Hypothesis failure is a new one but I do not see how it could be related to this PR.

test-cuda-python

I have seen nightlies fail with this error today already, so this is not related to the PR either.

jorisvandenbossche · 2023-07-05T11:56:17Z

> Hypothesis failure is a new one but I do not see how it could be related to this PR.

Hmm, that seems very similar to the one that I fixed last week (#36349, but now with another unknown timezone). In any case, you can ignore it here.

jorisvandenbossche · 2023-07-05T12:05:45Z

python/pyarrow/parquet/core.py

@@ -867,6 +867,10 @@ def _sanitize_table(table, new_schema, flavor):
    it will restore the timezone (Parquet only stores the UTC values without
    timezone), or columns with duration type will be restored from the int64
    Parquet column.
+write_page_index : bool, default False


Side question: should we consider making this turned on by default at some point?

Not currently. I found it hard to implement page index pruning in the current implementation. If we implement it, maybe we can make it the default.

Even if it's not used already, it would probably be beneficial to write files with the index enabled, for future use.
Is there a performance issue with enabling it?

I guess most of the time there is no performance issue. But when the user has extremely long strings, we might write too much data.

We are allowed to trim the min/max values, right?

Sure. For now it will "discard" statistics that are too long, and discard the page index with them. I will implement truncation in the future.
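
To make the trimming idea concrete, here is a minimal sketch of how min/max bounds for byte strings could be shortened while staying valid bounds. It is not the Arrow implementation (which, per the comment above, discards overly long statistics at this point); the function names are hypothetical and the sketch assumes unsigned byte-wise lexicographic ordering.

```python
from typing import Optional


def truncate_min(value: bytes, max_len: int) -> bytes:
    """Shorten a lower bound: any prefix of `value` sorts <= `value`."""
    return value[:max_len]


def truncate_max(value: bytes, max_len: int) -> Optional[bytes]:
    """Shorten an upper bound to at most `max_len` bytes.

    Returns None when no shorter upper bound exists (the kept prefix is
    empty or consists only of 0xFF bytes).
    """
    if len(value) <= max_len:
        return value
    prefix = bytearray(value[:max_len])
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return None
    # Incrementing the last kept byte makes the prefix sort above `value`.
    prefix[-1] += 1
    return bytes(prefix)


short_min = truncate_min(b"apple-pie", 3)
short_max = truncate_max(b"zebra-long", 3)
```

The truncated minimum is never larger than the original and the truncated maximum is never smaller, so a reader that prunes on these bounds may read slightly more pages but never skips a matching one.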

So if I understand correctly, we are currently not yet using the PageIndex when reading files (through the python APIs) for pruning pages when given a filter?

Should we mention that in the docstring to note that you can already write a PageIndex, but it will not yet be used when reading using pyarrow?

@jorisvandenbossche I've done that. By the way, pyarrow cannot filter using the page index yet, but parquet-rs and parquet-mr can use it to optimize reads.
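
For readers unfamiliar with page index pruning, the following is a simplified, library-independent sketch of the idea discussed in this thread: a reader consults per-page min/max values and skips pages that cannot match a range filter. The `PageStats` class and `pages_to_read` function are hypothetical and do not correspond to any pyarrow, parquet-rs, or parquet-mr API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Per-page entry, loosely modelled on what a page index records."""
    first_row: int   # index of the first row stored in the page
    num_rows: int    # number of rows in the page
    min_value: int   # smallest value in the page
    max_value: int   # largest value in the page


def pages_to_read(pages: List[PageStats], lo: int, hi: int) -> List[int]:
    """Return indices of pages whose [min, max] range overlaps [lo, hi]."""
    return [
        i for i, page in enumerate(pages)
        if page.max_value >= lo and page.min_value <= hi
    ]


# Made-up statistics for three pages of one column chunk.
pages = [
    PageStats(first_row=0, num_rows=100, min_value=1, max_value=50),
    PageStats(first_row=100, num_rows=100, min_value=51, max_value=120),
    PageStats(first_row=200, num_rows=100, min_value=121, max_value=300),
]

# Filter: 60 <= value <= 100. Only the second page can contain matches.
selected = pages_to_read(pages, 60, 100)
```

Pages outside `selected` never need to be decompressed or decoded, which is where the read-side benefit of a written page index would come from.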

python/pyarrow/_parquet.pyx

pitrou · 2023-07-05T12:11:05Z

@github-actions crossbow submit -g python

mapleFU · 2023-07-05T14:02:33Z

These still failed, lol

python/pyarrow/_parquet.pyx

python/pyarrow/tests/parquet/test_metadata.py

mapleFU · 2023-07-07T11:34:21Z

@pitrou @jorisvandenbossche I've tried to address the comments here. Would you mind taking a look?

Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>

pitrou · 2023-07-10T15:34:41Z

@github-actions crossbow submit -g python

github-actions · 2023-07-10T15:37:30Z

Revision: 9840291

Submitted crossbow builds: ursacomputing/crossbow @ actions-f780c64692

Task	Status
test-conda-python-3.10
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest
test-conda-python-3.10-pandas-nightly
test-conda-python-3.10-spark-master
test-conda-python-3.10-substrait
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-upstream_devel
test-conda-python-3.8
test-conda-python-3.8-pandas-1.0
test-conda-python-3.8-spark-v3.1.2
test-conda-python-3.9
test-conda-python-3.9-pandas-latest
test-conda-python-3.9-spark-v3.2.0
test-cuda-python
test-debian-11-python-3
test-fedora-35-python-3
test-ubuntu-20.04-python-3
test-ubuntu-22.04-python-3

…36290)

### Rationale for this change

Support `write_page_index` in Parquet Python API

### What changes are included in this PR?

support `write_page_index` in properties

### Are these changes tested?

Currently not

### Are there any user-facing changes?

User can generate page index here.

* Closes: #36284

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

conbench-apache-arrow · 2023-07-21T20:52:26Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 12f45ba.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

github-actions bot added the Component: Python and awaiting review labels Jun 25, 2023

mapleFU force-pushed the parquet/enable-write-page-index branch from 28d6e68 to 789abf8 Compare June 25, 2023 12:18

mapleFU marked this pull request as ready for review June 26, 2023 16:31

mapleFU requested a review from AlenkaF as a code owner June 26, 2023 16:31

mapleFU force-pushed the parquet/enable-write-page-index branch 2 times, most recently from d758a74 to 39553b5 Compare July 3, 2023 05:15

AlenkaF reviewed Jul 4, 2023

python/pyarrow/parquet/core.py (outdated, resolved)

github-actions bot added the awaiting committer review label and removed the awaiting review label Jul 4, 2023

AlenkaF approved these changes Jul 4, 2023

jorisvandenbossche approved these changes Jul 5, 2023

github-actions bot added the awaiting merge label and removed the awaiting committer review label Jul 5, 2023

jorisvandenbossche reviewed Jul 5, 2023

python/pyarrow/_parquet.pyx (outdated, resolved)

github-actions bot added the awaiting changes label and removed the awaiting merge label Jul 5, 2023

pitrou requested changes Jul 6, 2023

python/pyarrow/_parquet.pyx (outdated, resolved)

python/pyarrow/tests/parquet/test_metadata.py (outdated, resolved)

github-actions bot added the awaiting change review label and removed the awaiting changes label Jul 6, 2023

mapleFU and others added 9 commits July 10, 2023 17:23

[add] api add write-page-index support

fbe9962

[ADD] add comments

ceab5e4

[Update] Export metadata out

23acd5e

tiny update

4c86d53

add tests

bed8f96

Update python/pyarrow/parquet/core.py

3a8c93d

Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>

fix comment

6e55b5a

[Update] fix comment

b5572f8

Improve docstrings

5f789a6

pitrou force-pushed the parquet/enable-write-page-index branch from 2f764cd to 5f789a6 Compare July 10, 2023 15:31

Remove unused member

9840291

pitrou approved these changes Jul 10, 2023

pitrou merged commit 12f45ba into apache:main Jul 10, 2023

pitrou removed the awaiting change review label Jul 10, 2023
