
GH-33973: [Python][Docs] Update documentation for Parquet filter keyword #33974

Merged
5 commits merged into apache:master on Feb 15, 2023

Conversation

@Fokko (Contributor) commented Feb 1, 2023

Rationale for this change

I wrote a converter from an arbitrary expression to DNF, but this turned out to be unnecessary after learning that the filters keyword now accepts an expression directly.
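For illustration, a minimal sketch of what this means in practice; the file path and column names below are hypothetical:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# The filters keyword accepts a pyarrow compute Expression directly,
# so there is no need to convert it to the DNF list-of-tuples form first.
table = pq.read_table(
    "data/example.parquet",  # hypothetical path
    filters=(pc.field("year") == 2023) & (pc.field("amount") > 100),
)
```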

Closes #33973

@Fokko Fokko requested a review from AlenkaF as a code owner February 1, 2023 14:11
github-actions bot commented Feb 1, 2023

⚠️ GitHub issue #33973 has been automatically assigned in GitHub to PR creator.

@kou changed the title from "GH-33973: [PYTHON] Update parquet filter javadoc" to "GH-33973: [Python][Docs] Update documentation for Parquet filter" Feb 1, 2023
@westonpace (Member)

I don't have a problem with this change but wonder if the root cause might be a little more fundamental. I wonder if ParquetDataset itself should be deprecated. The docs you probably want are pyarrow.dataset.dataset and pyarrow.dataset.Dataset (though filter would be provided by pyarrow.dataset.Dataset.to_table which unfortunately redirects its documentation to pyarrow.dataset.Scanner.from_dataset 😰 )
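For example, a rough sketch of what that could look like with the dataset API (the path and column name are hypothetical, not a definitive recommendation):

```python
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Discover a dataset of Parquet files and push the filter down while scanning.
dataset = ds.dataset("data/", format="parquet")  # hypothetical path
table = dataset.to_table(filter=pc.field("year") == 2023)
```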

@Fokko (Contributor, Author) commented Feb 2, 2023

@westonpace Thanks for the reply!

The docs you probably want are pyarrow.dataset.dataset and pyarrow.dataset.Dataset

I would agree with you, but I went for a lower-level API because I want to re-use the connection; this avoids another HEAD request to S3. More background is provided here, and the dataset only accepts paths. If you think it is worthwhile to accept NativeFile there as well, let me know and I'm happy to raise a PR.
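As a minimal sketch of what re-using an already-open file handle looks like (the bucket, key, and region below are hypothetical):

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq

# Passing an already-open NativeFile means read_table does not have to
# resolve the path itself, avoiding the extra HEAD request.
s3 = fs.S3FileSystem(region="us-east-1")
with s3.open_input_file("my-bucket/warehouse/data.parquet") as source:
    table = pq.read_table(source)
```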

which unfortunately redirects its documentation to pyarrow.dataset.Scanner.from_dataset 😰 )

Do you want me to create a PR to copy those docs? Because of the redirect, the arguments are also not showing up in PyCharm. Let me know and I'll create a PR.

@westonpace (Member)

Do you want me to create a PR to copy those docs? Because of the redirect, the arguments are also not showing up in PyCharm. Let me know and I'll create a PR.

Yes, I think that is a good idea but I might CC @jorisvandenbossche or @amol- to weigh in on whether they know some better way to avoid the duplication or have a preference here (I normally focus on the C++ end of things.)

@jorisvandenbossche (Member)

@Fokko thanks for the catch! You are currently updating the docstring of ParquetDataset, but we should do the same update for read_table:

filters : List[Tuple] or List[List[Tuple]] or None (default)

(I think it's also read_table you are using in PyIceberg, and not ParquetDataset?)

Do you want me to create a PR to copy those docs? Because of the redirect, the arguments are also not showing up in PyCharm. Let me know and I'll create a PR.

Yes, I think that is a good idea but I might CC @jorisvandenbossche or @amol- to weigh in on whether they know some better way to avoid the duplication or have a preference here (I normally focus on the C++ end of things.)

Yeah, I think we should prefer some duplication if that gives better docstrings. I agree the indirection for the user right now isn't very user-friendly.
We might be able to share some part of the docstring and inject that in multiple places to avoid duplicating the actual content, if that doesn't make things too complicated.
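For illustration, a sketch of one way such sharing could look; the names are illustrative and not the actual pyarrow internals:

```python
# Shared fragment written once and injected into several docstrings.
_filters_doc = """\
filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None
    Filter to apply, either as an Expression or in the DNF list-of-tuples form."""

_template = """{name}: read Parquet data.

Parameters
----------
{filters}
"""


def read_table(source, filters=None):
    ...


def read_dataset(source, filters=None):
    ...


# Inject the shared fragment at import time so the text exists in one place.
read_table.__doc__ = _template.format(name="read_table", filters=_filters_doc)
read_dataset.__doc__ = _template.format(name="read_dataset", filters=_filters_doc)
```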

@Fokko (Contributor, Author) commented Feb 3, 2023

@jorisvandenbossche Updating the read_table one is a good suggestion indeed. I've updated that one as well 👍🏻

(I think it's also read_table you are using in PyIceberg, and not ParquetDataset?)

Currently we indeed use read_table. I've also played around with ParquetDataset, and it looked very similar. However, we don't need the lazy nature of the dataset, so directly loading a table makes more sense in our situation.

Yeah, I think we should prefer some duplication if that gives better docstrings. I agree the indirection for the user right now isn't very user friendly.
We might be able to share some part of the docstring and inject that in multiple places to avoid duplicating the actual content, if that doesn't make things too complicated.

Makes a lot of sense; I think sharing would be best. Let me create a separate PR for that.

@Fokko (Contributor, Author) commented Feb 3, 2023

@jorisvandenbossche first a bit of cleanup in #34034

@jorisvandenbossche (Member)

One more thing about the docstring: for both read_table and ParquetDataset, the docstring also injects the content of _DNF_filter_doc (which holds the actual explanation of how the filter is expressed in this list-of-tuples form). We should probably update that to start by saying that the filter can be either a pyarrow Expression or this DNF list-of-tuples form.
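For illustration, a short sketch of the two forms the updated text would describe (path and column names are hypothetical):

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Form 1: a pyarrow Expression
expr_filter = (pc.field("year") == 2023) & (pc.field("amount") > 100)

# Form 2: the equivalent DNF list of tuples (inner list = AND, outer lists = OR)
dnf_filter = [[("year", "==", 2023), ("amount", ">", 100)]]

t1 = pq.read_table("data/example.parquet", filters=expr_filter)
t2 = pq.read_table("data/example.parquet", filters=dnf_filter)
```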

@Fokko (Contributor, Author) commented Feb 9, 2023

@jorisvandenbossche Good call. I've updated the docstring and added an example.

@Fokko (Contributor, Author) commented Feb 14, 2023

@jorisvandenbossche WDYT?

@jorisvandenbossche (Member) left a comment

Thanks!

@jorisvandenbossche changed the title from "GH-33973: [Python][Docs] Update documentation for Parquet filter" to "GH-33973: [Python][Docs] Update documentation for Parquet filter keyword" Feb 15, 2023
@jorisvandenbossche jorisvandenbossche merged commit 306026d into apache:master Feb 15, 2023
@jorisvandenbossche jorisvandenbossche added this to the 12.0.0 milestone Feb 15, 2023
@Fokko Fokko deleted the fd-update-doc branch February 15, 2023 11:41
ursabot commented Feb 15, 2023

Benchmark runs are scheduled for baseline = e63215c and contender = 306026d. 306026d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.58% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.16% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 306026d8 ec2-t3-xlarge-us-east-2
[Failed] 306026d8 test-mac-arm
[Finished] 306026d8 ursa-i9-9960x
[Finished] 306026d8 ursa-thinkcentre-m75q
[Finished] e63215ca ec2-t3-xlarge-us-east-2
[Failed] e63215ca test-mac-arm
[Finished] e63215ca ursa-i9-9960x
[Finished] e63215ca ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented Feb 15, 2023

['Python', 'R'] benchmarks have a high level of regressions on test-mac-arm.

gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
…r keyword (apache#33974)

### Rationale for this change

I wrote a converter from an arbitrary expression to DNF, but this was not needed after learning that it just accepts an expression now.

Closes apache#33973

Authored-by: Fokko Driesprong <fokko@tabular.io>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
…r keyword (apache#33974)

### Rationale for this change

I wrote a converter from an arbitrary expression to DNF, but this was not needed after learning that it just accepts an expression now.

Closes apache#33973

Authored-by: Fokko Driesprong <fokko@tabular.io>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Successfully merging this pull request may close these issues.

[Python] Parquet filter keyword of read_table docstring is outdated
4 participants