
GH-33973: [Python][Docs] Update documentation for Parquet filter keyword #33974

Merged
5 commits merged into apache:master on Feb 15, 2023

Conversation

@Fokko (Contributor) commented Feb 1, 2023

Rationale for this change

I wrote a converter from an arbitrary expression to DNF, but this turned out to be unnecessary after learning that the filters keyword now accepts an expression directly.
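For illustration, a minimal sketch of what this means in practice; the file path and column names below are hypothetical:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# The filters keyword accepts a pyarrow compute Expression directly,
# so there is no need to convert it to the DNF list-of-tuples form first.
table = pq.read_table(
    "data/example.parquet",  # hypothetical path
    filters=(pc.field("year") == 2023) & (pc.field("amount") > 100),
)
```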

Closes #33973

@Fokko Fokko requested a review from AlenkaF as a code owner February 1, 2023 14:11
github-actions bot commented Feb 1, 2023

⚠️ GitHub issue #33973 has been automatically assigned in GitHub to PR creator.

@kou changed the title from "GH-33973: [PYTHON] Update parquet filter javadoc" to "GH-33973: [Python][Docs] Update documentation for Parquet filter" Feb 1, 2023
@westonpace (Member)

I don't have a problem with this change but wonder if the root cause might be a little more fundamental. I wonder if ParquetDataset itself should be deprecated. The docs you probably want are pyarrow.dataset.dataset and pyarrow.dataset.Dataset (though filter would be provided by pyarrow.dataset.Dataset.to_table which unfortunately redirects its documentation to pyarrow.dataset.Scanner.from_dataset 😰 )
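For example, a rough sketch of what that could look like with the dataset API (the path and column name are hypothetical, not a definitive recommendation):

```python
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Discover a dataset of Parquet files and push the filter down while scanning.
dataset = ds.dataset("data/", format="parquet")  # hypothetical path
table = dataset.to_table(filter=pc.field("year") == 2023)
```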

@Fokko (Contributor, Author) commented Feb 2, 2023

@westonpace Thanks for the reply!

The docs you probably want are pyarrow.dataset.dataset and pyarrow.dataset.Dataset

I would agree with you, but I went for a lower-level API because I want to re-use the connection; this avoids another HEAD request to S3. More background is provided here, and the dataset only accepts paths. If you think it is worthwhile to accept NativeFile there as well, let me know and I'm happy to raise a PR.
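As a minimal sketch of what re-using an already-open file handle looks like (the bucket, key, and region below are hypothetical):

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq

# Passing an already-open NativeFile means read_table does not have to
# resolve the path itself, avoiding the extra HEAD request.
s3 = fs.S3FileSystem(region="us-east-1")
with s3.open_input_file("my-bucket/warehouse/data.parquet") as source:
    table = pq.read_table(source)
```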

which unfortunately redirects its documentation to pyarrow.dataset.Scanner.from_dataset 😰 )

Do you want me to create a PR to copy those docs? Because of the redirect, the arguments are also not showing up in PyCharm. Let me know and I'll create a PR.

@westonpace (Member)

Do you want me to create a PR to copy those docs? Because of the redirect, the arguments are also not showing up in PyCharm. Let me know and I'll create a PR.

Yes, I think that is a good idea but I might CC @jorisvandenbossche or @amol- to weigh in on whether they know some better way to avoid the duplication or have a preference here (I normally focus on the C++ end of things.)

@jorisvandenbossche (Member)

@Fokko thanks for the catch! You are currently updating the docstring of ParquetDataset, but we should do the same update for read_table:

filters : List[Tuple] or List[List[Tuple]] or None (default)

(I think it's also read_table you are using in PyIceberg, and not ParquetDataset?)

Do you want me to create a PR to copy those docs? Because of the redirect, the arguments are also not showing up in PyCharm. Let me know and I'll create a PR.

Yes, I think that is a good idea but I might CC @jorisvandenbossche or @amol- to weigh in on whether they know some better way to avoid the duplication or have a preference here (I normally focus on the C++ end of things.)

Yeah, I think we should prefer some duplication if that gives better docstrings. I agree the indirection for the user right now isn't very user-friendly.
We might be able to share some part of the docstring and inject that in multiple places to avoid duplicating the actual content, if that doesn't make things too complicated.
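For illustration, a sketch of one way such sharing could look; the names are illustrative and not the actual pyarrow internals:

```python
# Shared fragment written once and injected into several docstrings.
_filters_doc = """\
filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None
    Filter to apply, either as an Expression or in the DNF list-of-tuples form."""

_template = """{name}: read Parquet data.

Parameters
----------
{filters}
"""


def read_table(source, filters=None):
    ...


def read_dataset(source, filters=None):
    ...


# Inject the shared fragment at import time so the text exists in one place.
read_table.__doc__ = _template.format(name="read_table", filters=_filters_doc)
read_dataset.__doc__ = _template.format(name="read_dataset", filters=_filters_doc)
```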

@Fokko (Contributor, Author) commented Feb 3, 2023

@jorisvandenbossche Updating the read_table one is a good suggestion indeed. I've updated that one as well 👍🏻

(I think it's also read_table you are using in PyIceberg, and not ParquetDataset?)

Currently we indeed use read_table. I've also played around with ParquetDataset, and it looked very similar. However, we don't need the lazy nature of the dataset, so directly loading a table makes more sense in our situation.

Yeah, I think we should prefer some duplication if that gives better docstrings. I agree the indirection for the user right now isn't very user friendly.
We might be able to share some part of the docstring and inject that in multiple places to avoid duplicating the actual content, if that doesn't make things too complicated.

Makes a lot of sense; I think sharing would be best. Let me create a separate PR for that.

@Fokko (Contributor, Author) commented Feb 3, 2023

@jorisvandenbossche first a bit of cleanup in #34034

@jorisvandenbossche (Member)

One more thing about the docstring: for both read_table and ParquetDataset, the docstring also injects the content of _DNF_filter_doc (which holds the actual explanation of how the filter is expressed in this list-of-tuples form). We should probably update that to start by saying that the filter can be either a pyarrow Expression or this DNF list-of-tuples form.
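For illustration, a short sketch of the two forms the updated text would describe (path and column names are hypothetical):

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Form 1: a pyarrow Expression
expr_filter = (pc.field("year") == 2023) & (pc.field("amount") > 100)

# Form 2: the equivalent DNF list of tuples (inner list = AND, outer lists = OR)
dnf_filter = [[("year", "==", 2023), ("amount", ">", 100)]]

t1 = pq.read_table("data/example.parquet", filters=expr_filter)
t2 = pq.read_table("data/example.parquet", filters=dnf_filter)
```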

@Fokko (Contributor, Author) commented Feb 9, 2023

@jorisvandenbossche Good call. I've updated the docstring and added an example.

@Fokko (Contributor, Author) commented Feb 14, 2023

@jorisvandenbossche WDYT?

@jorisvandenbossche (Member) left a comment

Thanks!

@jorisvandenbossche changed the title from "GH-33973: [Python][Docs] Update documentation for Parquet filter" to "GH-33973: [Python][Docs] Update documentation for Parquet filter keyword" Feb 15, 2023
@jorisvandenbossche jorisvandenbossche merged commit 306026d into apache:master Feb 15, 2023
@jorisvandenbossche jorisvandenbossche added this to the 12.0.0 milestone Feb 15, 2023
@Fokko Fokko deleted the fd-update-doc branch February 15, 2023 11:41
ursabot commented Feb 15, 2023

Benchmark runs are scheduled for baseline = e63215c and contender = 306026d. 306026d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.58% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.16% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 306026d8 ec2-t3-xlarge-us-east-2
[Failed] 306026d8 test-mac-arm
[Finished] 306026d8 ursa-i9-9960x
[Finished] 306026d8 ursa-thinkcentre-m75q
[Finished] e63215ca ec2-t3-xlarge-us-east-2
[Failed] e63215ca test-mac-arm
[Finished] e63215ca ursa-i9-9960x
[Finished] e63215ca ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented Feb 15, 2023

['Python', 'R'] benchmarks have a high level of regressions on test-mac-arm.

gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
…r keyword (apache#33974)

### Rationale for this change

I wrote a converter from an arbitrary expression to DNF, but this was not needed after learning that it just accepts an expression now.

Closes apache#33973

Authored-by: Fokko Driesprong <fokko@tabular.io>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
…r keyword (apache#33974)

### Rationale for this change

I wrote a converter from an arbitrary expression to DNF, but this was not needed after learning that it just accepts an expression now.

Closes apache#33973

Authored-by: Fokko Driesprong <fokko@tabular.io>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Successfully merging this pull request may close these issues.

[Python] Parquet filter keyword of read_table docstring is outdated
4 participants