Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35531: [Python] C Data Interface PyCapsule Protocol #37797

Merged
merged 1 commit into from
Oct 18, 2023

Conversation

wjones127
Copy link
Member

@wjones127 wjones127 commented Sep 20, 2023

Rationale for this change

What changes are included in this PR?

  • A new specification for Arrow PyCapsules and related dunder methods
  • Implementing the dunder methods for DataType, Field, Schema, Array, RecordBatch, Table, and RecordBatchReader.

Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

Are there any user-facing changes?

This introduces some new APIs and documents them.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 23, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot Will for putting this up!

I already took a look at the spec description, will try to review the implementation tomorrow as well.

One general question on that: do we generally expect to only handle with "data" objects that expose their schema through __arrow_c_schema__? Or do we also want to enable / encourage libraries to pass actual schema objects around that just implement this interface? (here you implemented it for pyarrow.DataType/Field/Schema)

And question for ArrowArrayExportable: it might be useful for generalization that an Array object (like our pyarrow.Array/RecordBatch) also implements the stream interface, with just returning a single batch? That way, consumers that typically work with a stream of data (eg duckdb) could automatically support Array/RecordBatch as well, without needing specific code for this subset of objects.

docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/index.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
@pitrou pitrou changed the title GH-34031: C Data Interface PyCapsule Protocol GH-34031: [Python] C Data Interface PyCapsule Protocol Sep 26, 2023
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
python/pyarrow/types.pxi Outdated Show resolved Hide resolved
python/pyarrow/types.pxi Show resolved Hide resolved
python/pyarrow/tests/test_cffi.py Outdated Show resolved Hide resolved
python/pyarrow/tests/test_cffi.py Show resolved Hide resolved
python/pyarrow/includes/libarrow.pxd Outdated Show resolved Hide resolved
@jorisvandenbossche jorisvandenbossche changed the title GH-34031: [Python] C Data Interface PyCapsule Protocol GH-35531: [Python] C Data Interface PyCapsule Protocol Sep 27, 2023
@github-actions
Copy link

⚠️ GitHub issue #35531 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 27, 2023
@wjones127 wjones127 force-pushed the gh-34031-c-interface-capsules branch from 6ec61ca to 446606c Compare September 27, 2023 18:40
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 27, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 27, 2023
@wjones127 wjones127 force-pushed the gh-34031-c-interface-capsules branch from 02fe895 to 6fff484 Compare October 1, 2023 04:06
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some small mostly doc comments. And nice new examples section!

Can be for a follow-up as well, but should we add __arrow_c_stream__ to ChunkedArray?

docs/source/format/ADBC.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
docs/source/format/CDataInterface/PyCapsuleInterface.rst Outdated Show resolved Hide resolved
python/pyarrow/array.pxi Outdated Show resolved Hide resolved
python/pyarrow/table.pxi Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Oct 18, 2023

@github-actions crossbow submit -g python -g wheel

@pitrou pitrou self-requested a review October 18, 2023 09:09
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, really great! Let's hope we can get it in 14.0.0, if only for the spec definition.

@github-actions
Copy link

Revision: 51657bb

Submitted crossbow builds: ursacomputing/crossbow @ actions-7f39314d16

Task Status
preview-docs Github Actions

@github-actions

This comment was marked as outdated.

@pitrou pitrou force-pushed the gh-34031-c-interface-capsules branch from 51657bb to 87e5fc5 Compare October 18, 2023 09:52
@pitrou
Copy link
Member

pitrou commented Oct 18, 2023

@github-actions crossbow submit -g python -g wheel

@github-actions
Copy link

Revision: 87e5fc5

Submitted crossbow builds: ursacomputing/crossbow @ actions-477302fc6e

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.10-cython2 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.5.0 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.12 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.5.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions
wheel-clean Github Actions
wheel-macos-big-sur-cp310-arm64 Github Actions
wheel-macos-big-sur-cp311-arm64 Github Actions
wheel-macos-big-sur-cp312-arm64 Github Actions
wheel-macos-big-sur-cp38-arm64 Github Actions
wheel-macos-big-sur-cp39-arm64 Github Actions
wheel-macos-mojave-cp310-amd64 Github Actions
wheel-macos-mojave-cp311-amd64 Github Actions
wheel-macos-mojave-cp312-amd64 Github Actions
wheel-macos-mojave-cp38-amd64 Github Actions
wheel-macos-mojave-cp39-amd64 Github Actions
wheel-manylinux-2-28-cp310-amd64 Github Actions
wheel-manylinux-2-28-cp310-arm64 Github Actions
wheel-manylinux-2-28-cp311-amd64 Github Actions
wheel-manylinux-2-28-cp311-arm64 Github Actions
wheel-manylinux-2-28-cp312-amd64 Github Actions
wheel-manylinux-2-28-cp312-arm64 Github Actions
wheel-manylinux-2-28-cp38-amd64 Github Actions
wheel-manylinux-2-28-cp38-arm64 Github Actions
wheel-manylinux-2-28-cp39-amd64 Github Actions
wheel-manylinux-2-28-cp39-arm64 Github Actions
wheel-manylinux-2014-cp310-amd64 Github Actions
wheel-manylinux-2014-cp310-arm64 Github Actions
wheel-manylinux-2014-cp311-amd64 Github Actions
wheel-manylinux-2014-cp311-arm64 Github Actions
wheel-manylinux-2014-cp312-amd64 Github Actions
wheel-manylinux-2014-cp312-arm64 Github Actions
wheel-manylinux-2014-cp38-amd64 Github Actions
wheel-manylinux-2014-cp38-arm64 Github Actions
wheel-manylinux-2014-cp39-amd64 Github Actions
wheel-manylinux-2014-cp39-arm64 Github Actions
wheel-windows-cp310-amd64 Github Actions
wheel-windows-cp311-amd64 Github Actions
wheel-windows-cp312-amd64 Github Actions
wheel-windows-cp38-amd64 Github Actions
wheel-windows-cp39-amd64 Github Actions

@pitrou pitrou merged commit d68f8e2 into apache:main Oct 18, 2023
11 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label Oct 18, 2023
raulcd pushed a commit that referenced this pull request Oct 18, 2023
### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: #34031
* Closes: #35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d68f8e2.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…37797)

### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: apache#34031
* Closes: apache#35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…37797)

### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: apache#34031
* Closes: apache#35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…37797)

### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: apache#34031
* Closes: apache#35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
jorisvandenbossche added a commit that referenced this pull request May 17, 2024
…psule (#41538)

### Rationale for this change

Fixes the dropped `pa.schema` metadata reported in #38575, which was introduced in #37797.

### What changes are included in this PR?

Passes through the `metadata` to the short-circuited `Schema` created with `_import_from_c_capsule`.

### Are these changes tested?

Yes - added `metadata` to the existing test.

### Are there any user-facing changes?

I'm not sure this quite rises to the `(b) a bug that caused incorrect or invalid data to be produced,` condition, but I added that note to be safe since the resulting schema is "incorrect" (and broke some round-trip tests on my end after a pyarrow update):

**This PR contains a "Critical Fix".**

* GitHub Issue: #38575

Lead-authored-by: Jacob Hayes <jacob.r.hayes@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…m PyCapsule (apache#41538)

### Rationale for this change

Fixes the dropped `pa.schema` metadata reported in apache#38575, which was introduced in apache#37797.

### What changes are included in this PR?

Passes through the `metadata` to the short-circuited `Schema` created with `_import_from_c_capsule`.

### Are these changes tested?

Yes - added `metadata` to the existing test.

### Are there any user-facing changes?

I'm not sure this quite rises to the `(b) a bug that caused incorrect or invalid data to be produced,` condition, but I added that note to be safe since the resulting schema is "incorrect" (and broke some round-trip tests on my end after a pyarrow update):

**This PR contains a "Critical Fix".**

* GitHub Issue: apache#38575

Lead-authored-by: Jacob Hayes <jacob.r.hayes@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
jorisvandenbossche added a commit that referenced this pull request Jun 26, 2024
… Data support (#40708)

### Rationale for this change

We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (#35531 / #37797).

We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972).

This expands the Python exposure of the interface with support for the newer Device structs.

### What changes are included in this PR?

Update the specification to defined two additional dunders:

* `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
* `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream" 

### Are these changes tested?

Spec-only change

* GitHub Issue: #38325

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Jul 9, 2024
…Device Data support (apache#40708)

### Rationale for this change

We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (apache#35531 / apache#37797).

We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (apache#34972).

This expands the Python exposure of the interface with support for the newer Device structs.

### What changes are included in this PR?

Update the specification to defined two additional dunders:

* `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
* `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream" 

### Are these changes tested?

Spec-only change

* GitHub Issue: apache#38325

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@vyasr vyasr mentioned this pull request Aug 28, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants