Python bindings: check for Arrow PyCapsule Interface in ogr.Layer.WritePyArrow #9132

Closed
kylebarron opened this issue Jan 24, 2024 · 1 comment · Fixed by #9133

Expected behavior and actual behavior.

This is a corollary of #9043, which added support for the Arrow PyCapsule Interface for reading from a layer. This ticket is a feature request for writing from objects that expose the PyCapsule Interface.

The current implementation of ogr.Layer.WritePyArrow uses pyarrow-specific APIs, including the to_batches method:

if hasattr(pa_batch, "to_batches"):

and later the _export_to_c methods:

pa_batch._export_to_c(array._getPtr(), schema._getPtr())

pa_schema._export_to_c(schema._getPtr())

With the PyCapsule Interface, any Arrow-based table or record batch would be supported: not just pyarrow (v14 or higher) but also geoarrow-c, geoarrow-rust, and potentially more in the future, such as geopandas (ref pandas-dev/pandas#56587 for the pandas implementation).

Table constructs like pyarrow.Table include an __arrow_c_stream__ method, and RecordBatch constructs like pyarrow.RecordBatch include an __arrow_c_array__ method that returns a struct column.
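
As a rough illustration of the duck-typing this relies on, the check below looks only at which export method an object exposes and never imports pyarrow. The function name arrow_capsule_kind and the two stand-in classes are hypothetical and exist only for this sketch; they are not part of GDAL or pyarrow.

```python
def arrow_capsule_kind(obj):
    """Return "stream", "array", or None depending on which Arrow
    PyCapsule export method the object exposes (hypothetical helper)."""
    # Table-like objects expose __arrow_c_stream__; prefer it when present.
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"
    # RecordBatch-like objects expose __arrow_c_array__.
    if hasattr(obj, "__arrow_c_array__"):
        return "array"
    return None


class FakeTable:
    """Stand-in for any table-like producer (illustration only)."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


class FakeBatch:
    """Stand-in for any record-batch-like producer (illustration only)."""

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("illustration only")
```

With pyarrow v14 or higher installed, passing a pyarrow.Table or a pyarrow.RecordBatch to the same function should take the first and second branch respectively.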

The exact changes to WritePyArrow would be:

  • If keeping the dependency on pyarrow is desired:
    • The most minimal change would be checking for __arrow_c_stream__ on the input value and calling pyarrow.table() on the input. If __arrow_c_stream__ does not exist but __arrow_c_array__ does exist, then call pyarrow.record_batch() on the input.
  • If removing the dependency on pyarrow is desired:
    • Call __arrow_c_stream__ to access the underlying stream pointer, which has a reference to the schema and an iterator for the batches. Then pass those pointers directly into self.CreateFieldFromArrowSchema and self.WriteArrowBatch.
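
The first option (keeping pyarrow) could be sketched roughly as follows. This is not the code that landed in GDAL; to_pyarrow is a hypothetical helper, and its table and record_batch parameters are an assumption added only so the dispatch can be exercised without pyarrow installed. By default they resolve to pyarrow.table and pyarrow.record_batch, which accept PyCapsule-exporting objects in pyarrow v14 or higher.

```python
def to_pyarrow(obj, table=None, record_batch=None):
    """Convert a PyCapsule-exporting object to a pyarrow Table or
    RecordBatch (hypothetical helper, not GDAL's implementation)."""
    if table is None or record_batch is None:
        # Imported lazily so the module loads even without pyarrow.
        import pyarrow as pa  # assumes pyarrow >= 14

        table = table or pa.table
        record_batch = record_batch or pa.record_batch

    # Prefer the stream interface, as suggested in the list above.
    if hasattr(obj, "__arrow_c_stream__"):
        return table(obj)
    # Fall back to the single-batch interface.
    if hasattr(obj, "__arrow_c_array__"):
        return record_batch(obj)
    raise TypeError(
        "expected an object exposing __arrow_c_stream__ or __arrow_c_array__"
    )
```

The result would then flow through the existing pyarrow-based code path in WritePyArrow unchanged, which is what makes this the smaller of the two changes.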

Steps to reproduce the problem.

Feature request, not a bug.

Operating system

Feature request, not a bug.

GDAL version and provenance

Feature request, not a bug.

rouault (Member) commented Jan 24, 2024

Implemented per #9133

rouault added a commit to rouault/gdal that referenced this issue Jan 24, 2024
…w_c_stream__ or __arrow_c_array__ interfaces

fixes OSGeo#9132
dshean pushed a commit to dshean/gdal that referenced this issue Jan 27, 2024