GH-40066: [Python] Support `requested_schema` in `__arrow_c_stream__()` #40070

paleolimbot · 2024-02-13T19:42:42Z

Rationale for this change

The `requested_schema` portion of the `__arrow_c_stream__()` protocol methods errored in all cases if passed an unequal schema. There was a note about figuring out how to check the cast before doing it and a comment in #40066 about how it should be done lazily. This PR (hopefully) solves both!

What changes are included in this PR?

Added `arrow::py::CastingRecordBatchReader`, which wraps an `arrow::RecordBatchReader`, casting each batch as it is pulled.
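
As a rough illustration of the "cast each batch as it is pulled" idea, here is a minimal pure-Python sketch. It is not the PR's C++ implementation and does not use pyarrow: `CastingReader`, the dict-of-lists `Batch` stand-in, and the `batches_cast` counter are all hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# A "batch" here is a dict of column name -> list of values. This is only a
# stand-in for an Arrow record batch, not the pyarrow type.
Batch = Dict[str, List[object]]


class CastingReader:
    """Hypothetical wrapper that casts each batch only when it is pulled."""

    def __init__(self, source: Iterable[Batch], casters: Dict[str, Callable]):
        self._source = iter(source)
        self._casters = casters
        self.batches_cast = 0  # bookkeeping that makes the laziness observable

    def __iter__(self) -> Iterator[Batch]:
        return self

    def __next__(self) -> Batch:
        # StopIteration from the wrapped source propagates at end of stream.
        batch = next(self._source)
        self.batches_cast += 1
        return {
            name: [None if v is None else self._casters[name](v) for v in values]
            for name, values in batch.items()
        }


def make_batches() -> Iterator[Batch]:
    yield {"ints": [1, 2, 3]}
    yield {"ints": [4, None]}


reader = CastingReader(make_batches(), {"ints": float})
first = next(reader)  # only the first batch has been cast at this point
```

Because the wrapper does no work in its constructor, a consumer that stops after the first batch never pays for casting the rest of the stream.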

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes: the current approach adds `RecordBatchReader.cast()` as the way to access the casting reader.

github-actions · 2024-02-13T19:43:10Z

⚠️ GitHub issue #40066 has been automatically assigned in GitHub to PR creator.

pitrou · 2024-02-15T08:34:25Z

python/pyarrow/src/arrow/python/ipc.cc

@@ -19,6 +19,7 @@

 #include <memory>

+#include "arrow/compute/api.h"


Nit: `arrow/compute/cast.h` is probably sufficient and will pull in fewer headers.

pitrou · 2024-02-15T08:35:02Z

python/pyarrow/src/arrow/python/ipc.cc

+  // Try to cast an empty version of all the columns before succeeding
+  compute::CastOptions options;
+  for (int i = 0; i < num_fields; i++) {
+    ARROW_ASSIGN_OR_RAISE(auto empty_array, MakeEmptyArray(src->field(i)->type()));


Instead, you can probably call CanCast on the pairs of types?

pitrou · 2024-02-15T08:36:46Z

python/pyarrow/src/arrow/python/ipc.cc

+  ArrayVector columns(num_columns);
+  for (int i = 0; i < num_columns; i++) {
+    ARROW_ASSIGN_OR_RAISE(columns[i],
+                          compute::Cast(*out->column(i), schema_->field(i)->type()));


Do we want to check for nulls if the destination field is non-nullable?
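
The concern can be sketched in a few lines of plain Python: a type-level cast may succeed while the data still violates the target field's nullability, so that needs a separate per-batch check. The function name `check_no_nulls` and the choice of `ValueError` are assumptions made for illustration, not what the PR's C++ code does.

```python
from typing import List, Optional


def check_no_nulls(column: List[Optional[object]], name: str, nullable: bool) -> None:
    """Reject a column that contains nulls when the target field forbids them."""
    # Hypothetical error type and message; the PR's actual status is not shown here.
    if not nullable and any(value is None for value in column):
        raise ValueError(
            f"Field '{name}' is non-nullable but the batch contains nulls"
        )
```

Unlike the type compatibility check, this one depends on the values themselves, so it can only run as each batch is pulled.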

pitrou · 2024-02-15T08:39:56Z

python/pyarrow/tests/test_cffi.py

+    batch = make_batch()
+    requested_schema = pa.schema([('ints', pa.list_(pa.int64()))])
+    requested_capsule = requested_schema.__arrow_c_schema__()
+    # RecordBatch has no cast() method


Do we have a GH issue open for this?

paleolimbot · 2024-02-23

I wrapped the Table implementation (but could also remove that and open an issue). I considered pasting the implementation as well, but assembling the options is not trivial and copying that also seemed problematic.

paleolimbot · 2024-02-26T13:37:24Z

@pitrou I think I've implemented your suggestions whenever you have time to take another look!

pitrou · 2024-02-26T16:08:25Z

python/pyarrow/ipc.pxi

+            RecordBatchReader out
+
+        if self.schema.names != target_schema.names:
+            raise ValueError("Target schema's field names are not matching "


Nit, but you can use f-strings now rather than explicit format calls.

pitrou · 2024-02-26T16:10:32Z

python/pyarrow/src/arrow/python/ipc.cc

+  // Ensure all columns can be cast before succeeding
+  for (int i = 0; i < num_fields; i++) {
+    if (!compute::CanCast(*src->field(i)->type(), *schema->field(i)->type())) {
+      return Status::NotImplemented("Field ", i, " cannot be cast from ",


Status::TypeError sounds better IMHO. NotImplemented implies that the corresponding cast should be implemented some day.
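
To make the two review points concrete (check every pair of types up front, and report an impossible cast as a type error), here is a small stdlib-only Python sketch. The `SUPPORTED_CASTS` table, the string type names, and the helper names are hypothetical stand-ins; the real check in the PR is `compute::CanCast`, whose rules are not reproduced here.

```python
from typing import List, Set, Tuple

# Hypothetical table of supported (source, target) type pairs, for
# illustration only. It does not reflect Arrow's actual cast rules.
SUPPORTED_CASTS: Set[Tuple[str, str]] = {
    ("int32", "int64"),
    ("int64", "float64"),
    ("string", "large_string"),
}


def can_cast(src: str, dst: str) -> bool:
    """Return True if a cast is registered for this pair of type names."""
    return src == dst or (src, dst) in SUPPORTED_CASTS


def validate_cast(src_types: List[str], dst_types: List[str]) -> None:
    """Raise before any data is read if some field cannot be cast."""
    if len(src_types) != len(dst_types):
        raise ValueError(f"Expected {len(src_types)} fields, got {len(dst_types)}")
    for i, (src, dst) in enumerate(zip(src_types, dst_types)):
        if not can_cast(src, dst):
            # TypeError: the request itself is invalid, as opposed to a
            # feature that is merely not implemented yet.
            raise TypeError(f"Field {i} cannot be cast from {src} to {dst}")
```

Running this check when the reader is created means the caller learns about an impossible cast immediately, rather than partway through consuming the stream.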

pitrou · 2024-02-26T16:14:24Z

python/pyarrow/table.pxi

@@ -2995,7 +3017,7 @@ cdef class RecordBatch(_Tabular):
        ----------
        requested_schema : PyCapsule | None
            A PyCapsule containing a C ArrowSchema representation of a requested
-            schema. PyArrow will attempt to cast the batch to this schema.
+            schema. PyArrow will attempt to cast each batch to this schema.


Why this change?

paleolimbot · 2024-02-27

I don't remember! (Reverted!)

pitrou · 2024-02-26T16:17:25Z

python/pyarrow/tests/test_ipc.py

@@ -51,16 +51,16 @@ def get_source(self):

    def write_batches(self, num_batches=5, as_table=False):
        nrows = 5
-        schema = pa.schema([('one', pa.float64()), ('two', pa.utf8())])
+        schema = pa.schema([("one", pa.float64()), ("two", pa.utf8())])


Any reason for all these style changes? These don't seem related. Did you apply a formatting tool by mistake?

paleolimbot · 2024-02-27

😬 (Fixed now!)

pitrou

Thanks @paleolimbot!

pitrou · 2024-02-28T11:33:01Z

@jorisvandenbossche @wjones127 Does one of you want to take a quick look here? Otherwise I'll merge.

jorisvandenbossche

Looks good, thanks a lot!

…eam__()` (apache#40070)

### Rationale for this change

The `requested_schema` portion of the `__arrow_c_stream__()` protocol methods errored in all cases if passed an unequal schema. There was a note about figuring out how to check the cast before doing it and a comment in apache#40066 about how it should be done lazily. This PR (hopefully) solves both!

### What changes are included in this PR?

- Added `arrow::py::CastingRecordBatchReader`, which wraps a `arrow::RecordBatchReader`, casting each batch as it is pulled.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes: the current approach adds `RecordBatchReader.cast()` as the way to access the casting reader.

* Closes: apache#40066
* GitHub Issue: apache#40066

Lead-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Co-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

conbench-apache-arrow · 2024-02-29T03:45:48Z

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit d6b9051.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

paleolimbot added 2 commits February 13, 2024 15:18

first pass

37fe245

better checking of the cast

a907ef9

github-actions bot added the Component: Python and awaiting committer review labels Feb 13, 2024

paleolimbot added 4 commits February 13, 2024 15:52

fix include

ef38cfa

clang format

09a8d67

fix tests

1fafc3e

format

5c4edb9

paleolimbot marked this pull request as ready for review February 14, 2024 13:48

paleolimbot requested a review from pitrou February 14, 2024 13:48

pitrou reviewed Feb 15, 2024

paleolimbot and others added 4 commits February 16, 2024 11:25

Update python/pyarrow/ipc.pxi

7a5f906

Co-authored-by: Antoine Pitrou <pitrou@free.fr>

use cast header

ea79c7e

batch reset

2f7ac44

use cancast

ff91019

paleolimbot force-pushed the python-casting-reader branch from 56dc004 to ff91019 Compare February 21, 2024 16:25

paleolimbot added 7 commits February 21, 2024 12:55

one more test

e58386a

yet another test

c0a1378

some tests

5480733

check nulls

0d3f66a

documentation

c1b5ace

align with table.cast()

3572ef0

add cast method for record batch

a233935

github-actions bot added the awaiting changes label and removed the awaiting committer review label Feb 23, 2024

pitrou reviewed Feb 26, 2024

paleolimbot and others added 2 commits February 26, 2024 16:21

Update python/pyarrow/src/arrow/python/ipc.cc

d503ec7

Co-authored-by: Antoine Pitrou <pitrou@free.fr>

Update python/pyarrow/table.pxi

581ba96

Co-authored-by: Antoine Pitrou <pitrou@free.fr>

github-actions bot added the awaiting change review label and removed the awaiting changes label Feb 26, 2024

paleolimbot added 7 commits February 26, 2024 16:24

use fstring

169a2ce

fix warning class

2cfbe16

revert typo change

3c7aad1

revert all changes to test_ipc.py

8f9a162

re-add tests

3011115

fix quotes

86a476e

re-add one more test

9f085af

github-actions bot added the awaiting changes label and removed the awaiting change review label Feb 27, 2024

pitrou changed the title ~~GH-40066: [Python] Support requested_schema in __arrow_c__stream__()~~ GH-40066: [Python] Support requested_schema in __arrow_c_stream__() Feb 28, 2024

pitrou approved these changes Feb 28, 2024

jorisvandenbossche approved these changes Feb 28, 2024

jorisvandenbossche merged commit d6b9051 into apache:main Feb 28, 2024
18 of 19 checks passed

jorisvandenbossche removed the awaiting changes label Feb 28, 2024

github-actions bot added the awaiting merge label Feb 28, 2024

paleolimbot deleted the python-casting-reader branch March 7, 2024 02:03

This was referenced Apr 8, 2024

GH-38010: [Python] Construct pyarrow.Field and ChunkedArray through Arrow PyCapsule Protocol #40818

Merged

[Python] Cast one chunk at a time in ChunkedArray ArrowStream export with requested type #41064

Open
