Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40066: [Python] Support requested_schema in __arrow_c_stream__() #40070

Merged
merged 26 commits into from
Feb 28, 2024

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Feb 13, 2024

Rationale for this change

The requested_schema portion of the __arrow_c_stream__() protocol methods errored in all cases if passed an unequal schema. There was a note about figuring out how to check the cast before doing it and a comment in #40066 about how it should be done lazily. This PR (hopefully) solves both!

What changes are included in this PR?

  • Added arrow::py::CastingRecordBatchReader, which wraps a arrow::RecordBatchReader, casting each batch as it is pulled.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes: the current approach adds RecordBatchReader.cast() as the way to access the casting reader.

Copy link

⚠️ GitHub issue #40066 has been automatically assigned in GitHub to PR creator.

@paleolimbot paleolimbot marked this pull request as ready for review February 14, 2024 13:48
@@ -19,6 +19,7 @@

#include <memory>

#include "arrow/compute/api.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: arrow/compute/cast.h is probably sufficient and will pull less headers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

// Try to cast an empty version of all the columns before succceeding
compute::CastOptions options;
for (int i = 0; i < num_fields; i++) {
ARROW_ASSIGN_OR_RAISE(auto empty_array, MakeEmptyArray(src->field(i)->type()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, you can probably call CanCast on the pairs of types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

ArrayVector columns(num_columns);
for (int i = 0; i < num_columns; i++) {
ARROW_ASSIGN_OR_RAISE(columns[i],
compute::Cast(*out->column(i), schema_->field(i)->type()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to check for nulls if the destination fields is non-nullable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
batch = make_batch()
requested_schema = pa.schema([('ints', pa.list_(pa.int64()))])
requested_capsule = requested_schema.__arrow_c_schema__()
# RecordBatch has no cast() method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a GH issue open for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrapped the Table implementation (but could also remove that and open an issue). I considered pasting the implementation as well but assembling the options is not trivial and copying that also seemed problematic.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Feb 23, 2024
@paleolimbot
Copy link
Member Author

@pitrou I think I've implemented your suggestions whenever you have time to take another look!

RecordBatchReader out

if self.schema.names != target_schema.names:
raise ValueError("Target schema's field names are not matching "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but you can use f-strings now rather than explicit format calls.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

python/pyarrow/src/arrow/python/ipc.cc Outdated Show resolved Hide resolved
// Ensure all columns can be cast before succeeding
for (int i = 0; i < num_fields; i++) {
if (!compute::CanCast(*src->field(i)->type(), *schema->field(i)->type())) {
return Status::NotImplemented("Field ", i, " cannot be cast from ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status::TypeError sounds better IMHO. NotImplemented implies that the corresponding cast should be implemented some day.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -2995,7 +3017,7 @@ cdef class RecordBatch(_Tabular):
----------
requested_schema : PyCapsule | None
A PyCapsule containing a C ArrowSchema representation of a requested
schema. PyArrow will attempt to cast the batch to this schema.
schema. PyArrow will attempt to cast each batch to this schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember! (Reverted!)

python/pyarrow/table.pxi Outdated Show resolved Hide resolved
@@ -51,16 +51,16 @@ def get_source(self):

def write_batches(self, num_batches=5, as_table=False):
nrows = 5
schema = pa.schema([('one', pa.float64()), ('two', pa.utf8())])
schema = pa.schema([("one", pa.float64()), ("two", pa.utf8())])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for all these style changes? These don't seem related. Did you apply a formatting tool by mistake?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😬 (Fixed now!)

paleolimbot and others added 2 commits February 26, 2024 16:21
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 26, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Feb 27, 2024
@pitrou pitrou changed the title GH-40066: [Python] Support requested_schema in __arrow_c__stream__() GH-40066: [Python] Support requested_schema in __arrow_c_stream__() Feb 28, 2024
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @paleolimbot !

@pitrou
Copy link
Member

pitrou commented Feb 28, 2024

@jorisvandenbossche @wjones127 Does one of you want to take a quick look here? Otherwise I'll merge.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks a lot!

@jorisvandenbossche jorisvandenbossche merged commit d6b9051 into apache:main Feb 28, 2024
18 of 19 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting changes Awaiting changes label Feb 28, 2024
@github-actions github-actions bot added the awaiting merge Awaiting merge label Feb 28, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…eam__()` (apache#40070)

### Rationale for this change

The `requested_schema` portion of the `__arrow_c_stream__()` protocol methods errored in all cases if passed an unequal schema. There was a note about figuring out how to check the cast before doing it and a comment in apache#40066 about how it should be done lazily. This PR (hopefully) solves both!

### What changes are included in this PR?

- Added `arrow::py::CastingRecordBatchReader`, which wraps a `arrow::RecordBatchReader`, casting each batch as it is pulled.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes: the current approach adds `RecordBatchReader.cast()` as the way to access the casting reader.

* Closes: apache#40066
* GitHub Issue: apache#40066

Lead-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Co-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit d6b9051.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

@paleolimbot paleolimbot deleted the python-casting-reader branch March 7, 2024 02:03
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…eam__()` (apache#40070)

### Rationale for this change

The `requested_schema` portion of the `__arrow_c_stream__()` protocol methods errored in all cases if passed an unequal schema. There was a note about figuring out how to check the cast before doing it and a comment in apache#40066 about how it should be done lazily. This PR (hopefully) solves both!

### What changes are included in this PR?

- Added `arrow::py::CastingRecordBatchReader`, which wraps a `arrow::RecordBatchReader`, casting each batch as it is pulled.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes: the current approach adds `RecordBatchReader.cast()` as the way to access the casting reader.

* Closes: apache#40066
* GitHub Issue: apache#40066

Lead-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Co-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] Support requested_schema in __arrow_c_stream__ implementations
3 participants