-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-40066: [Python] Support requested_schema
in __arrow_c_stream__()
#40070
GH-40066: [Python] Support requested_schema
in __arrow_c_stream__()
#40070
Conversation
|
@@ -19,6 +19,7 @@ | |||
|
|||
#include <memory> | |||
|
|||
#include "arrow/compute/api.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: arrow/compute/cast.h
is probably sufficient and will pull less headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
// Try to cast an empty version of all the columns before succceeding | ||
compute::CastOptions options; | ||
for (int i = 0; i < num_fields; i++) { | ||
ARROW_ASSIGN_OR_RAISE(auto empty_array, MakeEmptyArray(src->field(i)->type())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, you can probably call CanCast
on the pairs of types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
ArrayVector columns(num_columns); | ||
for (int i = 0; i < num_columns; i++) { | ||
ARROW_ASSIGN_OR_RAISE(columns[i], | ||
compute::Cast(*out->column(i), schema_->field(i)->type())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to check for nulls if the destination fields is non-nullable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
python/pyarrow/tests/test_cffi.py
Outdated
batch = make_batch() | ||
requested_schema = pa.schema([('ints', pa.list_(pa.int64()))]) | ||
requested_capsule = requested_schema.__arrow_c_schema__() | ||
# RecordBatch has no cast() method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a GH issue open for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrapped the Table
implementation (but could also remove that and open an issue). I considered pasting the implementation as well but assembling the options is not trivial and copying that also seemed problematic.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
56dc004
to
ff91019
Compare
@pitrou I think I've implemented your suggestions whenever you have time to take another look! |
RecordBatchReader out | ||
|
||
if self.schema.names != target_schema.names: | ||
raise ValueError("Target schema's field names are not matching " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but you can use f-strings now rather than explicit format
calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
// Ensure all columns can be cast before succeeding | ||
for (int i = 0; i < num_fields; i++) { | ||
if (!compute::CanCast(*src->field(i)->type(), *schema->field(i)->type())) { | ||
return Status::NotImplemented("Field ", i, " cannot be cast from ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Status::TypeError
sounds better IMHO. NotImplemented
implies that the corresponding cast should be implemented some day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
python/pyarrow/table.pxi
Outdated
@@ -2995,7 +3017,7 @@ cdef class RecordBatch(_Tabular): | |||
---------- | |||
requested_schema : PyCapsule | None | |||
A PyCapsule containing a C ArrowSchema representation of a requested | |||
schema. PyArrow will attempt to cast the batch to this schema. | |||
schema. PyArrow will attempt to cast each batch to this schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember! (Reverted!)
python/pyarrow/tests/test_ipc.py
Outdated
@@ -51,16 +51,16 @@ def get_source(self): | |||
|
|||
def write_batches(self, num_batches=5, as_table=False): | |||
nrows = 5 | |||
schema = pa.schema([('one', pa.float64()), ('two', pa.utf8())]) | |||
schema = pa.schema([("one", pa.float64()), ("two", pa.utf8())]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for all these style changes? These don't seem related. Did you apply a formatting tool by mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😬 (Fixed now!)
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
requested_schema
in __arrow_c__stream__()
requested_schema
in __arrow_c_stream__()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @paleolimbot !
@jorisvandenbossche @wjones127 Does one of you want to take a quick look here? Otherwise I'll merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks a lot!
…eam__()` (apache#40070) ### Rationale for this change The `requested_schema` portion of the `__arrow_c_stream__()` protocol methods errored in all cases if passed an unequal schema. There was a note about figuring out how to check the cast before doing it and a comment in apache#40066 about how it should be done lazily. This PR (hopefully) solves both! ### What changes are included in this PR? - Added `arrow::py::CastingRecordBatchReader`, which wraps a `arrow::RecordBatchReader`, casting each batch as it is pulled. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes: the current approach adds `RecordBatchReader.cast()` as the way to access the casting reader. * Closes: apache#40066 * GitHub Issue: apache#40066 Lead-authored-by: Dewey Dunnington <dewey@fishandwhistle.net> Co-authored-by: Dewey Dunnington <dewey@voltrondata.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit d6b9051. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…eam__()` (apache#40070) ### Rationale for this change The `requested_schema` portion of the `__arrow_c_stream__()` protocol methods errored in all cases if passed an unequal schema. There was a note about figuring out how to check the cast before doing it and a comment in apache#40066 about how it should be done lazily. This PR (hopefully) solves both! ### What changes are included in this PR? - Added `arrow::py::CastingRecordBatchReader`, which wraps a `arrow::RecordBatchReader`, casting each batch as it is pulled. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes: the current approach adds `RecordBatchReader.cast()` as the way to access the casting reader. * Closes: apache#40066 * GitHub Issue: apache#40066 Lead-authored-by: Dewey Dunnington <dewey@fishandwhistle.net> Co-authored-by: Dewey Dunnington <dewey@voltrondata.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Rationale for this change
The
requested_schema
portion of the__arrow_c_stream__()
protocol methods errored in all cases if passed an unequal schema. There was a note about figuring out how to check the cast before doing it and a comment in #40066 about how it should be done lazily. This PR (hopefully) solves both!What changes are included in this PR?
arrow::py::CastingRecordBatchReader
, which wraps aarrow::RecordBatchReader
, casting each batch as it is pulled.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes: the current approach adds
RecordBatchReader.cast()
as the way to access the casting reader.requested_schema
in__arrow_c_stream__
implementations #40066requested_schema
in__arrow_c_stream__
implementations #40066