-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-6321: [Python] Ability to create ExtensionBlock on conversion to pandas #5162
ARROW-6321: [Python] Ability to create ExtensionBlock on conversion to pandas #5162
Conversation
4f89d4d
to
31541b4
Compare
31541b4
to
3729177
Compare
This is ready to be reviewed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be misunderstanding the point of this PR, but it seems this can only convert a given column type and you have to pass the extension columns explicitly. Isn't this the wrong approach?
return Status::OK(); | ||
} | ||
|
||
Status GetPyResult(PyObject** output) override { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT this just duplicates the base class implementation. Why did you redefine it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PyDict_SetItemString(result, "py_array", py_array_.obj());
is different. This is putting a pyarrow array in the result dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's somewhat of a hack but it's a way to pass through the Arrow data so that it gets converted elsewhere
public: | ||
using PandasBlock::PandasBlock; | ||
|
||
// Don't create a block array here, only the placement array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're not handling the extension storage anywhere? Why is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean with "extension storage"?
The goal of this ExtensionBlock is to not convert the arrow array to a numpy array, but to pass it through as a pyarrow array to the caller of the ConvertTableToPandas
function.
What is maybe confusing is that this is called "ExtensionBlock", as it is not necessarily for arrow extension types, but meant for pandas extension arrays (and those two don't necessarily map)
@@ -1424,7 +1479,11 @@ class DataFrameBlockCreator { | |||
for (int i = 0; i < table_->num_columns(); ++i) { | |||
std::shared_ptr<ChunkedArray> col = table_->column(i); | |||
PandasBlock::type output_type = PandasBlock::OBJECT; | |||
RETURN_NOT_OK(GetPandasBlockType(*col, options_, &output_type)); | |||
if (extension_columns_.count(table_->field(i)->name())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I don't understand why we're using an explicit extension_columns
. Shouldn't we simply detect an arrow ExtensionType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also confused by this. I looked at the unit test below and there are a couple of different things going on:
- Creating pandas ExtensionArray values from built-in Arrow types
- Converting Arrow ExtensionType data
This seems to do the former but not the latter. What is the use case for the former, mainly getting IntegerArray out?
Let me try to clarify (the fact that both pandas and arrow use "extension" for potentially different things does not make it clearer ..). So the goal of the explicit
|
3729177
to
3469d84
Compare
Codecov Report
@@ Coverage Diff @@
## master #5162 +/- ##
==========================================
+ Coverage 88.79% 89.35% +0.55%
==========================================
Files 983 791 -192
Lines 132170 116735 -15435
Branches 1501 0 -1501
==========================================
- Hits 117362 104308 -13054
+ Misses 14443 12427 -2016
+ Partials 365 0 -365
Continue to review full report at Codecov.
|
Ah I read @jorisvandenbossche comments now. Since this is strictly internal and non-public-API code I am okay with it. Do you want to make any more changes to this patch beyond rebasing and getting the tests passing? |
For me it is fine to get this in. It's also included in #5512 since I needed it there. But if we are fine with the arrow_to_pandas.cc ::ExtensionBlock (which is indeed the internal part), then that makes the diff of the other PR a bit smaller. Will rebase this. |
969182d
to
fe0674b
Compare
The "Ursabot / AMD64 Conda Python 3.6" build is failing on the |
@ursabot build |
I retriggered the builds, and all green now |
fe0674b
to
891c216
Compare
891c216
to
89c225f
Compare
https://issues.apache.org/jira/browse/ARROW-6321
This adds some code to create pandas ExtensionBlocks on the conversion to pandas. The approach taken is that for this case, instead of converting the Arrow array to a numpy array that can be stored in the block, the arrow_to_pandas C++ code sents the actual Arrow array to the pyarrow compat code (no conversion), and then there can be a mechanism to convert the arrow Array to a pandas ExtensionArray called from pyarrow.
As example (to test this), I changed theFor now (to test this), I added ainteger_object_nulls
option (if triggered) to return a pandas nullable IntegerArray instead of object dtype array.extension_columns
totable_to_blockmanager
to specify which columns should be put into an ExtensionBlock. And then in the pyarrow code for now hardcoded a conversion from pyarrow integer array to pandas IntegerArray.This (hardcoded) example works:
What is missing:
ConvertTableToPandas
function which columns to convert to extension block (maybe with a similar option as the current "categorical_columns" option?) EDIT: added such a column