Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6321: [Python] Ability to create ExtensionBlock on conversion to pandas #5162

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 22, 2019

https://issues.apache.org/jira/browse/ARROW-6321

This adds some code to create pandas ExtensionBlocks on the conversion to pandas. The approach taken is that for this case, instead of converting the Arrow array to a numpy array that can be stored in the block, the arrow_to_pandas C++ code sents the actual Arrow array to the pyarrow compat code (no conversion), and then there can be a mechanism to convert the arrow Array to a pandas ExtensionArray called from pyarrow.

As example (to test this), I changed the integer_object_nulls option (if triggered) to return a pandas nullable IntegerArray instead of object dtype array. For now (to test this), I added a extension_columns to table_to_blockmanager to specify which columns should be put into an ExtensionBlock. And then in the pyarrow code for now hardcoded a conversion from pyarrow integer array to pandas IntegerArray.

This (hardcoded) example works:

In [1]: df = pd.DataFrame({'a': [1, 2, 3], 'b': np.array([0, 1, None], dtype='object')}) 
   ...: table = pa.table(df)

In [2]: table
Out[2]: 
pyarrow.Table
a: int64
b: int64
metadata
--------
{b'pandas': ...

In [3]: table.to_pandas()   # default, you get floats if there are NULLs
Out[3]: 
   a    b
0  1  0.0
1  2  1.0
2  3  NaN

In [4]: table.to_pandas(integer_object_nulls=True)
Out[4]: 
   a    b
0  1    0
1  2    1
2  3  NaN

In [5]: table.to_pandas(integer_object_nulls=True).dtypes
Out[5]: 
a    int64
b    Int64   # <--- nullable integer type (ExtensionArray)
dtype: object

What is missing:

  • a mechanism to indicate to the C++ ConvertTableToPandas function which columns to convert to extension block (maybe with a similar option as the current "categorical_columns" option?) EDIT: added such a column
  • a mechanism to know how to convert the Arrow array to a pandas ExtensionArray (this is related to https://issues.apache.org/jira/browse/ARROW-2428

@jorisvandenbossche
Copy link
Member Author

This is ready to be reviewed now.
Note, this PR still includes some custom code in pandas_compat.py to convert a pyarrow array to a pandas IntegerArray. This is of course something that should not stay, but for now is to be able to test this.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be misunderstanding the point of this PR, but it seems this can only convert a given column type and you have to pass the extension columns explicitly. Isn't this the wrong approach?

return Status::OK();
}

Status GetPyResult(PyObject** output) override {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this just duplicates the base class implementation. Why did you redefine it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PyDict_SetItemString(result, "py_array", py_array_.obj()); is different. This is putting a pyarrow array in the result dict.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's somewhat of a hack but it's a way to pass through the Arrow data so that it gets converted elsewhere

public:
using PandasBlock::PandasBlock;

// Don't create a block array here, only the placement array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're not handling the extension storage anywhere? Why is this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with "extension storage"?
The goal of this ExtensionBlock is to not convert the arrow array to a numpy array, but to pass it through as a pyarrow array to the caller of the ConvertTableToPandas function.

What is maybe confusing is that this is called "ExtensionBlock", as it is not necessarily for arrow extension types, but meant for pandas extension arrays (and those two don't necessarily map)

@@ -1424,7 +1479,11 @@ class DataFrameBlockCreator {
for (int i = 0; i < table_->num_columns(); ++i) {
std::shared_ptr<ChunkedArray> col = table_->column(i);
PandasBlock::type output_type = PandasBlock::OBJECT;
RETURN_NOT_OK(GetPandasBlockType(*col, options_, &output_type));
if (extension_columns_.count(table_->field(i)->name())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I don't understand why we're using an explicit extension_columns. Shouldn't we simply detect an arrow ExtensionType?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also confused by this. I looked at the unit test below and there are a couple of different things going on:

  • Creating pandas ExtensionArray values from built-in Arrow types
  • Converting Arrow ExtensionType data

This seems to do the former but not the latter. What is the use case for the former, mainly getting IntegerArray out?

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Sep 18, 2019

Hmm... I don't understand why we're using an explicit extension_columns. Shouldn't we simply detect an arrow ExtensionType?

Let me try to clarify (the fact that both pandas and arrow use "extension" for potentially different things does not make it clearer ..).
I named it here "ExtensionBlock" in the arrow C++ code because it is meant to create a pandas ExtensionBlock (pandas stores the data in blocks in a BlockManager, pandas.ExtensionArrays are stored in an ExtensionBlock). The "extension" here thus refers to pandas' notion of it, not necessarily arrow's notion of "extension type".

So the goal of the explicit extension_columns is meant to indicate which columns should be converted to ExtensionBlocks. The reason that I not simply use the arrow types for this (i.e. doing this when the column has an arrow extension type), is because there is not necessarily a 1 to 1 mapping of the extension concept in pandas and the extension concept in arrow. Let me give two examples:

  • Pandas has an experimental "nullable integer" type which is implemented as a pandas.ExtensionArray (basically kind of a masked array). Converting that to arrow gives you simply an arrow integer type, and not an extension type (since arrow can natively have missing values, we don't need an extension type here).
    But when you want to convert back to pandas, a user might want to opt in to create this nullable integer ExtensionArray instead of a float numpy array. So in such a case we need to convert a IntegerType (not an extension type) in ConvertTableToPandas to an ExtensionBlock.
  • Another example is fletcher, where they wrap arrow arrays inside pandas ExtensionArrays to store them directly in pandas DataFrames. Again, those are ExtensionArrays on the pandas side, but don't need to map to an extension type on the arrow side.

@pitrou
Copy link
Member

pitrou commented Sep 18, 2019

Ok, I'll admit my cluelessness on this :-) Perhaps @wesm and @xhochy want to take a look.
(it seems you'll also need to rebase)

@codecov-io
Copy link

codecov-io commented Sep 26, 2019

Codecov Report

Merging #5162 into master will increase coverage by 0.55%.
The diff coverage is 95.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5162      +/-   ##
==========================================
+ Coverage   88.79%   89.35%   +0.55%     
==========================================
  Files         983      791     -192     
  Lines      132170   116735   -15435     
  Branches     1501        0    -1501     
==========================================
- Hits       117362   104308   -13054     
+ Misses      14443    12427    -2016     
+ Partials      365        0     -365
Impacted Files Coverage Δ
cpp/src/arrow/python/arrow_to_pandas.h 100% <ø> (ø) ⬆️
cpp/src/arrow/python/pyarrow.cc 29.54% <100%> (+1.63%) ⬆️
python/pyarrow/table.pxi 86.07% <66.66%> (+0.05%) ⬆️
python/pyarrow/pandas_compat.py 97% <93.75%> (-0.15%) ⬇️
python/pyarrow/tests/test_pandas.py 94.54% <95.23%> (ø) ⬆️
cpp/src/arrow/python/arrow_to_pandas.cc 92.28% <97.56%> (+0.18%) ⬆️
python/pyarrow/plasma.py 58.9% <0%> (-1.37%) ⬇️
go/arrow/ipc/writer.go
go/arrow/math/uint64_amd64.go
go/arrow/memory/memory_avx2_amd64.go
... and 191 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af097e6...891c216. Read the comment docs.

@wesm
Copy link
Member

wesm commented Oct 3, 2019

Ah I read @jorisvandenbossche comments now. Since this is strictly internal and non-public-API code I am okay with it. Do you want to make any more changes to this patch beyond rebasing and getting the tests passing?

@jorisvandenbossche
Copy link
Member Author

Do you want to make any more changes to this patch beyond rebasing and getting the tests passing?

For me it is fine to get this in. It's also included in #5512 since I needed it there. But if we are fine with the arrow_to_pandas.cc ::ExtensionBlock (which is indeed the internal part), then that makes the diff of the other PR a bit smaller.

Will rebase this.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Oct 3, 2019

The "Ursabot / AMD64 Conda Python 3.6" build is failing on the arrow-flight-test C++ test, not sure if that can be related to the changes in this PR

@jorisvandenbossche
Copy link
Member Author

@ursabot build

@jorisvandenbossche
Copy link
Member Author

I retriggered the builds, and all green now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants