ARROW-6321: [Python] Ability to create ExtensionBlock on conversion to pandas #5162

jorisvandenbossche · 2019-08-22T12:00:20Z

https://issues.apache.org/jira/browse/ARROW-6321

This adds code to create pandas ExtensionBlocks on conversion to pandas. For such columns, instead of converting the Arrow array to a numpy array that can be stored in the block, the arrow_to_pandas C++ code sends the actual Arrow array to the pyarrow compat code (no conversion). A separate mechanism, called from pyarrow, can then convert the Arrow array to a pandas ExtensionArray.

~~As an example (to test this), I changed the integer_object_nulls option (if triggered) to return a pandas nullable IntegerArray instead of object dtype array.~~ For now (to test this), I added an extension_columns option to table_to_blockmanager to specify which columns should be put into an ExtensionBlock. The pyarrow code, for now, hardcodes a conversion from a pyarrow integer array to a pandas IntegerArray.

This (hardcoded) example works:

In [1]: df = pd.DataFrame({'a': [1, 2, 3], 'b': np.array([0, 1, None], dtype='object')}) 
   ...: table = pa.table(df)

In [2]: table
Out[2]: 
pyarrow.Table
a: int64
b: int64
metadata
--------
{b'pandas': ...

In [3]: table.to_pandas()   # default, you get floats if there are NULLs
Out[3]: 
   a    b
0  1  0.0
1  2  1.0
2  3  NaN

In [4]: table.to_pandas(integer_object_nulls=True)
Out[4]: 
   a    b
0  1    0
1  2    1
2  3  NaN

In [5]: table.to_pandas(integer_object_nulls=True).dtypes
Out[5]: 
a    int64
b    Int64   # <--- nullable integer type (ExtensionArray)
dtype: object
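
The hardcoded conversion itself is not shown in this description. As a rough sketch of the idea only (not the code in this PR), the following builds a pandas IntegerArray from integer values plus a validity mask, with numpy arrays standing in for the Arrow data buffer and validity bitmap. The helper name `to_integer_array` is made up for illustration.

```python
import numpy as np
import pandas as pd


def to_integer_array(values, validity):
    """Build a pandas nullable IntegerArray from values plus a validity mask.

    `validity` follows Arrow's convention (True means the slot holds a value).
    pandas expects the opposite (True means missing), hence the inversion.
    """
    values = np.asarray(values, dtype="int64")
    validity = np.asarray(validity, dtype=bool)
    return pd.arrays.IntegerArray(values.copy(), ~validity)


# Column "b" from the session above: 0, 1, NULL.
# The value stored under the null slot is an arbitrary placeholder.
b = to_integer_array([0, 1, 0], [True, True, False])
print(b.dtype)
print(b.isna())
```

The point of the sketch is that no float conversion is needed: the integers stay integers and the missing value is carried by the mask.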

What is missing:

- a mechanism to indicate to the C++ ConvertTableToPandas function which columns to convert to an extension block (maybe with a similar option as the current "categorical_columns" option?) EDIT: added such an option (extension_columns)
- a mechanism to know how to convert the Arrow array to a pandas ExtensionArray (this is related to https://issues.apache.org/jira/browse/ARROW-2428)

jorisvandenbossche · 2019-09-11T10:31:31Z

This is ready to be reviewed now.
Note, this PR still includes some custom code in pandas_compat.py to convert a pyarrow array to a pandas IntegerArray. This should of course not stay, but for now it is there to be able to test this.

pitrou

I may be misunderstanding the point of this PR, but it seems this can only convert a given column type and you have to pass the extension columns explicitly. Isn't this the wrong approach?

pitrou · 2019-09-18T15:05:33Z

cpp/src/arrow/python/arrow_to_pandas.cc

+    return Status::OK();
+  }
+
+  Status GetPyResult(PyObject** output) override {


AFAICT this just duplicates the base class implementation. Why did you redefine it?

The PyDict_SetItemString(result, "py_array", py_array_.obj()); is different. This is putting a pyarrow array in the result dict.

It's somewhat of a hack, but it's a way to pass the Arrow data through so that it gets converted elsewhere.
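
To make the pass-through concrete, here is a minimal pure-Python sketch of how the receiving side might treat such a result dict. The "py_array" key and the placement array come from this thread; the "block" key, the function names, and the use of plain lists instead of real arrays are assumptions for illustration.

```python
def reconstruct_block(item, convert_py_array):
    """Turn one per-block result dict into (values, placement).

    Assumed layout: every dict has a "placement" entry. A regular block
    carries already-converted values under "block" (hypothetical key);
    an extension block carries the untouched Arrow data under "py_array".
    """
    if "py_array" in item:
        # No conversion happened in C++, so convert here.
        values = convert_py_array(item["py_array"])
    else:
        values = item["block"]
    return values, item["placement"]


def fake_convert(py_array):
    # Placeholder for a real Arrow -> pandas ExtensionArray conversion.
    return ("extension", list(py_array))


regular = {"block": [1, 2, 3], "placement": [0]}
passthrough = {"py_array": [0, 1, None], "placement": [1]}

regular_result = reconstruct_block(regular, fake_convert)
passthrough_result = reconstruct_block(passthrough, fake_convert)
```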

pitrou · 2019-09-18T15:13:21Z

cpp/src/arrow/python/arrow_to_pandas.cc

+ public:
+  using PandasBlock::PandasBlock;
+
+  // Don't create a block array here, only the placement array


So you're not handling the extension storage anywhere? Why is this?

What do you mean by "extension storage"?
The goal of this ExtensionBlock is to not convert the arrow array to a numpy array, but to pass it through as a pyarrow array to the caller of the ConvertTableToPandas function.

What is maybe confusing is that this is called "ExtensionBlock": it is not necessarily for arrow extension types, but meant for pandas extension arrays (and those two don't necessarily map onto each other).

pitrou · 2019-09-18T15:16:36Z

cpp/src/arrow/python/arrow_to_pandas.cc

@@ -1424,7 +1479,11 @@ class DataFrameBlockCreator {
    for (int i = 0; i < table_->num_columns(); ++i) {
      std::shared_ptr<ChunkedArray> col = table_->column(i);
      PandasBlock::type output_type = PandasBlock::OBJECT;
-      RETURN_NOT_OK(GetPandasBlockType(*col, options_, &output_type));
+      if (extension_columns_.count(table_->field(i)->name())) {


Hmm... I don't understand why we're using an explicit extension_columns. Shouldn't we simply detect an arrow ExtensionType?

I was also confused by this. I looked at the unit test below and there are a couple of different things going on:

Creating pandas ExtensionArray values from built-in Arrow types

Converting Arrow ExtensionType data

This seems to do the former but not the latter. What is the use case for the former, mainly getting IntegerArray out?

jorisvandenbossche · 2019-09-18T16:10:50Z

> Hmm... I don't understand why we're using an explicit extension_columns. Shouldn't we simply detect an arrow ExtensionType?

Let me try to clarify (the fact that both pandas and arrow use "extension" for potentially different things does not make it clearer ..).
I named it here "ExtensionBlock" in the arrow C++ code because it is meant to create a pandas ExtensionBlock (pandas stores the data in blocks in a BlockManager, pandas.ExtensionArrays are stored in an ExtensionBlock). The "extension" here thus refers to pandas' notion of it, not necessarily arrow's notion of "extension type".

So the explicit extension_columns is meant to indicate which columns should be converted to ExtensionBlocks. The reason I do not simply use the arrow types for this (i.e. doing this when the column has an arrow extension type) is that there is not necessarily a 1 to 1 mapping between the extension concept in pandas and the extension concept in arrow. Let me give two examples:

- Pandas has an experimental "nullable integer" type which is implemented as a pandas.ExtensionArray (basically a kind of masked array). Converting that to arrow simply gives you an arrow integer type, not an extension type (since arrow can natively have missing values, we don't need an extension type here).
  But when converting back to pandas, a user might want to opt in to create this nullable integer ExtensionArray instead of a float numpy array. So in such a case we need to convert an IntegerType (not an extension type) in ConvertTableToPandas to an ExtensionBlock.
- Another example is fletcher, where they wrap arrow arrays inside pandas ExtensionArrays to store them directly in pandas DataFrames. Again, those are ExtensionArrays on the pandas side, but they don't need to map to an extension type on the arrow side.
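
The name-based opt-in described above can be sketched in a few lines of Python. This is only an illustration of the selection rule, not the C++ code in this PR: the `BlockType` enum, the `DEFAULT_BLOCK_TYPES` table, and `choose_block_type` are hypothetical, and the type-driven default is heavily simplified (for instance, it ignores nulls).

```python
from enum import Enum


class BlockType(Enum):
    INT64 = "int64"
    FLOAT64 = "float64"
    OBJECT = "object"
    EXTENSION = "extension"


# Hypothetical, simplified stand-in for the type-driven default choice.
DEFAULT_BLOCK_TYPES = {"int64": BlockType.INT64, "double": BlockType.FLOAT64}


def choose_block_type(name, arrow_type, extension_columns):
    """Pick a block type for one column.

    A column named in `extension_columns` always gets an extension block,
    whatever its Arrow type. Every other column falls back to the default.
    """
    if name in extension_columns:
        return BlockType.EXTENSION
    return DEFAULT_BLOCK_TYPES.get(arrow_type, BlockType.OBJECT)


# Two columns with the same built-in Arrow type; only "b" is opted in.
schema = [("a", "int64"), ("b", "int64")]
chosen = {name: choose_block_type(name, typ, {"b"}) for name, typ in schema}
print(chosen)
```

Both columns have the same built-in Arrow type, so a rule based on detecting an Arrow ExtensionType could not tell them apart; the explicit column list can.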

pitrou · 2019-09-18T17:22:08Z

Ok, I'll admit my cluelessness on this :-) Perhaps @wesm and @xhochy want to take a look.
(it seems you'll also need to rebase)

codecov-io · 2019-09-26T10:42:32Z

Codecov Report

Merging #5162 into master will increase coverage by 0.55%.
The diff coverage is 95.18%.

@@            Coverage Diff             @@
##           master    #5162      +/-   ##
==========================================
+ Coverage   88.79%   89.35%   +0.55%     
==========================================
  Files         983      791     -192     
  Lines      132170   116735   -15435     
  Branches     1501        0    -1501     
==========================================
- Hits       117362   104308   -13054     
+ Misses      14443    12427    -2016     
+ Partials      365        0     -365

| Impacted Files | Coverage Δ | |
|---|---|---|
| cpp/src/arrow/python/arrow_to_pandas.h | `100% <ø> (ø)` | ⬆️ |
| cpp/src/arrow/python/pyarrow.cc | `29.54% <100%> (+1.63%)` | ⬆️ |
| python/pyarrow/table.pxi | `86.07% <66.66%> (+0.05%)` | ⬆️ |
| python/pyarrow/pandas_compat.py | `97% <93.75%> (-0.15%)` | ⬇️ |
| python/pyarrow/tests/test_pandas.py | `94.54% <95.23%> (ø)` | ⬆️ |
| cpp/src/arrow/python/arrow_to_pandas.cc | `92.28% <97.56%> (+0.18%)` | ⬆️ |
| python/pyarrow/plasma.py | `58.9% <0%> (-1.37%)` | ⬇️ |
go/arrow/ipc/writer.go
go/arrow/math/uint64_amd64.go
go/arrow/memory/memory_avx2_amd64.go
... and 191 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af097e6...891c216.

wesm · 2019-10-03T17:58:34Z

Ah, I have read @jorisvandenbossche's comments now. Since this is strictly internal and non-public-API code I am okay with it. Do you want to make any more changes to this patch beyond rebasing and getting the tests passing?

jorisvandenbossche · 2019-10-03T18:08:33Z

> Do you want to make any more changes to this patch beyond rebasing and getting the tests passing?

For me it is fine to get this in. It's also included in #5512 since I needed it there. But if we are fine with the arrow_to_pandas.cc ::ExtensionBlock (which is indeed the internal part), then that makes the diff of the other PR a bit smaller.

Will rebase this.

jorisvandenbossche · 2019-10-03T19:21:03Z

The "Ursabot / AMD64 Conda Python 3.6" build is failing on the arrow-flight-test C++ test, not sure if that can be related to the changes in this PR

jorisvandenbossche · 2019-10-04T06:14:07Z

@ursabot build

jorisvandenbossche · 2019-10-04T12:15:46Z

I retriggered the builds, and all green now

wesm · 2019-10-08T22:06:02Z

Travis CI: https://travis-ci.org/jorisvandenbossche/arrow/builds/595300501
Appveyor: https://ci.appveyor.com/project/jorisvandenbossche/arrow/builds/27970985

jorisvandenbossche force-pushed the ARROW-6321-extension-block branch from 4f89d4d to 31541b4 Compare August 23, 2019 09:24

jorisvandenbossche force-pushed the ARROW-6321-extension-block branch from 31541b4 to 3729177 Compare September 11, 2019 10:26

jorisvandenbossche marked this pull request as ready for review September 11, 2019 10:27

pitrou reviewed Sep 18, 2019

jorisvandenbossche force-pushed the ARROW-6321-extension-block branch from 3729177 to 3469d84 Compare September 26, 2019 08:50

jorisvandenbossche mentioned this pull request Sep 26, 2019

ARROW-2428: [Python] Support pandas ExtensionArray in Table.to_pandas conversion #5512

Closed

jorisvandenbossche force-pushed the ARROW-6321-extension-block branch from 969182d to fe0674b Compare October 3, 2019 18:10

kszucs force-pushed the master branch from fc93312 to af097e6 Compare October 5, 2019 09:47

kszucs force-pushed the ARROW-6321-extension-block branch from fe0674b to 891c216 Compare October 5, 2019 10:03

jorisvandenbossche added 4 commits October 8, 2019 22:21

19bee7c ARROW-6321: [Python] Ability to create ExtensionBlock on conversion to pandas
da78d17 pass actual chunked array to python
a2b0c14 add extension_columns option to control which columns get converted to ExtensionBlock
89c225f fixup merge

jorisvandenbossche force-pushed the ARROW-6321-extension-block branch from 891c216 to 89c225f Compare October 8, 2019 20:21

wesm closed this in a8936d8 Oct 8, 2019

jorisvandenbossche deleted the ARROW-6321-extension-block branch October 9, 2019 06:43

asfimport mentioned this pull request Oct 8, 2019

[Python] Ability to create ExtensionBlock on conversion to pandas #16857

Closed
