Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow #28368

Merged

Conversation

jorisvandenbossche
Copy link
Member

Adding custom conversion of IntegerArray to an Arrow array, which makes that this can also be written to parquet.
Currently it is only one way, for read_parquet it will come back as int or float (depending on presence of missing values), but fixing this is also being discussed (https://issues.apache.org/jira/browse/ARROW-2428).

The tests currently require the master version of Arrow, which we don't test. I can assure that the tests pass locally for me, but is this something we want to merge without coverage on CI?
(in principle we could add Arrow master in eg the numpy-dev build or another one, but that is a bit more work, so not sure that is worth the currently limited number of tests we have relying on arrow)

@jorisvandenbossche jorisvandenbossche added Compat pandas objects compatability with Numpy or Python functions Enhancement labels Sep 10, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Sep 10, 2019
@TomAugspurger
Copy link
Contributor

No preference on CI. If there are nightly binaries available then OK with adding them to the numpydev build.

Question: does pa.array(obj) call obj.__arrow_array__ automatically?

@jorisvandenbossche
Copy link
Member Author

Question: does pa.array(obj) call obj.arrow_array automatically?

What do you mean with "automatically" ? It will call it if the method exists, and for Series it will also check if the underlying values has the method (you can see the exact implementation here: apache/arrow#5106)

@TomAugspurger
Copy link
Contributor

Yeah, that's what I meant by automatically (rather than going through __iter__).

@WillAyd
Copy link
Member

WillAyd commented Sep 11, 2019

Do we just want to take on array as a dependency? I know that discussion has come up in the past without resolution. I don't object to it so if so could simplify some things here

@jorisvandenbossche
Copy link
Member Author

For this PR pyarrow does not need to be a hard dependency, and actually here a lazy import is enough. It's only for #28371 that a lazy import is more problematic (but that still doesn't need to make it a hard dependency).

Since those PRs are about IO to pyarrow, you only need this if you are actually using pyarrow for others reasons (eg for IPC, for parquet, ..), not for internal functionality in pandas. So I would keep the discussion of making pyarrow a hard dependency for later if we would make more use of it in pandas itself.

@jbrockmendel
Copy link
Member

#28371 does something similar for Period and Interval. Are there more coming after this? Would it make sense to collect these somewhere like compat._arrow? I don't have a problem with the placement in this PR, just thinking out loud.

@jorisvandenbossche
Copy link
Member Author

Would it make sense to collect these somewhere like compat._arrow?

The actual __arrow_array__ of course needs to be defined on the actual IntegerArray, but the implementation itself could call out to another module. For the type classes I am defining in #28371 this might make sense, but for this PR the implementation is literally one line (pa.array(self._data, mask=self._mask, type=type)), so IMO that is not worth putting in a separate function somewhere else.

But let's discuss further in #28371

@jorisvandenbossche
Copy link
Member Author

Any code comments?
I think this one is good to go, #28371 needs more discussion though (where to put it, always try to import pyarrow or not, ..)

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some stylistic comments. So is everyone else aligned on this dunder for conversion? Haven't been too involved in conversation but figured worth double checking before moving forward

@@ -19,6 +21,13 @@
from pandas.tests.extension.base import BaseOpsUtil
import pandas.util.testing as tm

try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to replace this with compat._optional.import_optional_dependency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never used compat._optional.import_optional_dependency before, so correct if me if I am wrong: it seems that method is typically used in the code (not tests) and is meant to raise an error or return the module (so eg in functions that use an optional dependency).
So I would still need to catch the error, I think? In which case I am not sure it is necessarily clearer, or deduplicating code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see it has a raise_on_missing=False option. But in theory I then also need to specify on_version='ignore' to not have an error or warning on old pyarrow versions.

But, I can maybe actually replace it with pytest.importorskip inside the test, which seems better suited for test cases. That should also simplify it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, we have our own wrapper around that as td.skip_if_no decorator

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, simplified it with td.skip_if_no. I was actually already using it in the other test as well .. (so I was being a bit blind :-))

not _PYARROW_INSTALLED
or _PYARROW_INSTALLED
and LooseVersion(pyarrow.__version__) < LooseVersion("0.14.1.dev"),
reason="pyarrow >= 0.15.0 required",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason for this requiring 0.15.0 but the next test requiring 0.14.1.dev?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is to be able to test it with arrow master locally. We can also wait until final 0.15.0 is released, and then I can change this check. But in practice it gives the same.

@jorisvandenbossche
Copy link
Member Author

So is everyone else aligned on this dunder for conversion?

Well, it's a protocol of pyarrow, not pandas, and the decision is made in pyarrow now (it would be like pandas not liking numpy's __array__). Of course, feedback on the protocol for conversion to arrow is certainly welcome, it's very new so can still be changed.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, if you could comment on https://github.com/pandas-dev/pandas/pull/28368/files#r323810167.

I suspect that >= 0.14.1.dev is just going to be pyarrow 0.15.0 or above? No 0.14.2 planned?

@jorisvandenbossche
Copy link
Member Author

it's a protocol of pyarrow,

For context, see also #20612 (comment)

@WillAyd
Copy link
Member

WillAyd commented Sep 12, 2019 via email

@TomAugspurger
Copy link
Contributor

Probably want a release note? Or... perhaps not since pyarrow 0.15 isn't out yet?

Feel free to merge if a release note isn't needed.

@TomAugspurger TomAugspurger merged commit 34fff1f into pandas-dev:master Sep 12, 2019
@TomAugspurger
Copy link
Contributor

Thanks.

jmg7173 pushed a commit to jmg7173/pandas that referenced this pull request Sep 14, 2019
…andas-dev#28368)

* ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow

* simplify pyarrow version check in tests

* add whatsnew
@jorisvandenbossche jorisvandenbossche deleted the integer-arrow-array branch October 23, 2019 13:51
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
…andas-dev#28368)

* ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow

* simplify pyarrow version check in tests

* add whatsnew
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
…andas-dev#28368)

* ENH: Add IntegerArray.__arrow_array__ for custom conversion to Arrow

* simplify pyarrow version check in tests

* add whatsnew
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants