Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add and register Arrow extension types for Period and Interval #28371

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Sep 10, 2019

Related to #28368, but now for Period and Interval for which we define extension types to store them with metadata in arrow.

Still needs some more tests and fixing corner cases.
We probably also want to consolidate the pyarrow import checking somewhat.

I think a main question is how to deal with the import of pyarrow. The way I did it now makes that it is no longer a lazy import (as it was currently for the parquet functionality).

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong thoughts on importing pyarrow. It'll take a bit of time which is unfortunate, but doing things lazily seems hard.

pandas/core/arrays/period.py Outdated Show resolved Hide resolved
pandas/core/arrays/period.py Outdated Show resolved Hide resolved
pandas/core/arrays/period.py Outdated Show resolved Hide resolved
@@ -1,3 +1,5 @@
from distutils.version import LooseVersion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can compat._optional.import_optional_dependency be used to simplify things?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From quickly looking at that function, I think it can only be used for lazy imports, which is not what I am doing here (but that is exactly something that we might need to discuss)

@jbrockmendel
Copy link
Member

@jorisvandenbossche is this actionable?

@jorisvandenbossche
Copy link
Member Author

I have been updating this PR locally last week, but still thinking through how we can deal with a delayed pyarrow import.

I think a clear disadvantage of always trying to import pyarrow is import time (something we have been trying to reduce).
But the main problem of only importing pyarrow when needed here, is that those extension types need to be registered when arrow is creating the data (eg from reading a parquet file, or receiving IPC messages). So it is not necessarily pandas that knows when it is needed to have those types registered (eg when using pyarrow to read a parquet file and convert to pandas instead of pandas' read_parquet function).

So we could have a publicly exposed register_arrow_types function that does this, but that seems rather inconvenient if users have to call this manually in certain cases ..

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review November 5, 2019 16:49
@jorisvandenbossche jorisvandenbossche changed the title [WIP] ENH: add and register Arrow extension types for Period and Interval ENH: add and register Arrow extension types for Period and Interval Nov 5, 2019
@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Nov 5, 2019

Updated this, still todo:

  • Fix missing value handling for interval types
  • Clean-up conversion of numpy/pandas dtype to arrow dtype
  • Move some arrow conversion code to common utilities (will also be useful for integer/string array)
  • Further look into if a lazy import is possible

@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Nov 5, 2019
@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. Interval Interval data type Period Period data type labels Nov 5, 2019
@jorisvandenbossche
Copy link
Member Author

Any more feedback on the import issue? (see #28371 (comment))

@jreback
Copy link
Contributor

jreback commented Nov 14, 2019

@jorisvandenbossche the way you did the imports is fine; we must lazily import pyarrow or make it a hard dep

i think that would be better to make some functions to reduce the copy paste of what you did here in this PR. basically you isolate the arrow code (meaning you directly import it at the top of the module) in a separate module (file). then import that module conditionally.

@jorisvandenbossche
Copy link
Member Author

the way you did the imports is fine; we must lazily import pyarrow or make it a hard dep

Yes, but the lazily import can be done in two ways: always try on pandas import, or only try when someone uses the parquet functionality.

It's similar as with plotting: matplotlib is not a hard dependency, but before, we always tried to import it (and register some converters), while now we moved to only importing it when someone tries to plot.

@jbrockmendel
Copy link
Member

pyarrow import takes ~160 ms for me, but 135 of that is numpy, so would only add about 25 ms to our import time. If its easy to avoid that'd be nice, but not worth significant gymnastics

@jorisvandenbossche
Copy link
Member Author

Ah, that's interesting (I tried to time it a while ago, but seem to remember is was rather fluctuating).
In such a case, I would personally maybe prefer to always try to import (although matplotlib is also not that much slower to import ..)

@jbrockmendel
Copy link
Member

@jorisvandenbossche pls rebase

@jbrockmendel
Copy link
Member

@jorisvandenbossche pls rebase

@jorisvandenbossche
Copy link
Member Author

I will try to rebase shortly. Any final concerns on the lazy vs non-lazy import?

Right now, I would propose to keep this PR as is (i.e. non-lazy import -> always try to import pyarrow on pandas import if it is installed; Brock showed above that the additional import time (on top of importing numpy) is rather limited), but as I mentioned above, I can also refactor it to have the lazy import

@TomAugspurger
Copy link
Contributor

I’m ok with non-lazy.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with non lazy import check

should likely do this just once and export the _PYARROW_INSTALLED variable like we do for numexpr

import pyarrow

_PYARROW_INSTALLED = True
except ImportError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u make this into a function and put in common location

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this for now into an _arrow_utils.py file in the arrays directory (open for other names), we can then put some common functions in that file as well

pandas/core/arrays/interval.py Show resolved Hide resolved
@@ -1217,3 +1279,55 @@ def maybe_convert_platform_interval(values):
values = np.asarray(values)

return maybe_convert_platform(values)


if _PYARROW_INSTALLED and LooseVersion(pyarrow.__version__) >= LooseVersion("0.15"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__PYARROW_INSTALLED needs to incorporate the version check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the version check into the separate file (and made it a variable), but kept it separate from the import check as different functionalities might need a different pyarrow version

@@ -49,6 +51,13 @@
from pandas.tseries import frequencies
from pandas.tseries.offsets import DateOffset, Tick, _delta_to_tick

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

pandas/tests/arrays/test_period.py Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member Author

There is a CI failure for the Linux py36_locale build: apparently the pyarrow install failed there (and for the other tests, this gets ignored / skipped, so that's the reason we didn't see it yet). So that raises two issues: 1) we should fix that CI env and 2) we should probably catch a more general error to avoid that a wrong pyarrow installation let the pandas import fail.

@jreback
Copy link
Contributor

jreback commented Jan 6, 2020

There is a CI failure for the Linux py36_locale build: apparently the pyarrow install failed there (and for the other tests, this gets ignored / skipped, so that's the reason we didn't see it yet). So that raises two issues: 1) we should fix that CI env and 2) we should probably catch a more general error to avoid that a wrong pyarrow installation let the pandas import fail.

  1. absolutely not

we need to see failures that are not the result of expected things

@jorisvandenbossche
Copy link
Member Author

OK, so the reason is not a failed installation, but actually that pyarrow up to version 0.12 also tries to import pandas, so you get a circular dependency which outs itself in the AttributeError.

So, one option is to increase the minimum pyarrow version to 0.13. But, the problem with that is that this will still give a confusing error message when people have pyarrow 0.12 installed.

@jreback
Copy link
Contributor

jreback commented Jan 6, 2020

OK, so the reason is not a failed installation, but actually that pyarrow up to version 0.12 also tries to import pandas, so you get a circular dependency which outs itself in the AttributeError.

So, one option is to increase the minimum pyarrow version to 0.13. But, the problem with that is that this will still give a confusing error message when people have pyarrow 0.12 installed.

we already bumped to 0.12 for various things, you could push this to 0.13 (no objection). these conversions require a higher version anyhow? (e.g. interval/period)?

@jorisvandenbossche
Copy link
Member Author

OK, for now (to get this PR in a mergeable state), I moved away from always trying to import. This will give some realistic corner cases (like reading a parquet file with pyarrow instead of with pandas will not give the extension type), but we can see this new feature as experimental anyway, and to be improved later.

By not importing it by default, we avoid the circular import problem with old pyarrow versions. Bumping the required pyarrow version might solve this, but that still causes pyarrow to become un-importable, if you install pyarrow 0.12 and new pandas side by side with a very cryptic error message.
I might still try to add something more clever (based on pkg_resources to check the installed pyarrow version, to avoid needing to import pyarrow to know if it is recent enough to import, but that can go in a separate PR).

@jorisvandenbossche
Copy link
Member Author

The remaining "failure" is codecov because the functionality I added is not run in the coverage build (only with arrow master)

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions:

  1. Do we have any tests that ensure pyarrow is imported with pandas? IIRC that was somewhat important for getting the types registered? (just read ENH: add and register Arrow extension types for Period and Interval #28371 (comment))
  2. What's the behavior for __arrow_array__ pyarrow 0.15 and earlier? Is it just not called, so we don't need to worry about checking the pyarrow version within there? Or will the line from pandas.core.arrays._arrow_utils import ArrowIntervalType raise an ImportError?

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jan 8, 2020

What's the behavior for arrow_array pyarrow 0.15 and earlier? Is it just not called, so we don't need to worry about checking the pyarrow version within there?

Indeed, it should never be called with versions older than 0.15 (unless you would manually call the method, but the method is only called by pyarrow starting from 0.15)

@TomAugspurger
Copy link
Contributor

Great, +1 then.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. some small questions. ping on green.

(:meth:`~DataFrame.to_parquet` / :func:`read_parquet`) using the `'pyarrow'` engine
now preserve those data types with pyarrow >= 1.0.0 (:issue:`20612`).
now preserve those data types with pyarrow >= 0.16.0 (:issue:`20612`, :issue:`28371`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be 0.15?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the pandas -> arrow conversion protocol was already included in 0.15, but the other way (for a full roundtrip) only landed after 0.15. It was just decided that the next arrow release will be 0.16 and not 1.0, so therefore changed the text here.

# Arrow interaction


pyarrow_skip = td.skip_if_no("pyarrow", min_version="0.15.1.dev")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, the pandas -> arrow part already works on 0.15 in general, but due to a "bug" in pyarrow (due to the use of .values, see apache/arrow#5753), period and interval ExtensionArrays specifically don't work yet (because .values returns an object array for those, and not an EA). So therefore those tests for period and interval also only work with pyarrow higher than 0.15.

@jorisvandenbossche jorisvandenbossche merged commit 2198f51 into pandas-dev:master Jan 9, 2020
@jorisvandenbossche jorisvandenbossche deleted the arrow-extension-types branch January 9, 2020 08:34
fsaintjacques pushed a commit to apache/arrow that referenced this pull request Jan 9, 2020
…ith pandas master

https://issues.apache.org/jira/browse/ARROW-7527

Period dtype is now supported in pandas <-> arrow conversions with pandas master (pandas-dev/pandas#28371)

Closes #6147 from jorisvandenbossche/ARROW-7527 and squashes the following commits:

a64da2c <Joris Van den Bossche> ARROW-7527:  Fix pandas/feather tests for unsupported types with pandas master

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: François Saint-Jacques <fsaintjacques@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Interval Interval data type Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants