Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[POC] CLN: use ExtensionBlock for datetime tz data #27072

Conversation

jorisvandenbossche
Copy link
Member

Trying out to simply use the ExtensionBlock instead of the current DatetimeTZBlock for datetime tz data (with the exception of one override to have .values backwards compatible), to see what breaks.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 27, 2019

The failing tests that I started to look at, all indicate in some way to a bug or a missing aspect of the EA interface

pandas/tests/extension/test_datetime.py::TestMethods::test_combine_le[US/Central]

  • The function passed to combine results in bools, but this gets coerced to DatetimeArray instead of keeping them as bool. EAs needs some mechanism to not convert certain types (also eg for apply we first want to try to convert back to original type, but if that fails still return the result).
    This is actually the responsability of _from_sequence (I think), but this is currently implemented too liberally for DatetimeArray (it basically does a general to_datetime parsing, so also allows strings, etc)

pandas/tests/frame/test_alter_axes.py::TestDataFrameAlterAxes::test_convert_dti_to_series

  • This one (and a bunch of others as well) fails if using directly ExtensionBlock instead of the class overriding external_values. If the result of .values changes to a tz-aware DatetimeArray instead of naive numpy array, that breaks quite some tests. But for now preventing that with the limited subclass with one override.

pandas/tests/frame/test_combine_concat.py::TestDataFrameConcatCommon::test_concat_multiple_tzs

  • _concat_same_type assumes exactly the same dtypes, so different tz violates this. For Period you have a similar issue, but we still have custom code in concat.py to handle our different EAs, which is of course not a general solution for external EAs (so a short-term fix for datetimetz would be to also special case this in the concat code as is done for all the others, long term we need to think about a better solution).

pandas/tests/frame/test_quantile.py::TestDataFrameQuantile::test_quantile_box

  • Block.quantile has a special case for datetimetz that now got skipped because self.is_datetimetz is False (but we can change that if we would want).

pandas/tests/frame/test_timeseries.py::TestDataFrameTimeSeriesMethods::test_diff_datetime_axis0[UTC]
pandas/tests/frame/test_timeseries.py::TestDataFrameTimeSeriesMethods::test_diff_datetime_axis1[UTC]

  • DatetimeTZBlock has a custom diff method. The general Block.diff is not working for ExtensionBlock, eg also for IntegerArray this fails:

    In [2]: pd.DataFrame({'a': pd.Series([1, 2, 3, 4, 5], dtype='Int64')}).diff()
    ...
    IndexError: list assignment index out of range
    

    Because 1D data is passed to algos.diff, and the axis that is specified assumes 2D data.

    So ExtensionBlock.diff should probably be added.

pandas/tests/series/test_datetime_values.py::TestSeriesDatetimeValues::test_setitem_with_different_tz

pandas/tests/series/test_missing.py::TestSeriesMissingData::test_datetime64_tz_fillna

  • filling with non-matching timezone (similar as issue above)

pandas/tests/series/test_quantile.py::TestSeriesQuantile::test_quantile_box

  • similar quantile failure as above

pandas/tests/series/indexing/test_indexing.py::test_basic_getitem_setitem_corner

  • For some reason that I didn't figure out yet, the indexing Engine now throws a different error ("SystemError: <class 'pandas._libs.hashtable._memoryviewslice'> returned a result with an error set") that I didn't figure out yet

@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jun 27, 2019
@jreback
Copy link
Contributor

jreback commented Jul 11, 2019

closing as I think @jbrockmendel will be doing some work on this.

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel is that correct?

I need to convert the points above in issues, as we should discuss what to do with them.

@jbrockmendel
Copy link
Member

is that correct?

My last few PRs have been in/around the _try_coerce_args and _can_hold_element methods in Block subclasses. To the extent that that moves towards this goal, sure.

I expect that ExtensionBlock._can_hold_element will end up being something like:

def _can_hold_elemenrt(self, element):
     try:
          self._holder._from_sequence([element], dtype=self.dtype)
      except (ValueError, TypeError):
          return False
      return True

or even simpler if we have a validate_fill_value on the EA.

@jbrockmendel
Copy link
Member

@jorisvandenbossche some thoughts here:

  • pls rebase just to keep from that being a hassle down the road
  • instead of DatetimeTZBlock(ExtensionBlock), maybe DatetimeTZBlock(DatetimeBlock, ExtensionBlock) would be an easier intermediate step?
  • A lot of the special treatment for DatetimeTZBlock.values comes down to needing to pass ndarray to only a few places: nanops.nanpercentile and groupby reductions are the ones I see in tests. getting those methods to support DTA will make either of our approaches easier

@jreback
Copy link
Contributor

jreback commented Sep 8, 2019

closing as stale, likely needs a bit of work to get passing

@jbrockmendel
Copy link
Member

@jorisvandenbossche is this actually going anywhere? trying to clear the queue

@jbrockmendel
Copy link
Member

Closing to clear the queue. pls reopen if/when it becomes actionable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants