Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: tighten DTA/TDA _from_sequence signature #37179

Closed
wants to merge 28 commits into from

Conversation

jbrockmendel
Copy link
Member

cc @jorisvandenbossche

need to decide how to handle test_from_2d_ndarray_with_dtype which this currently xfails

@jreback jreback added the API - Consistency Internal Consistency of API/Behavior label Oct 17, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a substantial (but good) api change, or is this just on master as _from_sequence is actually strict? (in 1.1.x)

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Oct 20, 2020
@jbrockmendel
Copy link
Member Author

this looks like a substantial (but good) api change, or is this just on master as _from_sequence is actually strict? (in 1.1.x)

not just on master, this makes _from_sequence substantially stricter than 1.1.x

@jorisvandenbossche
Copy link
Member

Thanks for looking at this!

Now, for the actual behavioural change, _from_sequence is still being used for eg pd.array(.., dtype="datetime64[ns"]) as well, so we first need to decide what behaviour we want for that.
Otherwise could also keep this in a _from_sequence_strict (and rename _from_sequence_not_strict to _from_sequence)

pass
elif scalars.dtype == object:
if isinstance(scalars, ABCMultiIndex):
raise TypeError("Cannot create a DatetimeArray from MultiIndex")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tested?

# with inferred_type as above?
pass
else:
msg = f"dtype {scalars.dtype} cannot be converted to datetime64[ns]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you ensure we have a test that hits this

# TODO: should go through from_sequence_of_strings?
pass
else:
raise TypeError(data.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you ensure we have a test that hits hits

@jbrockmendel
Copy link
Member Author

just pushed with a couple new tests that get us to full coverage on the affected methods

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of comments on is this added code tested

@@ -2869,6 +2869,7 @@ def test_from_tzaware_mixed_object_array(self):
]
assert (res.dtypes == expected_dtypes).all()

@pytest.mark.xfail(reason="DatetimeArray._from_sequence no longer accepts i8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a note for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the question is if we want to disable/deprecate this or find a way to make it work

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this something we want to support or disallow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the question is not so much about _from_sequence, but rather about our public constructors (Series(), Index(), array()).
Currently, those support converting integers to datetime dtype. For example:

pd.Series([1, 2, 3], dtype="datetime64[ns]")

works fine (and the same for the other two).
So the question is if we want to keep this kind of behaviour. Personally I would say there is not much harm in allowing it (also given that eg numpy and pyarrow have the same behaviour), it's only when there are timezones that there could be a bit more ambiguity potentially (but eg for timedelta that's already not relevant).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree. Any thoughts on where to catch this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on where to catch this case?

Not fully sure what you mean, but (until #33254 is resolved) doing such conversions is the responsibility of _from_sequence (for pd.array), so if we want to keep this capability, at the moment it is _from_sequence that needs to accept ints.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fully sure what you mean

I mean that we could make this Series construction work without loosening the _from_sequence restrictions by and instead calling [??] in this case.

so if we want to keep this capability, at the moment it is _from_sequence that needs to accept ints.

To be clear, is this what you are suggesting we do? My impression from previous conversations is that you are unambiguously in favor of this tightening.

Copy link
Member

@jorisvandenbossche jorisvandenbossche Nov 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My earlier suggestion at #37179 (comment) was to use _from_sequence/_from_sequence_strict (instead of _from_sequence_non_strict/_from_sequence), until we decide on better naming for this. So keeping the non-strict behaviour of _from_sequence for now.
That already cleans up the code (separating both behaviours more cleanly), without being blocked on API discussions.

My impression from previous conversations is that you are unambiguously in favor of this tightening.

Yes, for sure. But before we can implement that, we still need to come up with an API for the non-strict conversion as well, as that is still something we need to support. That's the whole discussion of #33254

@jreback
Copy link
Contributor

jreback commented Nov 4, 2020

if u can rebase; i think a couple of questions

@jbrockmendel
Copy link
Member Author

cc @WillAyd most recent commit made small edits to json code, can you take a look

@WillAyd
Copy link
Member

WillAyd commented Nov 5, 2020

lgtm - I think the explicitness here is nice

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche any objection to this as a temporary workaround? im pretty sure It will allow us to clean up a bunch of code in .isin and .equals xref #37528

@jbrockmendel
Copy link
Member Author

updated per discussion so that _from_sequence remains non-strict while a new _from_sequence_strict has the hopefully-future behavior of from_sequence.

im hoping to use from_sequence strict to simplify+de-duplicate .equals and .isin

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

General question: in the tslib conversion functions, those are all "generic" in the sense of accepting a mixture of string, datetime, Timestamp, int, etc (like array_to_datetime seems to do) ? We don't have one that is more strict on the input that could be used for this purpose? (instead of checking up front with infer_dtype)

pandas/tests/arrays/test_array.py Outdated Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Outdated Show resolved Hide resolved
pandas/core/arrays/timedeltas.py Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member Author

We don't have one that is more strict on the input that could be used for this purpose?

Good idea, we can use tslibs.conversion.datetime_to_datetime64. AFAICT we dont have an equivalent for timedelta, but it wouldn't be hard to implement one. libperiod.extract_ordinals can also be adapted to do the same thing for PeriodArray (separate PR).

Will update.

@jbrockmendel
Copy link
Member Author

mothballing to clear the queue while i address comments locally

@jbrockmendel jbrockmendel added the Mothballed Temporarily-closed PR the author plans to return to label Nov 21, 2020
@jbrockmendel jbrockmendel deleted the ref-signatures-2 branch November 20, 2021 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior ExtensionArray Extending pandas with custom dtypes or arrays. Mothballed Temporarily-closed PR the author plans to return to
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants