
API: tighten DTA/TDA _from_sequence signature #37179

Closed
wants to merge 28 commits
Changes from 13 commits

Commits (28)
eab1030
EA: tighten TimedeltaArray._from_sequence signature
jbrockmendel Sep 29, 2020
137fdf1
Merge branch 'master' of https://github.com/pandas-dev/pandas into st…
jbrockmendel Oct 1, 2020
e54ed75
EA: Tighten signature on DatetimeArray._from_sequence
jbrockmendel Sep 29, 2020
3d2a7ab
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Oct 3, 2020
96a8475
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Oct 5, 2020
d49d819
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Oct 8, 2020
f693ed4
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Oct 9, 2020
058338a
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Oct 16, 2020
77e7c21
API: restrict DTA/TDA _from_sequence
jbrockmendel Oct 16, 2020
7e97d03
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Oct 23, 2020
f2a2aaf
test
jbrockmendel Oct 24, 2020
00d19e3
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Oct 31, 2020
bc532be
lint fixup
jbrockmendel Oct 31, 2020
2e612c4
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Nov 4, 2020
b8db5c1
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Nov 4, 2020
cad1104
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Nov 4, 2020
8f1b25d
Use _from_sequence_of_strings where appropriate
jbrockmendel Nov 5, 2020
fb41950
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Nov 5, 2020
9f72ad8
workaround for i8 case
jbrockmendel Nov 5, 2020
54c87da
mypy fixup
jbrockmendel Nov 5, 2020
d87533b
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Nov 9, 2020
158f2b2
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Nov 11, 2020
b071b5f
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Nov 19, 2020
bffacb1
from_sequence -> from_sequence_strict
jbrockmendel Nov 19, 2020
0153a51
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Nov 20, 2020
5acede8
mypy fixup
jbrockmendel Nov 20, 2020
3b25043
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Nov 20, 2020
8c87760
revert test edits
jbrockmendel Nov 20, 2020
18 changes: 18 additions & 0 deletions pandas/core/arrays/datetimelike.py
@@ -1595,6 +1595,24 @@ def _with_freq(self, freq):
# Shared Constructor Helpers


def ensure_arraylike(scalars, copy: bool) -> Tuple[Any, bool]:
    """
    Convert non-arraylike scalar sequences to ndarray.
    """
    if not hasattr(scalars, "dtype"):
        copy = False
        if np.ndim(scalars) == 0:
            scalars = list(scalars)

        scalars = np.asarray(scalars)
        if len(scalars) == 0:
            # Without casting, we would have float64 and so would reject later
            # in from_sequence
            scalars = scalars.astype(object)

    return scalars, copy
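A quick plain-NumPy illustration (outside the diff) of why the empty case above needs the object cast:

```python
import numpy as np

# An empty sequence defaults to float64, which the strict dtype checks
# added in _from_sequence would then reject; casting to object keeps
# empty input convertible downstream.
empty = np.asarray([])
assert empty.dtype == np.float64
assert empty.astype(object).dtype == object
```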


def validate_periods(periods):
    """
    If a `periods` argument is passed to the Datetime/Timedelta Array/Index
35 changes: 34 additions & 1 deletion pandas/core/arrays/datetimes.py
@@ -43,7 +43,12 @@
    pandas_dtype,
)
from pandas.core.dtypes.dtypes import DatetimeTZDtype
from pandas.core.dtypes.generic import ABCIndexClass, ABCPandasArray, ABCSeries
from pandas.core.dtypes.generic import (
    ABCIndexClass,
    ABCMultiIndex,
    ABCPandasArray,
    ABCSeries,
)
from pandas.core.dtypes.missing import isna

from pandas.core.algorithms import checked_add_with_arr
@@ -302,6 +307,34 @@ def _simple_new(

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy: bool = False):

        scalars, copy = dtl.ensure_arraylike(scalars, copy)

        if scalars.dtype.kind == "M":
            pass
        elif scalars.dtype == object:
            if isinstance(scalars, ABCMultiIndex):
                raise TypeError("Cannot create a DatetimeArray from MultiIndex")
Contributor: this tested?

            inferred = lib.infer_dtype(scalars)
            if inferred in ["datetime64", "date", "datetime", "empty"]:
                pass
            else:
                msg = f"{inferred} scalars cannot be converted to datetime64[ns]"
                raise TypeError(msg)
        elif is_string_dtype(scalars.dtype):
            # TODO: should go through from_sequence_of_strings?
            pass
        elif (
            is_categorical_dtype(scalars.dtype) and scalars.categories.dtype.kind == "M"
        ):
            # TODO: Could also use Categorical[object]
            # with inferred_type as above?
            pass
        else:
            msg = f"dtype {scalars.dtype} cannot be converted to datetime64[ns]"
Contributor: can you ensure we have a test that hits this
            raise TypeError(msg)

        return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)

    @classmethod
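Stripped of pandas-specific inference, the dtype gate added above can be sketched as a standalone check. This is a simplified sketch: the `lib.infer_dtype` and Categorical branches are collapsed into a plain object-dtype pass-through.

```python
import numpy as np

def check_datetime64_convertible(scalars: np.ndarray) -> None:
    # Simplified sketch of the new _from_sequence gate; object dtype is
    # waved through here (the PR inspects it element-wise via infer_dtype).
    if scalars.dtype.kind == "M":         # already datetime64
        return
    if scalars.dtype == object:           # deferred to inference in the PR
        return
    if scalars.dtype.kind in ("U", "S"):  # string dtypes still accepted
        return
    raise TypeError(f"dtype {scalars.dtype} cannot be converted to datetime64[ns]")
```

With this sketch, an int64 array such as `np.arange(3)` raises, while `np.array(["2000"], dtype="M8[ns]")` passes, mirroring the new strictness.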
16 changes: 16 additions & 0 deletions pandas/core/arrays/timedeltas.py
@@ -224,6 +224,22 @@ def _from_sequence(
        if dtype:
            _validate_td64_dtype(dtype)

        data, copy = dtl.ensure_arraylike(data, copy)

        if data.dtype.kind == "m":
            pass
        elif data.dtype == object:
            inferred = lib.infer_dtype(data)
            if inferred in ["timedelta64", "timedelta", "empty"]:
                pass
            else:
                raise TypeError(inferred)
        elif is_string_dtype(data.dtype):
            # TODO: should go through from_sequence_of_strings?
            pass
        else:
            raise TypeError(data.dtype)
Contributor: can you ensure we have a test that hits this
        data, inferred_freq = sequence_to_td64ns(data, copy=copy, unit=None)
        freq, _ = dtl.validate_inferred_freq(None, inferred_freq, False)

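The timedelta gate mirrors the datetime one. As a standalone sketch (simplified; the object-dtype `infer_dtype` branch is reduced to a pass-through):

```python
import numpy as np

def check_timedelta64_convertible(data: np.ndarray) -> None:
    # Simplified sketch of the parallel TimedeltaArray._from_sequence gate.
    if data.dtype.kind == "m":            # already timedelta64
        return
    if data.dtype == object:              # deferred to inference in the PR
        return
    if data.dtype.kind in ("U", "S"):     # string dtypes still accepted
        return
    raise TypeError(data.dtype)
```

Under this sketch, a float64 array raises (the case the new test below exercises) while `m8[ns]` data passes through untouched.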
2 changes: 1 addition & 1 deletion pandas/tests/arithmetic/test_period.py
@@ -1189,7 +1189,7 @@ def test_parr_add_sub_td64_nat(self, box_with_array, transpose):
        "other",
        [
            np.array(["NaT"] * 9, dtype="m8[ns]"),
            TimedeltaArray._from_sequence(["NaT"] * 9),
            TimedeltaArray._from_sequence([np.timedelta64("NaT", "ns")] * 9),
        ],
    )
    def test_parr_add_sub_tdt64_nat_array(self, box_with_array, other):
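The updated case builds the expected values from `np.timedelta64` scalars rather than strings. In plain NumPy both spellings still yield the same NaT-filled array — only the untyped-string path through `_from_sequence` is tightened by this PR:

```python
import numpy as np

# "NaT" strings remain valid for a typed NumPy constructor.
from_strings = np.array(["NaT"] * 9, dtype="m8[ns]")
from_scalars = np.array([np.timedelta64("NaT", "ns")] * 9)
assert from_strings.dtype == from_scalars.dtype == np.dtype("m8[ns]")
assert np.isnat(from_strings).all() and np.isnat(from_scalars).all()
```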
34 changes: 16 additions & 18 deletions pandas/tests/arrays/test_array.py
@@ -48,7 +48,7 @@
        (
            [pd.Period("2000", "D"), pd.Period("2001", "D")],
            "Period[D]",
            period_array(["2000", "2001"], freq="D"),
            period_array([pd.Timestamp("2000"), pd.Timestamp("2001")], freq="D"),
        ),
        # Period dtype
        (
@@ -57,11 +57,6 @@
            period_array(["2000"], freq="D"),
        ),
        # Datetime (naive)
        (
            [1, 2],
            np.dtype("datetime64[ns]"),
            DatetimeArray._from_sequence(np.array([1, 2], dtype="datetime64[ns]")),
        ),
        (
            np.array([1, 2], dtype="datetime64[ns]"),
            None,
@@ -70,41 +65,42 @@
        (
            pd.DatetimeIndex(["2000", "2001"]),
            np.dtype("datetime64[ns]"),
            DatetimeArray._from_sequence(["2000", "2001"]),
            DatetimeArray._from_sequence([pd.Timestamp("2000"), pd.Timestamp("2001")]),
        ),
        (
            pd.DatetimeIndex(["2000", "2001"]),
            None,
            DatetimeArray._from_sequence(["2000", "2001"]),
            DatetimeArray._from_sequence([pd.Timestamp("2000"), pd.Timestamp("2001")]),
        ),
        (
            ["2000", "2001"],
            np.dtype("datetime64[ns]"),
            DatetimeArray._from_sequence(["2000", "2001"]),
            DatetimeArray._from_sequence([pd.Timestamp("2000"), pd.Timestamp("2001")]),
        ),
        # Datetime (tz-aware)
        (
            ["2000", "2001"],
            pd.DatetimeTZDtype(tz="CET"),
            DatetimeArray._from_sequence(
                ["2000", "2001"], dtype=pd.DatetimeTZDtype(tz="CET")
                [pd.Timestamp("2000"), pd.Timestamp("2001")],
                dtype=pd.DatetimeTZDtype(tz="CET"),
            ),
        ),
        # Timedelta
        (
            ["1H", "2H"],
            np.dtype("timedelta64[ns]"),
            TimedeltaArray._from_sequence(["1H", "2H"]),
            TimedeltaArray._from_sequence([pd.Timedelta("1H"), pd.Timedelta("2H")]),
        ),
        (
            pd.TimedeltaIndex(["1H", "2H"]),
            np.dtype("timedelta64[ns]"),
            TimedeltaArray._from_sequence(["1H", "2H"]),
            TimedeltaArray._from_sequence([pd.Timedelta("1H"), pd.Timedelta("2H")]),
        ),
        (
            pd.TimedeltaIndex(["1H", "2H"]),
            None,
            TimedeltaArray._from_sequence(["1H", "2H"]),
            TimedeltaArray._from_sequence([pd.Timedelta("1H"), pd.Timedelta("2H")]),
        ),
        # Category
        (["a", "b"], "category", pd.Categorical(["a", "b"])),
@@ -184,11 +180,11 @@ def test_array_copy():
    # datetime
    (
        [pd.Timestamp("2000"), pd.Timestamp("2001")],
        DatetimeArray._from_sequence(["2000", "2001"]),
        DatetimeArray._from_sequence([pd.Timestamp("2000"), pd.Timestamp("2001")]),
    ),
    (
        [datetime.datetime(2000, 1, 1), datetime.datetime(2001, 1, 1)],
        DatetimeArray._from_sequence(["2000", "2001"]),
        DatetimeArray._from_sequence([pd.Timestamp("2000"), pd.Timestamp("2001")]),
    ),
    (
        np.array([1, 2], dtype="M8[ns]"),
@@ -202,7 +198,8 @@ def test_array_copy():
    (
        [pd.Timestamp("2000", tz="CET"), pd.Timestamp("2001", tz="CET")],
        DatetimeArray._from_sequence(
            ["2000", "2001"], dtype=pd.DatetimeTZDtype(tz="CET")
            [pd.Timestamp("2000"), pd.Timestamp("2001")],
            dtype=pd.DatetimeTZDtype(tz="CET"),
        ),
    ),
    (
@@ -211,13 +208,14 @@
            datetime.datetime(2001, 1, 1, tzinfo=cet),
        ],
        DatetimeArray._from_sequence(
            ["2000", "2001"], dtype=pd.DatetimeTZDtype(tz=cet)
            [pd.Timestamp("2000"), pd.Timestamp("2001")],
            dtype=pd.DatetimeTZDtype(tz=cet),
        ),
    ),
    # timedelta
    (
        [pd.Timedelta("1H"), pd.Timedelta("2H")],
        TimedeltaArray._from_sequence(["1H", "2H"]),
        TimedeltaArray._from_sequence([pd.Timedelta("1H"), pd.Timedelta("2H")]),
    ),
    (
        np.array([1, 2], dtype="m8[ns]"),
31 changes: 18 additions & 13 deletions pandas/tests/arrays/test_datetimes.py
@@ -21,6 +21,11 @@ def test_from_sequence_invalid_type(self):
        with pytest.raises(TypeError, match="Cannot create a DatetimeArray"):
            DatetimeArray._from_sequence(mi)

        msg = "mixed scalars cannot be converted to datetime64"
        with pytest.raises(TypeError, match=msg):
            # GH#37179
            DatetimeArray._from_sequence(mi._values)

    def test_only_1dim_accepted(self):
        arr = np.array([0, 1, 2, 3], dtype="M8[h]").astype("M8[ns]")

@@ -72,10 +77,10 @@ def test_mixing_naive_tzaware_raises(self, meth):
    def test_from_pandas_array(self):
        arr = pd.array(np.arange(5, dtype=np.int64)) * 3600 * 10 ** 9

        result = DatetimeArray._from_sequence(arr)._with_freq("infer")
        result = pd.DatetimeIndex(arr, freq="infer")

        expected = pd.date_range("1970-01-01", periods=5, freq="H")._data
        tm.assert_datetime_array_equal(result, expected)
        expected = pd.date_range("1970-01-01", periods=5, freq="H")
        tm.assert_index_equal(result, expected)

    def test_mismatched_timezone_raises(self):
        arr = DatetimeArray(
@@ -164,7 +169,7 @@ def test_cmp_dt64_arraylike_tznaive(self, all_compare_operators):
class TestDatetimeArray:
    def test_astype_to_same(self):
        arr = DatetimeArray._from_sequence(
            ["2000"], dtype=DatetimeTZDtype(tz="US/Central")
            [pd.Timestamp("2000")], dtype=DatetimeTZDtype(tz="US/Central")
        )
        result = arr.astype(DatetimeTZDtype(tz="US/Central"), copy=False)
        assert result is arr
@@ -197,7 +202,7 @@ def test_astype_int(self, dtype):

    def test_tz_setter_raises(self):
        arr = DatetimeArray._from_sequence(
            ["2000"], dtype=DatetimeTZDtype(tz="US/Central")
            [pd.Timestamp("2000")], dtype=DatetimeTZDtype(tz="US/Central")
        )
        with pytest.raises(AttributeError, match="tz_localize"):
            arr.tz = "UTC"
@@ -441,14 +446,14 @@ def test_shift_value_tzawareness_mismatch(self):
class TestSequenceToDT64NS:
    def test_tz_dtype_mismatch_raises(self):
        arr = DatetimeArray._from_sequence(
            ["2000"], dtype=DatetimeTZDtype(tz="US/Central")
            [pd.Timestamp("2000")], dtype=DatetimeTZDtype(tz="US/Central")
        )
        with pytest.raises(TypeError, match="data is already tz-aware"):
            sequence_to_dt64ns(arr, dtype=DatetimeTZDtype(tz="UTC"))

    def test_tz_dtype_matches(self):
        arr = DatetimeArray._from_sequence(
            ["2000"], dtype=DatetimeTZDtype(tz="US/Central")
            [pd.Timestamp("2000")], dtype=DatetimeTZDtype(tz="US/Central")
        )
        result, _, _ = sequence_to_dt64ns(arr, dtype=DatetimeTZDtype(tz="US/Central"))
        tm.assert_numpy_array_equal(arr._data, result)
@@ -461,12 +466,12 @@ def arr1d(self, tz_naive_fixture):
        dtype = DatetimeTZDtype(tz=tz) if tz is not None else np.dtype("M8[ns]")
        arr = DatetimeArray._from_sequence(
            [
                "2000-01-03",
                "2000-01-03",
                "NaT",
                "2000-01-02",
                "2000-01-05",
                "2000-01-04",
                pd.Timestamp("2000-01-03"),
                pd.Timestamp("2000-01-03"),
                pd.NaT,
                pd.Timestamp("2000-01-02"),
                pd.Timestamp("2000-01-05"),
                pd.Timestamp("2000-01-04"),
            ],
            dtype=dtype,
        )
14 changes: 13 additions & 1 deletion pandas/tests/arrays/test_timedeltas.py
@@ -61,6 +61,16 @@ def test_copy(self):
        assert arr._data is not data
        assert arr._data.base is not data

    def test_from_sequence_invalid_dtypes(self):
        # GH#37179
        data = np.arange(5, dtype=np.float64)
        with pytest.raises(TypeError, match="float64"):
            TimedeltaArray._from_sequence(data)

        with pytest.raises(TypeError, match="floating"):
            # object-dtype array of floats
            TimedeltaArray._from_sequence(data.astype(object))


class TestTimedeltaArray:
# TODO: de-duplicate with test_npsum below
@@ -203,7 +213,9 @@ def test_sum_empty(self, skipna):
        assert result == Timedelta(0)

    def test_min_max(self):
        arr = TimedeltaArray._from_sequence(["3H", "3H", "NaT", "2H", "5H", "4H"])
        vals = ["3H", "3H", "NaT", "2H", "5H", "4H"]
        vals = [Timedelta(x) for x in vals]
        arr = TimedeltaArray._from_sequence(vals)

        result = arr.min()
        expected = Timedelta("2H")
2 changes: 1 addition & 1 deletion pandas/tests/extension/test_categorical.py
@@ -199,7 +199,7 @@ def test_cast_category_to_extension_dtype(self, expected):
    )
    def test_consistent_casting(self, dtype, expected):
        # GH 28448
        result = Categorical("2015-01-01").astype(dtype)
        result = Categorical(Timestamp("2015-01-01")).astype(dtype)
        assert result == expected


1 change: 1 addition & 0 deletions pandas/tests/frame/test_constructors.py
@@ -2869,6 +2869,7 @@ def test_from_tzaware_mixed_object_array(self):
        ]
        assert (res.dtypes == expected_dtypes).all()

    @pytest.mark.xfail(reason="DatetimeArray._from_sequence no longer accepts i8")
Contributor: we need a note for this?

Member Author: the question is if we want to disable/deprecate this or find a way to make it work

Member Author: is this something we want to support or disallow?

Member: To be clear, the question is not so much about _from_sequence, but rather about our public constructors (Series(), Index(), array()). Currently, those support converting integers to datetime dtype. For example:

pd.Series([1, 2, 3], dtype="datetime64[ns]")

works fine (and the same for the other two). So the question is whether we want to keep this kind of behaviour. Personally I would say there is not much harm in allowing it (also given that e.g. numpy and pyarrow have the same behaviour); it's only when there are timezones that there could potentially be a bit more ambiguity (but e.g. for timedelta that's already not relevant).

Member Author: I tend to agree. Any thoughts on where to catch this case?

Member:
> Any thoughts on where to catch this case?

Not fully sure what you mean, but (until #33254 is resolved) doing such conversions is the responsibility of _from_sequence (for pd.array), so if we want to keep this capability, at the moment it is _from_sequence that needs to accept ints.

Member Author:
> Not fully sure what you mean

I mean that we could make this Series construction work without loosening the _from_sequence restrictions by instead calling [??] in this case.

> so if we want to keep this capability, at the moment it is _from_sequence that needs to accept ints.

To be clear, is this what you are suggesting we do? My impression from previous conversations is that you are unambiguously in favor of this tightening.

Member (jorisvandenbossche, Nov 5, 2020): My earlier suggestion at #37179 (comment) was to use _from_sequence/_from_sequence_strict (instead of _from_sequence_non_strict/_from_sequence), until we decide on better naming for this, i.e. keeping the non-strict behaviour of _from_sequence for now. That already cleans up the code (separating both behaviours more cleanly), without being blocked on API discussions.

> My impression from previous conversations is that you are unambiguously in favor of this tightening.

Yes, for sure. But before we can implement that, we still need to come up with an API for the non-strict conversion as well, as that is still something we need to support. That's the whole discussion of #33254.
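For reference, the integer-to-datetime conversion under discussion matches NumPy's own behaviour (plain NumPy shown here; the pandas constructors layer more machinery on top):

```python
import numpy as np

# integers are read as nanosecond offsets from the Unix epoch
arr = np.array([1, 2, 3], dtype="datetime64[ns]")
assert arr.dtype == np.dtype("M8[ns]")
assert arr.view("i8").tolist() == [1, 2, 3]
```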

    def test_from_2d_ndarray_with_dtype(self):
        # GH#12513
        array_dim2 = np.arange(10).reshape((5, 2))