
API: EA interface - strictness of _from_sequence #33254

Closed
jorisvandenbossche opened this issue Apr 3, 2020 · 16 comments · Fixed by #53089
Labels
API Design · Constructors (Series/DataFrame/Index/pd.array constructors) · ExtensionArray (Extending pandas with custom dtypes or arrays)

Comments

@jorisvandenbossche
Member

Copying part of the discussion out of #32586 into a more specific issue.

Problem statement: currently, _from_sequence is not very explicit about what it should accept as scalars. In practice, this means it is mostly very liberal, since it is also used under the hood when creating an array from any list-like of objects in pd.array(.., dtype).
However, in some cases we need a stricter version that only accepts actual scalars of the array (meaning, the type of values you get from array[0] or array.max(), if it supports that kind of reduction). This causes issues such as #31108.

So, what should _from_sequence accept? Should it only be sequences that are unambiguously this dtype?

I think it will be useful to have a "strict" version that basically only accepts instances of ExtensionDtype.type or NA values. But we also still need a "liberal" method for the other use cases like pd.array(.., dtype).

The strict version would be used when, for some reason, we go through object dtype (or a list of scalars, or something equivalent). For example in groupby, where we assemble a list of scalars from the reductions into a new column.
From a testing point of view, that would mean we can test that EA._the_strict_method(np.asarray(EA, dtype=object), dtype=EA.dtype) and EA._the_strict_method(list(EA), dtype=EA.dtype) can roundtrip faithfully.
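
For illustration, a minimal sketch of such a roundtrip check. Here _from_scalars is only a placeholder name for the hypothetical strict method (it is not part of the current interface), and check_strict_roundtrip is an illustrative helper; assert_extension_array_equal is the existing pandas testing utility.

import numpy as np
import pandas._testing as tm

def check_strict_roundtrip(arr):
    # arr: any ExtensionArray instance.
    # The strict constructor should rebuild an equal array from the array's
    # own scalars when given the original dtype.
    cls = type(arr)
    result = cls._from_scalars(np.asarray(arr, dtype=object), dtype=arr.dtype)
    tm.assert_extension_array_equal(result, arr)
    result = cls._from_scalars(list(arr), dtype=arr.dtype)
    tm.assert_extension_array_equal(result, arr)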


Assuming we agree that we need a strict version for certain use cases, I think there are two main options:

  1. Keep _from_sequence as is, and add a new, stricter _from_scalars method (whose base class implementation can initially call _from_sequence for backwards compatibility; see the sketch after this list). We can use _from_scalars in the cases where we need the strict version, and keep using _from_sequence elsewhere (eg in pd.array(.., dtype=))

  2. Update the expectation in our spec that _from_sequence should only accept a sequence of scalars of the array's type (so make _from_sequence the strict method), and use the astype machinery for construction. Basically, the current flexible _from_sequence would then be equivalent to casting an object ndarray (or generally any type) to your EA dtype.
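
A minimal sketch of what option 1 could look like in the base class, assuming the default simply falls back to the existing liberal constructor (the name _from_scalars and the fallback are illustrative, not an agreed design):

class ExtensionArray:
    @classmethod
    def _from_scalars(cls, scalars, *, dtype):
        # Strict constructor: should only accept instances of dtype.type or NA.
        # The default falls back to the liberal _from_sequence for backwards
        # compatibility; EA authors would override it to enforce strictness.
        return cls._from_sequence(scalars, dtype=dtype)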

Are there preferences? (or other options?)

From a backwards compatibility point of view, I think both are similar (in both cases you need to update a method (_from_scalars or _from_sequence), and in both cases initially the flexible version will still be used as fallback until the update is done).

The second option of course requires an update to the astype machinery (#22384), which doesn't exist today, but on the other hand is also something we need to do at some point eventually (but a much bigger topic to solve).

@jorisvandenbossche added the API Design and ExtensionArray labels on Apr 3, 2020
@jorisvandenbossche
Member Author

@TomAugspurger replied (#32586 (comment))

I think my preference is for _from_sequence to be the "parse this sequence into an array" method and for _from_scalars to be the strict version.

@jorisvandenbossche
Member Author

@jbrockmendel replied (#32586 (comment))

Are there preferences? (or other options?)

I would rather avoid adding another constructor (I'm already not wild about _from_sequence_of_strings).

IIRC a big part of the reason why _from_sequence became less strict than originally intended is that DTA/TDA/PA _from_sequence took on a bunch of cases that the DTI/TDI/PI constructors handle, in particular ndarray[i8], which by strict rules shouldn't be accepted. Most of the internal usages of those methods have now been removed, so it should be viable to tighten those three _from_sequence methods. If we do that, what will be left in the way of making _from_sequence strict(er)?

For example, _from_sequence is also used in pd.array(.., dtype=) and in astype.

Maybe something like fromtype that astype can defer to? There should also be some relationship between fromtype and _from_factorized
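
For context, the ndarray[i8] case mentioned above looks roughly like the following (illustration only; whether this input should be accepted is exactly what is being discussed):

import numpy as np
import pandas as pd

i8_values = np.arange(3, dtype="i8")
# By strict rules these are plain integers, not datetime scalars, so a strict
# _from_sequence would reject them; historically they were handled here because
# the DatetimeIndex-style constructors accept this kind of input.
pd.arrays.DatetimeArray._from_sequence(i8_values)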

@jorisvandenbossche
Member Author

Maybe something like fromtype that astype can defer to?

Well, see option 2 above, which proposes to defer this to the astype machinery. That will involve some way to have a casting method from a certain dtype to the EA dtype in question. The main problem with solving this issue now is that the general astype machinery does not yet exist.

There should also be some relationship between fromtype and _from_factorized

I don't think this is necessarily required. The values for factorize can be completely internal to the array (not to be used by users).

@jorisvandenbossche
Member Author

I am experimenting a bit with this, and I am running into the following issue:

Right now, _from_sequence can also be called without passing the original dtype (since it is a class method on the array class, and the dtype is not a required argument).
In practice, this means that for example SparseArray can create a sparse array with any values dtype, and not only the original dtype.

Practical example where a Sparse[int] dtype array results in a Sparse[bool] array:

In [24]: a = pd.arrays.SparseArray([0, 1, 0]) 

In [25]: a
Out[25]: 
[0, 1, 0]
Fill: 0
IntIndex
Indices: array([1], dtype=int32)

In [27]: s = pd.Series(a) 

In [31]: s.combine(1, lambda x, y: x == y)  
Out[31]: 
0    False
1     True
2    False
dtype: Sparse[bool, False]

Now, if we pass the actual dtype instance to _from_sequence and strictly follow it, the above would no longer be possible: the extension dtype could only be preserved if the values actually fit the original Sparse[int] dtype.
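
Concretely (reusing a from the snippet above), the question becomes what a call that does pass the original dtype should do: coerce the bool values to fit Sparse[int], raise, or still return Sparse[bool]. A sketch of the call in question:

# the scalars produced by combine() are bools, but the Series' dtype is Sparse[int]
values = [False, True, False]
result = pd.arrays.SparseArray._from_sequence(values, dtype=a.dtype)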


Slightly related, at #32586 (comment), @crepererum raises another point regarding the current _from_sequence interface:

  • _from_sequence (with dtype): how strict should the EA cast (e.g. ints to bools, None to bools, floats to ints, ...)?
  • _from_sequence (without dtype): what are the rules here? (e.g. ints and floats => floats, always use the widest type found in the data, should Python date objects be cast to numpy, ...)

So in any case (also for the future liberal method), we should clarify those questions. Should there be a difference between whether a dtype is passed or not? And when should we internally pass a dtype or not?
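
To make those questions concrete, here are a few calls whose expected behaviour the interface docs would need to spell out (only the inputs are shown; the point is precisely that the outputs are currently unspecified):

import datetime
import pandas as pd

# with an explicit dtype: how lossy may the conversion be?
pd.array([0, 1, 1], dtype="boolean")      # ints to bools
pd.array([1.5, 2.0], dtype="Int64")       # floats to ints (potentially lossy)
pd.array([None, True], dtype="boolean")   # None to NA

# without a dtype: which inference rules apply?
pd.array([1, 2.5])                        # mixed ints and floats
pd.array([datetime.date(2020, 1, 1)])     # Python date objects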

@jbrockmendel
Member

Should there be a difference between whether a dtype is passed or not? And when should we internally pass a dtype or not?

How often do we pass dtype in practice? If it is close-to-always, we might simplify the code by making it a required argument.

@topper-123
Contributor

Maybe not use _from_scalars, but add a fastpath parameter to _from_sequence?

@jorisvandenbossche
Member Author

A keyword is certainly an option (I wouldn't call it fastpath, since it is not necessarily faster, but something like strict). One problem though is that adding a new keyword is not necessarily backwards compatible with already existing EAs.
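
For illustration of the compatibility problem (strict is a hypothetical keyword here): an existing third-party EA was written against the current signature, so pandas could not simply start passing the new argument.

from pandas.api.extensions import ExtensionArray

class ThirdPartyArray(ExtensionArray):
    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        ...  # written against the current interface, no strict keyword

# Internally calling
#     ThirdPartyArray._from_sequence(scalars, dtype=dtype, strict=True)
# would raise TypeError (unexpected keyword argument), so the keyword would
# need feature detection or a deprecation cycle before it could be relied on.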

@topper-123
Contributor

That's a good point, though a separate method would have the same problem (except if it routes to _from_sequence, but that would break the strictness requirement...)?

@jorisvandenbossche
Member Author

except if it routes to _from_sequence but that would break the strictness requirement.

Yes, that's the idea (see point 1 in the top post).
Basically, anything backwards compatible will initially "break" the strictness requirement until the EA is updated to implement the strict behaviour; I don't think there is a way around that (but in the end, it will not be worse than what we have now).

@topper-123
Contributor

topper-123 commented Apr 16, 2020

Sorry for closing this, I pressed the wrong button.

Are we close to a decision on this? The docs on text data say: "We recommend using StringDtype to store text data.", but there are problems in 1.0 and master like the ones below:

>>> pd.Series([1, "x"], dtype="string")
ValueError: StringArray requires a sequence of strings or pandas.NA
>>> pd.Series([1, "x"]).astype("string")
ValueError: StringArray requires a sequence of strings or pandas.NA
# using str in general just works
>>> pd.Series([1, "x"], dtype=str)  # works
>>> pd.Series([1, "x"], dtype=str).astype("string")  # works
>>> pd.Series([1, "x"]).astype(str).astype("string")  # works
# arbitrary conversion to StringDtype doesn't work
>>> pd.Series([1, 2]).astype("string")
ValueError: StringArray requires a sequence of strings or pandas.NA
>>> pd.Series([1, 2]).astype(str).astype("string")  # works

Those differences are IMO quite unintuitive for a user who expects StringDtype to be a "better str".

I've made a PR on this in #33465. It lets go of the strictness of _from_sequence and in return makes StringDtype feel more natural to work with. It could maybe use a few more tests, but it works in all cases that I've tried it on.

@topper-123
Contributor

Gentle ping.

@jbrockmendel
Member

Trying to tighten what DatetimeArray/TimedeltaArray _from_sequence accept, following #36718, #36731. The main blocker is test_from_2d_ndarray_with_dtype:

import numpy as np
from pandas import DataFrame

array_dim2 = np.arange(10).reshape((5, 2))
df = DataFrame(array_dim2, dtype="datetime64[ns, UTC]")

expected = DataFrame(array_dim2).astype("datetime64[ns, UTC]")

The astype call here goes through Block.astype -> astype_nansafe -> DatetimeArray.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)

A few options come to mind:

  1. When dtype is passed explicitly, be less strict (or have another kwarg), as discussed above
  2. A new method like from_dtype that behaves like a reverse astype
  3. Say the test is wrong; integers.astype(datetimes) isn't meaningful, integers.view(datetimes) is (this would likely be a PITA deprecation/change, but is likely More Correct)
  4. Instead of calling _from_sequence, do something like pd.array(arr).astype(dtype) and rely on IntegerArray/PandasArray to get it right (they raise ATM, consistent with 3; sketched after this list)
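
A rough sketch of option 4 for this particular test, applied per column (relying on the point above that IntegerArray/PandasArray currently raise when cast to a datetime dtype):

import pandas as pd

col = array_dim2[:, 0]                       # one integer column from the test above
intermediate = pd.array(col)                 # inferred IntegerArray (or PandasArray)
intermediate.astype("datetime64[ns, UTC]")   # raises ATM, consistent with option 3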

Thoughts @jorisvandenbossche ?

@jorisvandenbossche
Member Author

I think many of those questions also circle back to the casting discussion at #22384. If we have a general way to do "from dtype -> to dtype" for all EA dtypes (eg extension dtypes can register such conversions), then the generic _from_sequence method could be seen as the "object -> EAdtype" casting.

@jorisvandenbossche
Member Author

I am trying to revive my old branch for this.
Any thoughts on #33254 (comment) ?

@jbrockmendel
Member

IIUC the issue is cases where the dtype for a sequence of scalars is ambiguous (despite the scalars being valid for a strict _from_sequence). The options that come to mind are:

  1. if no dtype is passed, make a reasonable inference
  2. if no dtype is passed, resist the temptation to guess, raise
  3. make dtype required in all cases

My inclination is to lean towards being as strict as possible, in part because it is easier to relax later than the other way around.

Also not obvious how we would enforce it for 3rd party EAs.

@jbrockmendel
Member

I've been poking at changing DatetimeArray._from_sequence to be more strict and seeing how much it breaks the world. So far just changing the behavior for ndarray[object] arguments. The main problem that comes up is in astype_nansafe, where we call dtype.construct_array_type()._from_sequence(...) and want e.g. int64 or strings to Just Work (xref #22384).

One option that would handle the .astype situation (and I think some construction situations) is to be strict when dtype is not specified and non-strict otherwise.

Really at this point it is very clear we need something strict and something non-strict.
