
API: EA interface - strictness of _from_sequence #33254

Closed
jorisvandenbossche opened this issue Apr 3, 2020 · 16 comments · Fixed by #53089
Labels
API Design · Constructors (Series/DataFrame/Index/pd.array constructors) · ExtensionArray (Extending pandas with custom dtypes or arrays)

Comments

@jorisvandenbossche
Member

Copying part of the discussion out of #32586 into a more specific issue.

Problem statement: currently, _from_sequence is not very explicit about what it should accept as scalars. In practice, this means it is mostly very liberal, since it is also used under the hood when creating an array from any list-like of objects in pd.array(.., dtype).
However, in some cases we need a stricter version that only accepts actual scalars of the array (meaning, the type of values you get from array[0] or array.max(), if it supports that kind of reduction). This causes issues such as #31108.

So, what should _from_sequence accept? Should it only be sequences that are unambiguously this dtype?

I think it will be useful to have a "strict" version that basically only accepts instances of ExtensionDtype.type or NA values. But we also still need a "liberal" method for the other use cases like pd.array(.., dtype).

The strict version would be used when, for some reason, we go through object dtype (or a list of scalars, or something equivalent). For example in groupby, where we assemble a list of scalars from the reductions into a new column.
From a testing point of view, that would mean we can test that EA._the_strict_method(np.asarray(EA, dtype=object), dtype=EA.dtype) and EA._the_strict_method(list(EA), dtype=EA.dtype) can roundtrip faithfully.
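
For illustration, a minimal sketch of such a roundtrip check. Here _from_scalars is only a placeholder name for the hypothetical strict method (it is not part of the current interface), and check_strict_roundtrip is an illustrative helper; assert_extension_array_equal is the existing pandas testing utility.

import numpy as np
import pandas._testing as tm

def check_strict_roundtrip(arr):
    # arr: any ExtensionArray instance.
    # The strict constructor should rebuild an equal array from the array's
    # own scalars when given the original dtype.
    cls = type(arr)
    result = cls._from_scalars(np.asarray(arr, dtype=object), dtype=arr.dtype)
    tm.assert_extension_array_equal(result, arr)
    result = cls._from_scalars(list(arr), dtype=arr.dtype)
    tm.assert_extension_array_equal(result, arr)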


Assuming we agree that we need a strict version for certain use cases, I think there are two main options:

  1. Keep _from_sequence as is, and add a new, stricter _from_scalars method (whose base class implementation can initially call _from_sequence for backwards compatibility; see the sketch after this list). We can use _from_scalars in the cases where we need the strict version, and keep using _from_sequence elsewhere (eg in pd.array(.., dtype=))

  2. Update the expectation in our spec that _from_sequence should only accept a sequence of scalars of the array's type (so make _from_sequence the strict method), and use the astype machinery for construction. Basically, the current flexible _from_sequence would then be equivalent to casting an object ndarray (or generally any type) to your EA dtype.
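
A minimal sketch of what option 1 could look like in the base class, assuming the default simply falls back to the existing liberal constructor (the name _from_scalars and the fallback are illustrative, not an agreed design):

class ExtensionArray:
    @classmethod
    def _from_scalars(cls, scalars, *, dtype):
        # Strict constructor: should only accept instances of dtype.type or NA.
        # The default falls back to the liberal _from_sequence for backwards
        # compatibility; EA authors would override it to enforce strictness.
        return cls._from_sequence(scalars, dtype=dtype)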

Are there preferences? (or other options?)

From a backwards compatibility point of view, I think both are similar (in both cases you need to update a method (_from_scalars or _from_sequence), and in both cases initially the flexible version will still be used as fallback until the update is done).

The second option of course requires an update to the astype machinery (#22384), which doesn't exist today, but on the other hand is also something we need to do at some point eventually (but a much bigger topic to solve).

@jorisvandenbossche added the API Design and ExtensionArray labels on Apr 3, 2020
@jorisvandenbossche
Member Author

@TomAugspurger replied (#32586 (comment))

I think my preference is for _from_sequence to be the "parse this sequence into an array" method and for _from_scalars to be the strict version.

@jorisvandenbossche
Member Author

@jbrockmendel replied (#32586 (comment))

Are there preferences? (or other options?)

I would rather avoid adding another constructor (I'm already not wild about _from_sequence_of_strings).

IIRC a big part of the reason why _from_sequence became less strict than originally intended is that DTA/TDA/PA _from_sequence took on a bunch of cases that the DTI/TDI/PI constructors handle, in particular ndarray[i8], which by strict rules shouldn't be accepted. Most of the internal usages of those methods have now been removed, so it should be viable to tighten those three _from_sequence methods. If we do that, what will be left in the way of making _from_sequence strict(er)?

For example, _from_sequence is also used in pd.array(.., dtype=) and in astype.

Maybe something like fromtype that astype can defer to? There should also be some relationship between fromtype and _from_factorized
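
For context, the ndarray[i8] case mentioned above looks roughly like the following (illustration only; whether this input should be accepted is exactly what is being discussed):

import numpy as np
import pandas as pd

i8_values = np.arange(3, dtype="i8")
# By strict rules these are plain integers, not datetime scalars, so a strict
# _from_sequence would reject them; historically they were handled here because
# the DatetimeIndex-style constructors accept this kind of input.
pd.arrays.DatetimeArray._from_sequence(i8_values)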

@jorisvandenbossche
Member Author

Maybe something like fromtype that astype can defer to?

Well, see option 2 above, which proposes to defer this to the astype machinery. That will involve some way to have a casting method from a certain dtype to the EA dtype in question. The main problem with solving this issue now is that the general astype machinery does not yet exist.

There should also be some relationship between fromtype and _from_factorized

I don't think this is necessarily required. The values for factorize can be completely internal to the array (not to be used by users).

@jorisvandenbossche
Member Author

I am experimenting a bit with this, and I am running into the following issue:

Right now, _from_sequence can also be called without passing the original dtype (since it is a class method on the array class, and the dtype is not a required argument).
In practice, this means that for example SparseArray can create a sparse array with any values dtype, and not only the original dtype.

Practical example where a Sparse[int] dtype array results in a Sparse[bool] array:

In [24]: a = pd.arrays.SparseArray([0, 1, 0]) 

In [25]: a
Out[25]: 
[0, 1, 0]
Fill: 0
IntIndex
Indices: array([1], dtype=int32)

In [27]: s = pd.Series(a) 

In [31]: s.combine(1, lambda x, y: x == y)  
Out[31]: 
0    False
1     True
2    False
dtype: Sparse[bool, False]

Now, if we pass the actual dtype instance to _from_sequence and strictly follow it, the above would no longer be possible: the extension dtype could only be preserved if the values actually fit the original Sparse[int] dtype.
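
Concretely (reusing a from the snippet above), the question becomes what a call that does pass the original dtype should do: coerce the bool values to fit Sparse[int], raise, or still return Sparse[bool]. A sketch of the call in question:

# the scalars produced by combine() are bools, but the Series' dtype is Sparse[int]
values = [False, True, False]
result = pd.arrays.SparseArray._from_sequence(values, dtype=a.dtype)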


Slightly related, at #32586 (comment), @crepererum raises another point regarding the current _from_sequence interface:

  • _from_sequence (with dtype): how strict should the EA cast (e.g. ints to bools, None to bools, floats to ints, ...)?
  • _from_sequence (without dtype): what are the rules here? (e.g. ints and floats => floats, always use the widest type found in the data, should Python date objects be cast to numpy, ...)

So in any case (also for the future liberal method), we should clarify those questions. Should there be a difference between whether a dtype is passed or not? And when should we internally pass a dtype or not?
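
To make those questions concrete, here are a few calls whose expected behaviour the interface docs would need to spell out (only the inputs are shown; the point is precisely that the outputs are currently unspecified):

import datetime
import pandas as pd

# with an explicit dtype: how lossy may the conversion be?
pd.array([0, 1, 1], dtype="boolean")      # ints to bools
pd.array([1.5, 2.0], dtype="Int64")       # floats to ints (potentially lossy)
pd.array([None, True], dtype="boolean")   # None to NA

# without a dtype: which inference rules apply?
pd.array([1, 2.5])                        # mixed ints and floats
pd.array([datetime.date(2020, 1, 1)])     # Python date objects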

@jbrockmendel
Member

Should there be a difference between whether a dtype is passed or not? And when should we internally pass a dtype or not?

How often do we pass dtype in practice? If it is close-to-always, we might simplify the code by making it a required argument.

@topper-123
Contributor

Maybe not use _from_scalars, but add a fastpath parameter to _from_sequence?

@jorisvandenbossche
Member Author

A keyword is certainly an option (I wouldn't call it fastpath, since it is not necessarily faster, but something like strict). One problem though is that adding a new keyword is not necessarily backwards compatible with already existing EAs.
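
For illustration of the compatibility problem (strict is a hypothetical keyword here): an existing third-party EA was written against the current signature, so pandas could not simply start passing the new argument.

from pandas.api.extensions import ExtensionArray

class ThirdPartyArray(ExtensionArray):
    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        ...  # written against the current interface, no strict keyword

# Internally calling
#     ThirdPartyArray._from_sequence(scalars, dtype=dtype, strict=True)
# would raise TypeError (unexpected keyword argument), so the keyword would
# need feature detection or a deprecation cycle before it could be relied on.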

@topper-123
Contributor

That's a good point, though a separate method would have the same problem (except if it routes to _from_sequence, but that would break the strictness requirement...)?

@jorisvandenbossche
Member Author

except if it routes to _from_sequence but that would break the strictness requirement.

Yes, that's the idea (see point 1 in the top post).
Basically, anything backwards compatible will initially "break" the strictness requirement until the EA is updated to implement the strict behaviour; I don't think there is a way around that (but in the end, it will not be worse than what we have now).

@topper-123
Contributor

topper-123 commented Apr 16, 2020

Sorry for closing this, I pressed the wrong button.

Are we close to a decision on this? The docs on text data say: "We recommend using StringDtype to store text data.", but there are problems in 1.0 and master like the ones below:

>>> pd.Series([1, "x"], dtype="string")
ValueError: StringArray requires a sequence of strings or pandas.NA
>>> pd.Series([1, "x"]).astype("string")
ValueError: StringArray requires a sequence of strings or pandas.NA
# using str in general just works
>>> pd.Series([1, "x"], dtype=str)  # works
>>> pd.Series([1, "x"], dtype=str).astype("string")  # works
>>> pd.Series([1, "x"]).astype(str).astype("string")  # works
# arbitrary conversion to StringDtype doesn't work
>>> pd.Series([1, 2]).astype("string")
ValueError: StringArray requires a sequence of strings or pandas.NA
>>> pd.Series([1, 2]).astype(str).astype("string")  # works

Those differences are IMO quite unintuitive for a user who expects StringDtype to be a "better str".

I've made a PR on this in #33465. It lets go of the strictness of _from_sequence and in return makes StringDtype feel more natural to work with. It could maybe use a few more tests, but it works in all cases that I've tried it on.

@topper-123
Contributor

Gentle ping.

@jbrockmendel
Member

Trying to tighten what DatetimeArray/TimedeltaArray _from_sequence accept, following #36718, #36731. The main blocker is test_from_2d_ndarray_with_dtype:

import numpy as np
from pandas import DataFrame

array_dim2 = np.arange(10).reshape((5, 2))
df = DataFrame(array_dim2, dtype="datetime64[ns, UTC]")

expected = DataFrame(array_dim2).astype("datetime64[ns, UTC]")

The astype call here goes through Block.astype -> astype_nansafe -> DatetimeArray.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)

A few options come to mind:

  1. When dtype is passed explicitly, be less strict (or have another kwarg), as discussed above
  2. A new method like from_dtype that behaves like a reverse astype
  3. Say the test is wrong; integers.astype(datetimes) isn't meaningful, integers.view(datetimes) is (this would likely be a PITA deprecation/change, but is likely More Correct)
  4. Instead of calling _from_sequence, do something like pd.array(arr).astype(dtype) and rely on IntegerArray/PandasArray to get it right (they raise ATM, consistent with 3; sketched after this list)
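
A rough sketch of option 4 for this particular test, applied per column (relying on the point above that IntegerArray/PandasArray currently raise when cast to a datetime dtype):

import pandas as pd

col = array_dim2[:, 0]                       # one integer column from the test above
intermediate = pd.array(col)                 # inferred IntegerArray (or PandasArray)
intermediate.astype("datetime64[ns, UTC]")   # raises ATM, consistent with option 3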

Thoughts @jorisvandenbossche ?

@jorisvandenbossche
Member Author

I think many of those questions also circle back to the casting discussion at #22384. If we have a general way to do "from dtype -> to dtype" for all EA dtypes (eg extension dtypes can register such conversions), then the generic _from_sequence method could be seen as the "object -> EAdtype" casting.

@jorisvandenbossche
Member Author

I am trying to revive my old branch for this.
Any thoughts on #33254 (comment) ?

@jbrockmendel
Member

IIUC the issue is cases where the dtype for a sequence of scalars is ambiguous (despite the scalars being valid for a strict _from_sequence). The options that come to mind are:

  1. if no dtype is passed, make a reasonable inference
  2. if no dtype is passed, resist the temptation to guess, raise
  3. make dtype required in all cases

My inclination is to lean towards being as strict as possible, in part because it is easier to relax later than the other way around.

Also not obvious how we would enforce it for 3rd party EAs.

@jbrockmendel
Member

I've been poking at changing DatetimeArray._from_sequence to be more strict and seeing how much it breaks the world. So far just changing the behavior for ndarray[object] arguments. The main problem that comes up is in astype_nansafe, where we call dtype.construct_array_type()._from_sequence(...) and want e.g. int64 or strings to Just Work (xref #22384).

One option that would handle the .astype situation (and I think some construction situations) is to be strict when dtype is not specified and non-strict otherwise.

Really at this point it is very clear we need something strict and something non-strict.
