API: inconsistency in casting back to its EA dtype (try_cast_to_ea) #31108

Closed
jorisvandenbossche opened this issue Jan 17, 2020 · 1 comment · Fixed by #53089
Labels
API - Consistency (Internal Consistency of API/Behavior), Bug, ExtensionArray (Extending pandas with custom dtypes or arrays)

Comments


jorisvandenbossche commented Jan 17, 2020

With the current logic we have in try_cast_to_ea:

    try:
        result = cls_or_instance._from_sequence(obj, dtype=dtype)
    except Exception:
        # We can't predict what downstream EA constructors may raise
        result = obj
    return result

it is easy to get inconsistencies, because the result depends critically on what _from_sequence accepts as a valid scalar.

For example, we currently have:

>>> s = pd.Series([0, 1, 2], dtype="Int64") 
>>> s.combine(0, lambda x, y: x == y)      
0     True
1    False
2    False
dtype: bool

However, if the IntegerArray constructor is changed slightly to be more willing to accept all kinds of boolean-like values (see #31104 for a current inconsistency in this area; this is what happens in #30282), you can get:

>>> s = pd.Series([0, 1, 2], dtype="Int64")  
>>> s.combine(0, lambda x, y: x == y)       
0    1
1    0
2    0
dtype: Int64

So how "forgiving" the EA constructor is determines the output dtype.
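The dependence on the constructor can be reproduced without pandas. The following is a minimal sketch, not pandas code: StrictInts and LenientInts are hypothetical stand-ins for an extension array whose _from_sequence either rejects or accepts booleans, and try_cast_to_ea just mirrors the try/except logic quoted above.

```python
class StrictInts:
    """Hypothetical stand-in for an EA whose constructor rejects booleans."""

    def __init__(self, values):
        self.values = values

    @classmethod
    def _from_sequence(cls, scalars, dtype=None):
        for value in scalars:
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"cannot treat {value!r} as an integer")
        return cls(list(scalars))


class LenientInts(StrictInts):
    """Hypothetical stand-in for a more forgiving constructor."""

    @classmethod
    def _from_sequence(cls, scalars, dtype=None):
        # Coerces anything int() accepts, including True/False
        return cls([int(value) for value in scalars])


def try_cast_to_ea(cls_or_instance, obj, dtype=None):
    # Same shape as the logic quoted in this issue
    try:
        result = cls_or_instance._from_sequence(obj, dtype=dtype)
    except Exception:
        result = obj
    return result


# Result of an elementwise "x == 0" over [0, 1, 2]
combined = [x == 0 for x in [0, 1, 2]]

strict_result = try_cast_to_ea(StrictInts, combined)
lenient_result = try_cast_to_ea(LenientInts, combined)

print(type(strict_result).__name__)   # list: the cast failed, booleans are kept
print(type(lenient_result).__name__)  # LenientInts: booleans were cast back
print(lenient_result.values)          # [1, 0, 0]
```

The same input and the same casting helper give either the boolean result or an integer container, purely as a function of what the constructor tolerates.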

The example here uses combine, but the try_cast_to_ea function is also used in part of groupby.

@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. API - Consistency Internal Consistency of API/Behavior labels Jan 17, 2020
jorisvandenbossche commented

Another example (hypothetical, but I think it still illustrates the problem). If we did not have the dtype.kind != "M" check before calling try_cast_to_ea, as in

    if is_extension_array_dtype(dtype) and dtype.kind != "M":

we would see the following behaviour:

>>> df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 
...                    'val': pd.date_range("2012-01-01", periods=4, tz="UTC")})

>>> df.groupby('key')['val'].agg(lambda x: x.mean().year)  # agg func that returns int
key
a   1970-01-01 00:00:00.000002012+00:00
b   1970-01-01 00:00:00.000002012+00:00
Name: val, dtype: datetime64[ns, UTC]

because we allow creating a datetime64 array from integers. Here we guard against that explicitly, but that seems like a bit of a code smell, and the same problem can occur for other EA types.
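To see where the odd timestamps in that output come from: a datetime64[ns] value built from an integer treats it as nanoseconds since the Unix epoch, so the year 2012 returned by the aggregation becomes 2012 ns after 1970-01-01. A small standard-library sketch of that arithmetic (format_ns_since_epoch is a hypothetical helper written for illustration, not a pandas function):

```python
from datetime import datetime, timezone


def format_ns_since_epoch(ns: int) -> str:
    """Render an integer as nanoseconds since the Unix epoch, in UTC."""
    seconds, nanos = divmod(ns, 1_000_000_000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only has microsecond precision, so append nanoseconds manually
    return f"{base:%Y-%m-%d %H:%M:%S}.{nanos:09d}+00:00"


year_returned_by_agg = 2012
rendered = format_ns_since_epoch(year_returned_by_agg)
print(rendered)  # 1970-01-01 00:00:00.000002012+00:00
```

The integer is a perfectly valid input for the constructor, which is why the cast back "succeeds" and silently changes the meaning of the result.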
