API: make CategoricalIndex._concat consistent with pd.concat #41626

jbrockmendel · 2021-05-23T02:58:55Z

Index._concat (used by Index.append) is a thin wrapper around concat_compat. It is overridden by CategoricalIndex so that CategoricalDtype is retained more often than it is in concat_compat. We should make these match.

If we just rip out CategoricalIndex._concat, we break 6 tests, all of which boil down to:

    def test_append_category_objects(self, ci):
        # with objects
        result = ci.append(Index(["c", "a"]))
        expected = CategoricalIndex(list("aabbcaca"), categories=ci.categories)
>       tm.assert_index_equal(result, expected, exact=True)

If we go the other way and change concat_compat, we break 6 different tests, all of which involve all-empty arrays or arrays that can be losslessly cast to the Categorical's dtype, e.g. (edited for legibility):

    def test_concat_empty_series_dtype_category_with_array(self):
        # GH#18515
        left = Series(np.array([]), dtype="category")
        right = Series(dtype="float64")
        result = concat([left, right])
>        assert result.dtype == "float64"


    def test_concat_categorical_coercion(self):
        # GH 13524
    
        # category + not-category => not-category
        s1 = Series([1, 2, np.nan], dtype="category")
        s2 = Series([2, 1, 2])
    
        exp = Series([1, 2, np.nan, 2, 1, 2], dtype="object")
>       tm.assert_series_equal(pd.concat([s1, s2], ignore_index=True), exp)
E       AssertionError: Attributes of Series are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=[1, 2], ordered=False)
E       [right]: object

Changing concat_compat results in much more convenient behavior, but it is textbook "values-dependent behavior" that in general we want to avoid (cc @jorisvandenbossche).
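To see the two code paths side by side, here is a minimal sketch. The variable names are illustrative, and `ci` is only assumed to resemble the test fixture. The resulting dtypes depend on the installed pandas version, so they are printed rather than asserted.

```python
import pandas as pd

# Assumed stand-in for the `ci` test fixture: the categories already
# cover every value that gets appended below.
ci = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))
other = pd.Index(["c", "a"])  # non-categorical index

# Path 1: Index.append, which dispatches to CategoricalIndex._concat.
appended = ci.append(other)

# Path 2: pd.concat on Series holding the same data.
concatenated = pd.concat([pd.Series(ci), pd.Series(other)], ignore_index=True)

# The values agree on both paths; the dtypes are what this issue is about.
print("Index.append ->", appended.dtype)
print("pd.concat    ->", concatenated.dtype)
```

Swapping `other` for an index that contains a value outside `ci.categories` is a quick way to check whether the result changes with the values rather than with the input dtypes alone.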

TomAugspurger · 2021-06-02T20:21:57Z

I vaguely recall some discussions around changing the default behavior of pd.concat to union categories when provided with multiple CategoricalDtype objects, rather than casting to object. IMO, we should address that first (through a deprecation cycle). IIUC it'd then be easier to make the two consistent.
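For reference, unioning categories can already be done explicitly with `pandas.api.types.union_categoricals`. The sketch below uses made-up data and only shows the opt-in route; it is not a proposal for how a changed pd.concat default would be implemented.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categorical Series whose categories differ.
s1 = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["b", "c"], dtype="category")

# Explicitly union the categories instead of relying on pd.concat's default.
combined = union_categoricals([s1, s2])
result = pd.Series(combined)

print(result.dtype)
```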

jbrockmendel · 2021-06-03T15:33:21Z

that looks similar but i think it may be orthogonal. in all of the affected test cases i think we're dealing with one Categorical and one non-Categorical

jreback · 2021-06-24T13:15:41Z

IIUC we should strive to improve concat_compat to make this do better inference, e.g.

"If we go the other way and change concat_compat, we break 6 different tests, all of which involve all-empty arrays or arrays that can be losslessly cast to the Categorical's dtype, e.g (edited for legibility)"

is what I would do. I think it is a strict improvement.

jbrockmendel · 2021-06-26T02:46:09Z

@jorisvandenbossche want to weigh in here (before i get started on a PR)? one of the options here is value-dependent behavior

jorisvandenbossche · 2021-07-11T23:13:57Z

I think I would opt for preserving the strict behaviour of Series. Although it is certainly tempting to make an exception. But having the behavior depend on which numbers are present (eg in the last test example) really doesn't sound ideal. The user can always cast to the dtype of the first object for doing the concat.

(the case of concatting with an empty other Series is something that could be addressed separately, IMO, eg by having a "null" dtype for empty Series)

Other idea: if we find it onerous for the user to cast all arguments passed to concat/append themselves to ensure consistent dtypes, we could also add a keyword argument to concat/append that would do that for you. But this would then be a more general solution (for all dtypes), instead of adding a special case only for categorical dtype.
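A rough sketch of what casting to the first object's dtype could look like from the user's side. `concat_as_first_dtype` is a hypothetical helper written for illustration only; no such function or keyword exists in pandas.

```python
import numpy as np
import pandas as pd


def concat_as_first_dtype(objs, **kwargs):
    """Hypothetical helper: cast every Series to the first one's dtype, then concat."""
    target = objs[0].dtype
    return pd.concat([obj.astype(target) for obj in objs], **kwargs)


# Same data as test_concat_categorical_coercion above.
s1 = pd.Series([1, 2, np.nan], dtype="category")
s2 = pd.Series([2, 1, 2])

result = concat_as_first_dtype([s1, s2], ignore_index=True)
print(result.dtype)
```

Because the cast is explicit, the result dtype is determined by the first argument rather than by which values happen to be present. The trade-off is that any value in a later argument that is not among the categories becomes NaN when cast.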

jbrockmendel · 2022-04-15T17:13:02Z

Possibly related: #12509, #14016, #15332, #24093, #24845, #25019, #37480, #44099, #42840

jbrockmendel added the Bug and Needs Triage labels May 23, 2021

jbrockmendel added the API - Consistency, Reshaping, and Categorical labels and removed the Bug and Needs Triage labels Jun 6, 2021

jbrockmendel mentioned this issue Aug 4, 2021

API: make concat_compat match CategoricalIndex._concat #42892

Closed

mroeschke added the Needs Discussion label Aug 21, 2021

jbrockmendel mentioned this issue Dec 20, 2021

RLS: 1.4 #41957

Closed

jbrockmendel mentioned this issue Jan 21, 2022

API: Series[EA].fillna fallback behavior with incompatible value #45153

Open

jbrockmendel mentioned this issue Apr 15, 2022

RLS: 2.0 #46776

Closed
