Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: make CategoricalIndex._concat consistent with pd.concat #41626

Open
jbrockmendel opened this issue May 23, 2021 · 6 comments
Open

API: make CategoricalIndex._concat consistent with pd.concat #41626

jbrockmendel opened this issue May 23, 2021 · 6 comments
Labels
API - Consistency Internal Consistency of API/Behavior Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@jbrockmendel
Copy link
Member

Index._concat (used by Index.append) is thin wrapper around concat_compat. It is overriden by CategoricalIndex so that CategoricalDtype is retained more often than it is in concat_compat. We should make these match.

If we just rip CategoricalIndex._concat, we break 6 tests, all of which boil down to:

    def test_append_category_objects(self, ci):
        # with objects
        result = ci.append(Index(["c", "a"]))
        expected = CategoricalIndex(list("aabbcaca"), categories=ci.categories)
>       tm.assert_index_equal(result, expected, exact=True)

If we go the other way and change concat_compat, we break 6 different tests, all of which involve all-empty arrays or arrays that can be losslessly cast to the Categorical's dtype, e.g (edited for legibility)

    def test_concat_empty_series_dtype_category_with_array(self):
        # GH#18515
        left = Series(np.array([]), dtype="category")
        right = Series(dtype="float64")
        result = concat([left, right])
>        assert result.dtype == "float64"


    def test_concat_categorical_coercion(self):
        # GH 13524
    
        # category + not-category => not-category
        s1 = Series([1, 2, np.nan], dtype="category")
        s2 = Series([2, 1, 2])
    
        exp = Series([1, 2, np.nan, 2, 1, 2], dtype="object")
>       tm.assert_series_equal(pd.concat([s1, s2], ignore_index=True), exp)
E       AssertionError: Attributes of Series are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=[1, 2], ordered=False)
E       [right]: object

Changing concat_compat results in much more convenient behavior, but it is textbook "values-dependent behavior" that in general we want to avoid (cc @jorisvandenbossche)

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 23, 2021
@TomAugspurger
Copy link
Contributor

I vaguely recall some discussions around changing the default behavior of pd.concat to union categories when provided with multiple CategoricalDtype objects, rather than casting to object. IMO, we should address that first (through a deprecation cycle). IIUC it'd then be easier to make the two consistent.

@jbrockmendel
Copy link
Member Author

that looks similar but i think may be orthogonal. in all of the affected tests cases i think we're dealing with one Categorical and one non-Categorical

@jbrockmendel jbrockmendel added API - Consistency Internal Consistency of API/Behavior Reshaping Concat, Merge/Join, Stack/Unstack, Explode Categorical Categorical Data Type and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 6, 2021
@jreback
Copy link
Contributor

jreback commented Jun 24, 2021

IIUC we should strive to improve concat_compat to make this do better inference, e.g.

If we go the other way and change concat_compat, we break 6 different tests, all of which involve all-empty arrays or arrays that can be losslessly cast to the Categorical's dtype, e.g (edited for legibility)

is what would do. I think is a strict improvement.

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche want to weigh in here (before i get started on a PR)? one of the options here is value-dependent behavior

@jorisvandenbossche
Copy link
Member

I think I would opt for preserving the strict behaviour of Series. Although it is certainly tempting to make an exception. But having the behavior depend on which numbers are present (eg in the last test example) really doesn't sound ideal. The user can always cast to the dtype of the first object for doing the concat.

(the case of concatting with an empty other Series is something that could be addressed separately, IMO, eg by having a "null" dtype for empty Series)

Other idea: if we find it onerous for the user to cast all arguments passed to concat/append themselves to ensure consistent dtypes, we could also add a keyword argument to concat/append that would do that for you. But this would then be a more general solution (for all dtypes), instead of adding a special case only for categorical dtype.

@jbrockmendel
Copy link
Member Author

Possibly related: #12509, #14016, #15332, #24093, #24845, #25019, #37480, #44099, #42840

@jbrockmendel jbrockmendel mentioned this issue Apr 15, 2022
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants