BUG: creating Categorical from pandas Index/Series with "object" dtype infers string #62080

Open
wants to merge 9 commits into
base: main
29 changes: 29 additions & 0 deletions doc/source/user_guide/categorical.rst
@@ -1178,3 +1178,32 @@ Use ``copy=True`` to prevent such a behaviour or simply don't reuse ``Categorica
This also happens in some cases when you supply a NumPy array instead of a ``Categorical``:
using an int array (e.g. ``np.array([1,2,3,4])``) will exhibit the same behavior, while using
a string array (e.g. ``np.array(["a","b","c","a"])``) will not.

.. note::

    When constructing a :class:`pandas.Categorical` from a pandas :class:`Series` or
    :class:`Index` with ``dtype='object'``, the categories keep ``object`` dtype.
    When constructing from a NumPy array with ``dtype='object'`` or from a raw
    Python sequence, pandas infers the most specific dtype for the categories
    (for example, ``str`` if all elements are strings).

.. ipython:: python

    pd.options.future.infer_string = True

    ser = pd.Series(["foo", "bar", "baz"], dtype="object")
    idx = pd.Index(["foo", "bar", "baz"], dtype="object")
    arr = np.array(["foo", "bar", "baz"], dtype="object")
    pylist = ["foo", "bar", "baz"]

    cat_from_ser = pd.Categorical(ser)
    cat_from_idx = pd.Categorical(idx)
    cat_from_arr = pd.Categorical(arr)
    cat_from_list = pd.Categorical(pylist)

    # Series/Index with object dtype: categories stay object
    assert cat_from_ser.categories.dtype == "object"
    assert cat_from_idx.categories.dtype == "object"

    # NumPy array or list: categories are inferred as string dtype
    assert cat_from_arr.categories.dtype == "str"
    assert cat_from_list.categories.dtype == "str"
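A reader who needs ``object`` categories regardless of the input container could also pass the categories explicitly. The following is only a sketch and not part of this change: the names ``values`` and ``object_categories`` are illustrative, and it assumes that an ``Index`` passed as ``categories`` is used without re-inferring its dtype.

```python
import pandas as pd

values = ["foo", "bar", "baz"]

# Build the categories as an object-dtype Index up front, so the
# categories are supplied explicitly instead of being inferred from values.
# (object_categories is an illustrative name, not a pandas API.)
object_categories = pd.Index(sorted(set(values)), dtype=object)

cat = pd.Categorical(values, categories=object_categories)

print(cat.categories.dtype)
print(cat.codes.tolist())
```

Supplying the categories this way sidesteps the question of which container the values came from, at the cost of having to compute the categories yourself.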
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v3.0.0.rst
@@ -690,7 +690,7 @@ Categorical
- Bug in :meth:`Categorical.astype` where ``copy=False`` would still trigger a copy of the codes (:issue:`62000`)
- Bug in :meth:`DataFrame.pivot` and :meth:`DataFrame.set_index` raising an ``ArrowNotImplementedError`` for columns with pyarrow dictionary dtype (:issue:`53051`)
- Bug in :meth:`Series.convert_dtypes` with ``dtype_backend="pyarrow"`` where empty :class:`CategoricalDtype` :class:`Series` raised an error or got converted to ``null[pyarrow]`` (:issue:`59934`)
-
- Bug in :class:`Categorical` where constructing from a pandas :class:`Series` or :class:`Index` with ``dtype='object'`` did not preserve ``object`` dtype for the categories; the dtype is now preserved in these cases, while NumPy arrays with ``dtype='object'`` and plain Python sequences continue to infer the most specific dtype (for example, ``str`` if all elements are strings).

Datetimelike
^^^^^^^^^^^^
13 changes: 10 additions & 3 deletions pandas/core/arrays/categorical.py
@@ -457,6 +457,11 @@ def __init__(
                codes = arr.indices.to_numpy()
                dtype = CategoricalDtype(categories, values.dtype.pyarrow_dtype.ordered)
            else:
                # Check for a pandas Series/Index with object dtype
                preserve_object_dtype = False
                if isinstance(values, (ABCSeries, ABCIndex)):
                    if getattr(values.dtype, "name", None) == "object":
                        preserve_object_dtype = True
                if not isinstance(values, ABCIndex):
                    # in particular RangeIndex xref test_index_equal_range_categories
                    values = sanitize_array(values, None)
@@ -465,15 +470,17 @@ def __init__(
                except TypeError as err:
                    codes, categories = factorize(values, sort=False)
                    if dtype.ordered:
                        # raise, as we don't have a sortable data structure and so
                        # the user should give us one by specifying categories
                        raise TypeError(
                            "'values' is not ordered, please "
                            "explicitly specify the categories order "
                            "by passing in a categories argument."
                        ) from err

                # we're inferring from values
                # If we should preserve object dtype, force categories to object dtype
                if preserve_object_dtype:
                    from pandas import Index

                    categories = Index(categories, dtype=object, copy=False)
                dtype = CategoricalDtype(categories, dtype.ordered)

        elif isinstance(values.dtype, CategoricalDtype):
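To make it easier to see which inputs take the new branch, the check can be sketched as a standalone function. This is an illustration under stated assumptions, not the code in the PR: ``should_preserve_object`` is a hypothetical name, and it uses the public ``pd.Series`` and ``pd.Index`` classes instead of the internal ``ABCSeries`` and ``ABCIndex``.

```python
import numpy as np
import pandas as pd


def should_preserve_object(values) -> bool:
    """Hypothetical helper mirroring the container-and-dtype check in the diff.

    Returns True only for a pandas Series or Index whose dtype is named
    "object"; NumPy arrays and plain Python sequences return False.
    """
    if isinstance(values, (pd.Series, pd.Index)):
        return getattr(values.dtype, "name", None) == "object"
    return False


data = ["foo", "bar", "baz"]
checks = {
    "series_object": should_preserve_object(pd.Series(data, dtype="object")),
    "index_object": should_preserve_object(pd.Index(data, dtype="object")),
    "ndarray_object": should_preserve_object(np.array(data, dtype="object")),
    "python_list": should_preserve_object(data),
    "series_int": should_preserve_object(pd.Series([1, 2, 3])),
}
print(checks)
```

Only the first two entries are expected to be true, which matches the behaviour split described in the documentation note: the decision depends on the container type as well as on the dtype.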
25 changes: 25 additions & 0 deletions pandas/tests/extension/test_categorical.py
@@ -180,6 +180,31 @@ def test_array_repr(self, data, size):
    def test_groupby_extension_agg(self, as_index, data_for_grouping):
        super().test_groupby_extension_agg(as_index, data_for_grouping)

    def test_categorical_preserve_object_dtype_from_pandas(self):
        import numpy as np

        import pandas as pd

        # Scope the option so it does not leak into other tests
        with pd.option_context("future.infer_string", True):
            ser = pd.Series(["foo", "bar", "baz"], dtype="object")
            idx = pd.Index(["foo", "bar", "baz"], dtype="object")
            arr = np.array(["foo", "bar", "baz"], dtype="object")
            pylist = ["foo", "bar", "baz"]

            cat_from_ser = Categorical(ser)
            cat_from_idx = Categorical(idx)
            cat_from_arr = Categorical(arr)
            cat_from_list = Categorical(pylist)

            # Series/Index with object dtype: preserve object dtype
            assert cat_from_ser.categories.dtype == "object"
            assert cat_from_idx.categories.dtype == "object"

            # NumPy array or list: infer string dtype
            assert cat_from_arr.categories.dtype == "str"
            assert cat_from_list.categories.dtype == "str"


class Test2DCompat(base.NDArrayBacked2DTests):
    def test_repr_2d(self, data):