BUG: creating Categorical from pandas Index/Series with "object" dtype infers string #62080

Status: Open. Wants to merge 9 commits into base: main.

niruta25
Contributor

@niruta25 niruta25 commented Aug 9, 2025

Preserve object dtype for categories when constructing Categorical from pandas objects

This PR fixes an inconsistency in how pandas infers the dtype of categories when constructing a Categorical from different input types:

  • When constructing a Categorical from a pandas Series or Index with dtype="object", the categories' dtype is now preserved as object.
  • When constructing from a NumPy array with dtype="object" or a raw Python sequence, pandas continues to infer the most specific dtype for the categories (e.g., str if all elements are strings).

This change brings the behavior of Categorical in line with how Series and Index handle dtype preservation, making the API more consistent and predictable.

Example

import numpy as np
import pandas as pd

pd.options.future.infer_string = True

ser = pd.Series(["foo", "bar", "baz"], dtype="object")
idx = pd.Index(["foo", "bar", "baz"], dtype="object")
arr = np.array(["foo", "bar", "baz"], dtype="object")
pylist = ["foo", "bar", "baz"]

cat_from_ser = pd.Categorical(ser)
cat_from_idx = pd.Categorical(idx)
cat_from_arr = pd.Categorical(arr)
cat_from_list = pd.Categorical(pylist)

# Series/Index with object dtype: preserve object dtype
assert cat_from_ser.categories.dtype == "object"
assert cat_from_idx.categories.dtype == "object"

# Numpy array or list: infer string dtype
assert cat_from_arr.categories.dtype == "str"
assert cat_from_list.categories.dtype == "str"

Documentation and release notes have been updated.
Closes: #61778

niruta25 changed the title from "Niruta issue61778" to "BUG: creating Categorical from pandas Index/Series with "object" dtype infers string" on Aug 9, 2025
Contributor Author

niruta25 commented Aug 9, 2025

@jbrockmendel Always preserving object dtype for categories when constructing a Categorical from a pandas Series or Index with dtype="object" is a behavioral change that affects a wide range of pandas internals and user-facing APIs, so I am seeing a lot of failures.

I see two ways to resolve this without changing the overall behavior:

  1. Only preserve object dtype when not all elements are strings
  • If the input is a pandas Series/Index with dtype="object", preserve object dtype for categories only if at least one element is not a string.
  • If all elements are strings, allow inference to str (the current behavior).
  2. Add a keyword argument to Categorical (e.g., preserve_object_dtype=False)
  • Add an explicit option to the Categorical constructor to preserve the object dtype for categories.
  • Default to the current behavior, but allow users to opt in to preservation.
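
To make the two options concrete, here is a rough sketch of the decision rule each one implies, written as plain Python rather than pandas internals. The function names and the `from_object_pandas` flag are hypothetical and only for illustration; `preserve_object_dtype` is the keyword proposed above and is not an existing pandas argument. The sketch also assumes that option 2 applies only to object-dtype Series/Index inputs, which the proposal does not spell out.

```python
def is_all_strings(values) -> bool:
    """Return True if every element is a Python str."""
    return all(isinstance(v, str) for v in values)


def preserve_option_1(values, from_object_pandas: bool) -> bool:
    """Option 1 (hypothetical helper): keep object dtype for categories only
    when the input is an object-dtype Series/Index AND not all elements are
    strings. All-string input still falls through to str inference."""
    return from_object_pandas and not is_all_strings(values)


def preserve_option_2(from_object_pandas: bool,
                      preserve_object_dtype: bool = False) -> bool:
    """Option 2 (hypothetical helper): keep object dtype only when the caller
    opts in via the proposed keyword. Assumption: the keyword only matters
    for object-dtype Series/Index inputs."""
    return from_object_pandas and preserve_object_dtype


# True means "keep object dtype for categories"; False means "infer as today".
decisions = {
    "option 1, all strings": preserve_option_1(["foo", "bar"], True),
    "option 1, mixed values": preserve_option_1(["foo", 1], True),
    "option 2, default": preserve_option_2(True),
    "option 2, opt in": preserve_option_2(True, preserve_object_dtype=True),
}
```

Under option 1 the all-string case from the issue would still infer str, so it would not change the behavior reported there; option 2 leaves the choice to the caller.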

Let me know your thoughts.
