Setting with enlargement on categorical data #25383

0phoff · 2019-02-20T09:33:10Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame.from_dict({'reg': [0, 1, 2], 'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd'])})
print(df.dtypes)  # reg is int64, cat is categorical

df.loc[3] = (3, 'c')  # add a row with a categorical value that exists in the categories
print(df.dtypes)  # reg is int64, cat is **object**

Problem description

The dtype changes silently, without any warning. In this dummy example that means we lose the information that 'd' is also a possible value, so simply calling astype('category') afterwards would not restore it.

Note: We receive a lot of issues on our GitHub tracker, so it is very possible that your issue has been posted before. Please check first before submitting so that we do not have to handle and close duplicates!

I couldn't find an existing issue about this. I did find a few related reports: performing concat and append on categoricals also changes dtypes. I would love these functions to have a keyword to control that behaviour (e.g. perform a union of categories), but that is a different issue that has already been discussed. (Just letting you know that there are people out there who would love this feature, instead of having to meddle with pandas.api.types.union_categoricals.)
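
For readers unfamiliar with the workaround mentioned above, here is a minimal sketch of what "meddling with pandas.api.types.union_categoricals" looks like. The variable names are made up for illustration, and the exact dtype that concat falls back to may depend on the pandas version, so the comments only say that it is no longer categorical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals

# Two categoricals whose category sets differ.
left = pd.Categorical(['a', 'b'], categories=['a', 'b'])
right = pd.Categorical(['c'], categories=['c', 'd'])

# Plain concat does not merge the category sets, so the result
# is no longer categorical.
plain = pd.concat([pd.Series(left), pd.Series(right)], ignore_index=True)
print(isinstance(plain.dtype, CategoricalDtype))  # False

# union_categoricals builds a categorical over the union of the categories.
combined = pd.Series(union_categoricals([left, right]))
print(list(combined.cat.categories))  # ['a', 'b', 'c', 'd']
```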

Expected Output

Keep the categorical dtype if the added value is in the list of categories, throw an error/warning otherwise.
If people don't care about the categorical, they can always call .astype('object') before adding the row?

I think this solution is also in the spirit of 'explicit is better than implicit'?
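
Until something like this exists in pandas, the expected behaviour can be approximated in user code. The sketch below is not a pandas API: append_row_keep_categories is a hypothetical helper, and it assumes the new row is a tuple with one value per column. It builds a one-row frame, rejects values outside the declared categories, casts to the original dtypes, and concatenates.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def append_row_keep_categories(df, row, label):
    """Hypothetical helper: append one row without losing categorical dtypes."""
    new = pd.DataFrame([row], columns=df.columns, index=[label])
    for col in df.columns:
        dtype = df[col].dtype
        if isinstance(dtype, CategoricalDtype):
            value = new.at[label, col]
            # Refuse values outside the declared categories instead of
            # letting the cast below silently turn them into NaN.
            if value not in dtype.categories:
                raise ValueError(f"{value!r} is not a category of column {col!r}")
        # Give the new row the same dtype as the existing column.
        new[col] = new[col].astype(dtype)
    # Both frames now share identical dtypes, so concat preserves them.
    return pd.concat([df, new])


df = pd.DataFrame({
    'reg': [0, 1, 2],
    'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd']),
})
out = append_row_keep_categories(df, (3, 'c'), 3)
print(out.dtypes)  # cat stays categorical, with 'd' still a known category
```

Raising on unknown values is a design choice matching the request above; a caller who prefers the object fallback can call .astype('object') first.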

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-33-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: 4.1.1
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.16.0
scipy: 1.1.0
pyarrow: 0.12.0
xarray: None
IPython: 6.5.0
sphinx: 1.7.9
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


WillAyd · 2019-02-20T15:36:07Z

Not sure I see the issue here - from the code posted it looks like you are trying to mix tuples with categorical data which should be an object.

Did you mean to use the add_categories method?

http://pandas.pydata.org/pandas-docs/stable//user_guide/categorical.html#appending-new-categories

jreback · 2019-02-20T15:41:33Z

this seems likely the same issues as you mentioned above; append and concat are used in indexing expansion

the core issue should be addressed before this

note that indexing expansion is pretty inefficient and might be removed in the future; better to explicitly append (which is also inefficient if doing it many times, but it's more obvious what is happening)
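
As a rough illustration of the "explicit" route when many rows are added, one option is to collect the rows first and concatenate once. This is only a sketch under assumptions: the pending list and the index range are invented for the example, and pd.concat stands in for append. Note that astype turns values outside the declared categories into NaN rather than raising.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'reg': [0, 1, 2],
    'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd']),
})

# Rows gathered elsewhere (hypothetical data for this sketch).
pending = [(3, 'c'), (4, 'd'), (5, 'a')]

# Build all new rows in one frame, continuing the integer index.
extra = pd.DataFrame(
    pending,
    columns=df.columns,
    index=range(len(df), len(df) + len(pending)),
)
# Cast every column to the dtype of the existing frame.
extra = extra.astype(df.dtypes.to_dict())

# A single concat instead of one enlargement per row.
result = pd.concat([df, extra])
print(result.dtypes)
```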

0phoff · 2019-02-20T16:10:48Z

> Not sure I see the issue here - from the code posted it looks like you are trying to mix tuples with categorical data which should be an object.

You can use .loc[non-existing index] = ('colval1', 'colval2', ...) to set a new row, which is what I'm doing.
Not sure if you can wrap such a value in a categorical, but if that's the case, it still seems quite a burden to do.

add_categories is not what I want. I do not want to add an extra possible category, I want to add an extra row of data in a dataframe that uses one or more categorical columns.
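
To make the distinction concrete, a small sketch (variable names are illustrative only): add_categories widens the set of allowed values but leaves the number of rows untouched, which is a different operation from adding a row.

```python
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd']))

# add_categories returns a new Series with an extra allowed value ...
widened = s.cat.add_categories(['e'])
print(list(widened.cat.categories))  # ['a', 'b', 'c', 'd', 'e']

# ... but the data itself still has the same three rows.
print(len(widened))  # 3
```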

> this seems likely the same issues as you mentioned above; append and concat are used in indexing expansion
>
> the core issue should be addressed before this

I don't know enough of the pandas internals, but it seems kind of logical. I think overall support for these kinds of merging operations with categoricals is lacking in pandas.

> note that indexing expansion is pretty inefficient and might be removed in the future; better to explicitly append (which is also inefficient if doing it many times, but it's more obvious what is happening)

I thought it was just some sugar coating on top of append() with a nicer syntax?
Is it that much more compute time, besides checking whether the index is already in the dataframe?

WillAyd added the Needs Info (Clarification about behavior needed to assess issue) and Categorical (Categorical Data Type) labels Feb 20, 2019

mroeschke added the Indexing (Related to indexing on series/frames, not to indexes themselves) label and removed the Needs Info (Clarification about behavior needed to assess issue) label Mar 8, 2020

This was referenced May 8, 2020

BUG: adding a value not in the Categories does not raise a ValueError on a Series when adding the value to a new index #33952

Closed

Add fix to raise error when category value 'x' is not predefined but is assigned through df.loc[..]=x #34011

Closed

mroeschke added the Bug label Jun 28, 2020

jreback added this to the Contributions Welcome milestone Nov 4, 2020

simonjayhawkins mentioned this issue Nov 12, 2020

API: should setitem-with-expansion _ever_ raise? #37774

Closed

jbrockmendel added the setitem-with-expansion label Jan 8, 2022

simonjayhawkins mentioned this issue Jul 9, 2022

BUG: Setting incompatible values into ea column raises instead of casting to object #47577

Closed


mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
