Setting with enlargement on categorical data #25383

Open
0phoff opened this issue Feb 20, 2019 · 3 comments
Labels
Bug · Categorical (Categorical Data Type) · Indexing (Related to indexing on series/frames, not to indexes themselves) · setitem-with-expansion

Comments


0phoff commented Feb 20, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame.from_dict({
    'reg': [0, 1, 2],
    'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd']),
})
print(df.dtypes)  # reg is int64, cat is category

df.loc[3] = (3, 'c')  # add a row whose categorical value already exists in the categories
print(df.dtypes)  # reg is int64, cat is **object**

Problem description

The dtype changes silently, without any warning. In this dummy example, that means we lose the information that 'd' is also a possible value, so simply calling astype('category') afterwards would not restore it.
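To make the point about astype('category') concrete, here is a minimal sketch. It assumes the column has already fallen back to object dtype and that the original dtype was saved beforehand; the names `cat_dtype`, `as_object`, `inferred` and `restored` are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# The dtype the column had before the row was added (saved by hand here).
cat_dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'])

# The values as they look once the column has become object dtype.
as_object = pd.Series(['a', 'b', 'b', 'c'], dtype=object)

# astype('category') infers the categories from the observed values only.
inferred = as_object.astype('category')
print(list(inferred.cat.categories))  # ['a', 'b', 'c']

# Casting back to the saved dtype keeps the unused category 'd'.
restored = as_object.astype(cat_dtype)
print(list(restored.cat.categories))  # ['a', 'b', 'c', 'd']
```

This only works as a workaround if the original dtype is still around to cast back to, which is exactly the bookkeeping the issue asks pandas to make unnecessary.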

Note: We receive a lot of issues on our GitHub tracker, so it is very possible that your issue has been posted before. Please check first before submitting so that we do not have to handle and close duplicates!

I couldn't find an existing issue about this. However, I did find a few related things, such as concat and append on categoricals also changing dtypes. I would love these functions to have a keyword to control that behaviour (e.g. perform a union of the categories), but that is a different issue that has already been discussed. I'm just letting you know that there are people out there who would love this feature, instead of having to meddle with pandas.api.types.union_categoricals.
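For readers who have not hit the related concat behaviour, the following sketch shows it next to pandas.api.types.union_categoricals. The two small series are invented for illustration, and the exact fallback dtype of the concatenated result may depend on the pandas version, so the sketch only checks that it is no longer categorical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals

left = pd.Series(pd.Categorical(['a', 'b'], categories=['a', 'b']))
right = pd.Series(pd.Categorical(['c'], categories=['c', 'd']))

# The category sets differ, so the concatenated result is not categorical.
plain = pd.concat([left, right], ignore_index=True)
print(isinstance(plain.dtype, CategoricalDtype))  # False

# union_categoricals merges the category sets and keeps a categorical result.
merged = pd.Series(union_categoricals([left, right]))
print(list(merged.cat.categories))  # ['a', 'b', 'c', 'd']
```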

Expected Output

Keep the categorical dtype if the added value is in the list of categories, throw an error/warning otherwise.
If people don't care about the categorical, they can always call .astype('object') before adding the row?

I think this solution is also in the spirit of 'explicit is better than implicit'?
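The requested behaviour can be approximated in user code today. The sketch below is not a pandas API: `append_row_strict` is a hypothetical helper written for this illustration. It assumes the new row is given as a tuple in column order and builds a one-row frame with the existing dtypes before concatenating.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def append_row_strict(df, label, values):
    """Hypothetical helper: return df plus one row, keeping categorical dtypes."""
    columns = {}
    for name, value in zip(df.columns, values):
        dtype = df[name].dtype
        # Refuse values that are not among the column's categories.
        if isinstance(dtype, CategoricalDtype) and value not in dtype.categories:
            raise ValueError(f"{value!r} is not a category of column {name!r}")
        # Build each one-row column with the dtype of the existing column.
        columns[name] = pd.Series([value], index=[label], dtype=dtype)
    return pd.concat([df, pd.DataFrame(columns)])


df = pd.DataFrame({
    'reg': [0, 1, 2],
    'cat': pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd']),
})
df = append_row_strict(df, 3, (3, 'c'))
print(df.dtypes)
```

Because both frames carry the identical categorical dtype, the concatenation has no reason to fall back to a plain dtype, and an unknown value such as 'z' raises instead of being stored silently.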

Output of pd.show_versions()

INSTALLED VERSIONS
------------------

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-33-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: 4.1.1
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.16.0
scipy: 1.1.0
pyarrow: 0.12.0
xarray: None
IPython: 6.5.0
sphinx: 1.7.9
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: 0.4.0
matplotlib: 2.2.3
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Member

WillAyd commented Feb 20, 2019

Not sure I see the issue here - from the code posted it looks like you are trying to mix tuples with categorical data which should be an object.

Do you mean to use the add_categories method?

http://pandas.pydata.org/pandas-docs/stable//user_guide/categorical.html#appending-new-categories

WillAyd added the Needs Info (Clarification about behavior needed to assess issue) and Categorical (Categorical Data Type) labels Feb 20, 2019
Contributor

jreback commented Feb 20, 2019

this seems likely the same issue as you mentioned above; append and concat are used in indexing expansion

the core issue should be addressed before this

note that indexing expansion is pretty inefficient and might be removed in the future; better to explicitly append (which is also inefficient if doing it many times but it’s more obvious what is happening)
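One way to sidestep the cost of growing a frame row by row is to collect the new rows first and concatenate once. The sketch below is only an illustration under assumptions: the rows in `new_rows` are invented, and it uses pd.concat because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
    'reg': [0, 1, 2],
    'cat': pd.Categorical(['a', 'b', 'b'], dtype=cat_dtype),
})

# Collect the rows to add, then build a single frame from them.
new_rows = [(3, 'c'), (4, 'd'), (5, 'a')]
extra = pd.DataFrame(
    new_rows,
    columns=df.columns,
    index=range(len(df), len(df) + len(new_rows)),
)

# Give the new frame the same dtypes as the existing one, then concatenate once.
extra = extra.astype(df.dtypes.to_dict())
result = pd.concat([df, extra])
print(result.dtypes)
```

A single concatenation copies the data once, whereas each enlargement or append copies the whole frame again.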

Author

0phoff commented Feb 20, 2019

> Not sure I see the issue here - from the code posted it looks like you are trying to mix tuples with categorical data which should be an object.

You can use .loc[non-existing index] = ('colval1', 'colval2', ...) to set a new row, which is what I'm doing.
Not sure if you can wrap such a value in a categorical, but if that's the case, it still seems quite a burden to do.

add_categories is not what I want. I do not want to add an extra possible category, I want to add an extra row of data in a dataframe that uses one or more categorical columns.
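The difference between the two operations can be seen in a small sketch. The series `s` mirrors the 'cat' column from the example at the top of the issue, and the extra category 'e' is made up for illustration.

```python
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'b'], categories=['a', 'b', 'c', 'd']))

# add_categories widens the set of allowed values...
widened = s.cat.add_categories(['e'])
print(list(widened.cat.categories))  # ['a', 'b', 'c', 'd', 'e']

# ...but it does not add any data: the series still has three rows.
print(len(widened))  # 3
```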


> this seems likely the same issue as you mentioned above; append and concat are used in indexing expansion
>
> the core issue should be addressed before this

I don't know enough of the pandas internals, but it seems kind of logical. I think overall support for these kinds of merging operations with categoricals is lacking in pandas.

> note that indexing expansion is pretty inefficient and might be removed in the future; better to explicitly append (which is also inefficient if doing it many times but it’s more obvious what is happening)

I thought it was just some sugar coating on top of append() with a nicer syntax?
Is it that much more compute time, besides checking whether the index is already in the dataframe?
