categorical plots - unused categories mess up element spacing and width #3736

Open
Gabriel-Kissin opened this issue Jul 23, 2024 · 4 comments

@Gabriel-Kissin

Several of seaborn's functions for plotting categorical data don't cope well when the categories list includes unused categories.

I've noticed two main issues:

  1. element width shrinks
  2. element spacing doesn't match the x-axis.

It doesn't make a difference if you use vertical or horizontal orientation.

The issue only occurs when the same feature is used for the categorical x/y variable and for the hue. If no hue is provided, or if the hue uses a different feature, there is no issue.

The issues occur for sns.barplot, sns.boxplot, sns.boxenplot and sns.violinplot, whereas sns.pointplot, sns.stripplot and sns.swarmplot are fine.

I've reproduced the issue with the penguins dataset we all know and love from the seaborn docs. In the following MRE, the first col is the raw penguins data. The second col is after converting it to categorical (also works fine). The final col is after adding an unused category to the data, which causes the above two issues:

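[Image: grid of plots, one row per plotting function; columns show the raw data, the categorical data, and the categorical data with an unused category added]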

It looks as though it is failing to recognise that the hue and y are the same, so it makes space on the plot within each y for all the hues. This is what makes each element a) get squeezed, and b) not align nicely with the y ticks. Presumably the unused category is somehow the cause of the confusion.

Code to generate the above plot:

import matplotlib.pyplot as plt
import seaborn as sns

penguins = sns.load_dataset("penguins")

plotters = [sns.barplot, sns.boxplot, sns.boxenplot, sns.violinplot, 
            sns.pointplot, sns.stripplot, sns.swarmplot]

# with horizontal orientation
fig, axs = plt.subplots(ncols=3, nrows=len(plotters), figsize=(16, 3*len(plotters)), sharex=True, sharey=False)
kwargs = dict(data=penguins, x="body_mass_g", y="island", hue="island", legend=False,)

# If no hue is provided, or if the hue uses a different feature, there is no issue.
# kwargs = dict(data=penguins, x="body_mass_g", y="island", hue="sex", legend=True,)
# kwargs = dict(data=penguins, x="body_mass_g", y="island", legend=False,)

# same issue with vertical orientation
# fig, axs = plt.subplots(ncols=3, nrows=len(plotters), figsize=(16, 3*len(plotters)), sharex=False, sharey=True)
# kwargs = dict(data=penguins, x="island", y="body_mass_g", hue="island", legend=False,)

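# Build the three variants once, so that every row plots the same data
# (mutating penguins inside the loop would leave it categorical after the first row).
raw = penguins.copy()

categorical = penguins.copy()
cat_cols = categorical.select_dtypes('O').columns
categorical[cat_cols] = categorical[cat_cols].astype('category')

unused = categorical.copy()
unused["island"] = unused["island"].cat.add_categories(['Uninhabited Island '])

for i, plotter in enumerate(plotters):

    axs[i, 1].set_title(plotter.__name__)

    # columns: raw data, categorical data, categorical data with an unused category
    for j, df in enumerate([raw, categorical, unused]):
        plotter(ax=axs[i, j], **{**kwargs, "data": df})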


plt.tight_layout()
plt.show()
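
Since the trouble comes from the unused category, one data-side workaround may be to drop unused categories before plotting. A minimal sketch with a made-up series standing in for the island column (the values are illustrative only):

```python
import pandas as pd

# Toy stand-in for penguins["island"], with one category that never occurs.
island = pd.Series(["Biscoe", "Dream", "Torgersen"], dtype="category")
island = island.cat.add_categories(["Uninhabited Island"])
print(list(island.cat.categories))

# Drop every category that has no observations.
cleaned = island.cat.remove_unused_categories()
print(list(cleaned.cat.categories))
```

The trade-off is that the empty category is gone from the data, so this only helps when it does not need to be kept.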

Many thanks as always for the superb library!

@mwaskom
Owner

mwaskom commented Jul 23, 2024

I think you want to set dodge=False here.

@Gabriel-Kissin
Author

Right - that indeed fixes it, thanks! - though perhaps the default dodge='auto' should recognise that the hue and categorical / orient variable are still the same, and therefore set dodge=False automatically?
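
Until something like that exists, the decision could be made on the caller's side. The sketch below uses a hypothetical helper, choose_dodge, which is not part of seaborn and only covers the case in this thread, where hue repeats the categorical variable. It does not attempt the general nesting check:

```python
import pandas as pd

def choose_dodge(data: pd.DataFrame, cat_col: str, hue_col: str) -> bool:
    """Hypothetical helper: return False when hue repeats the categorical variable."""
    if cat_col == hue_col:
        return False
    # Compare values as strings so that differing dtypes do not matter.
    return not data[cat_col].astype(str).equals(data[hue_col].astype(str))

# Toy data (made up for illustration).
df = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Dream"],
    "sex": ["Male", "Female", "Female", "Male"],
})
df["island"] = df["island"].astype("category")
df["hue_col"] = df["island"]  # a copy of the categorical variable

print(choose_dodge(df, "island", "island"))
print(choose_dodge(df, "island", "hue_col"))
print(choose_dodge(df, "island", "sex"))
```

The result could then be passed along as dodge=choose_dodge(df, "island", "island") in the plotting call.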

@mwaskom
Owner

mwaskom commented Jul 30, 2024

Yeah — determining whether dodge is needed is a surprisingly hard problem. Here's the code that's currently doing it; not sure why it isn't working with your example.

@jhncls

jhncls commented Aug 1, 2024

The reason that _dodge_needed() doesn't work as expected seems to be pandas' .value_counts() behaving differently when one or multiple columns are counted. With one column, there is a value count for each of the categories. With multiple columns, the categories are ignored, and only non-zero counts of combinations are reported.

Using the following modified dataframe for testing:

import seaborn as sns

penguins = sns.load_dataset('penguins')
penguins['island'] = penguins['island'].astype('category')
penguins['island'] = penguins['island'].cat.add_categories(['Uninhabited Island'])
penguins['hue_col'] = penguins['island']

Then penguins[['island']].value_counts() gives a series with one index:

island            
Biscoe                168
Dream                 124
Torgersen              52
Uninhabited Island      0
Name: count, dtype: int64

And penguins[['island', 'hue_col']].value_counts() gives a series with two indices, counting the pairs:

island     hue_col  
Biscoe     Biscoe       168
Dream      Dream        124
Torgersen  Torgersen     52
Name: count, dtype: int64

Changing the test in _dodge_needed() from return orient.size != paired.size to
return np.count_nonzero(orient) != np.count_nonzero(paired) would probably solve the issue.
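
To try the idea outside seaborn, here is a standalone sketch of the proposed comparison. The function dodge_needed_sketch and the toy data are hypothetical; this is not seaborn's actual _dodge_needed(), only the counting logic described above:

```python
import numpy as np
import pandas as pd

def dodge_needed_sketch(orient_values: pd.Series, hue_values: pd.Series) -> bool:
    """Hypothetical standalone version of the proposed check (not seaborn's code)."""
    frame = pd.DataFrame({"orient": orient_values, "hue": hue_values})
    orient = frame[["orient"]].value_counts()
    paired = frame[["orient", "hue"]].value_counts()
    # Count only combinations that actually occur, so zero-count rows
    # for unused categories cannot skew the comparison.
    return bool(np.count_nonzero(orient) != np.count_nonzero(paired))

# Toy data (made up for illustration), with one unused category.
island = pd.Series(["Biscoe", "Biscoe", "Dream", "Torgersen"], dtype="category")
island = island.cat.add_categories(["Uninhabited Island"])
sex = pd.Series(["Male", "Female", "Male", "Female"])

same_hue = dodge_needed_sketch(island, island)   # hue repeats the orient variable
other_hue = dodge_needed_sketch(island, sex)     # hue subdivides the orient variable
print(same_hue, other_hue)
```

Because zero counts are ignored on both sides, the result should not depend on whether value_counts() lists unused categories.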
