Series unique on categorical results in incorrect empty result #20484

Closed
2 tasks done
daviskirk opened this issue Dec 27, 2024 · 5 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@daviskirk
Contributor

daviskirk commented Dec 27, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

This is a pretty pervasive bug in a codebase I'm working with, but I can't build a minimal reproducible example from scratch.
However, if anyone has tips on how to get a deeper representation of the data (i.e. the "raw" data), I will update the issue accordingly.

I'm not getting a panic, but with certain dataframes that have categorical columns, pl.col("myCatColumn").unique() returns wrong results in polars 1.18.

I don't have a reproducible example yet, but the bug is something like this:

df.select(pl.col("myCatColumn").unique())

results in

shape: (0,)
Series: 'myCatColumn' [cat]
[
]

Same thing for:

df["myCatColumn"].unique()

resulting in

shape: (0,)
Series: 'myCatColumn' [cat]
[
]

but showing the full series works:

df["myCatColumn"]
shape: (3,)
Series: 'myCatColumn' [cat]
[
	"v1"
	"v1"
	"v1"
]

Log output

No response

Issue description

Interestingly, slicing with a length of exactly one results in another empty result:

df["myCatColumn"].slice(0,1).unique()
shape: (0,)
Series: 'myCatColumn' [cat]
[
]

results in

shape: (1,)
Series: 'myCatColumn' [cat]
[
	"v1"
]

and

df.select(pl.col("myCatColumn").to_physical())
shape: (3,)
Series: 'myCatColumn' [u32]
[
	0
	0
	0
]

There is no change for:

df["myCatColumn"].rechunk()
df["myCatColumn"].clone()

Finally:

df["myCatColumn"].unique(maintain_order=True)

results in the correct result:

shape: (1,)
Series: 'myCatColumn' [cat]
[
	"v1"
]

Potentially important additional info:

Expected behavior

Unique on a series / column should produce the correct result (in this case a single value "v1").

Installed versions

--------Version info---------
Polars:              1.18.0
Index type:          UInt32
Platform:            Linux-6.8.0-50-generic-x86_64-with-glibc2.39
Python:              3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.3.1
gevent               <not installed>
google.auth          2.29.0
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         <not installed>
numpy                1.24.3
openpyxl             3.1.2
pandas               2.1.4
pyarrow              16.0.0
pydantic             1.10.13
pyiceberg            <not installed>
sqlalchemy           1.4.39
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@mcrumiller
Contributor

Any chance you could work on getting an example?

@mcrumiller
Contributor

mcrumiller commented Dec 29, 2024

Ok, I found a minimal repro:

import polars as pl

# local: ok
s = pl.Series(["a"], dtype=pl.Categorical)
s.extend(s).unique()
# shape: (1,)
# Series: '' [cat]
# [
#         "a"
# ]

# global: fail
with pl.StringCache():
    s = pl.Series(["a"], dtype=pl.Categorical)
    s.extend(s).unique()
# shape: (0,)
# Series: '' [cat]
# [
# ]

@daviskirk
Contributor Author

Thanks! Also interesting:

with pl.StringCache():
    print(pl.Series(["a"], dtype=pl.Categorical).extend(pl.Series(["b"], dtype=pl.Categorical)).unique())
# shape: (2,)
# Series: '' [cat]
# [
# 	"a"
# 	null
# ]
with pl.StringCache():
    print(pl.Series(["a"], dtype=pl.Categorical).extend(pl.Series(["b", "c", "d"], dtype=pl.Categorical)).unique())
# shape: (4,)
# Series: '' [cat]
# [
# 	"a"
# 	"b"
# 	"c"
# 	null
# ]

@mcrumiller
Contributor

mcrumiller commented Dec 29, 2024

I thought the issue might be when there are multiple chunks, but that's not necessarily the case:

import polars as pl

with pl.StringCache():
    pl.Series(["a", None], dtype=pl.Categorical).unique()

# shape: (1,)
# Series: '' [cat]
# [
#         null
# ]

@coastalwhite this appears to stem from some issues in https://github.com/pola-rs/polars/blame/main/crates/polars-compute/src/unique/primitive.rs. I think some of the bit manipulation may have an off-by-one error somewhere, possibly due to the treatment of the null (largest) bit, but I'm not sure. You're much more familiar with this than I am--would you mind taking a look?

Update: at first the kernel code looked fine to me, and I believed we were just missing a + 1 in the categorical unique code. Nope, there is an issue.
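To make the off-by-one suspicion concrete, here is a toy model in plain Python. It is not the Polars kernel and does not claim to be the actual cause; it only assumes a bitmask-based unique where bit `i` marks code `i` as seen and the largest bit marks null, and shows how sizing the range one slot too small could drop a value. All names are hypothetical.

```python
def bitmask_unique(codes, n_slots):
    """Toy unique over integer codes using a bitmask.

    Bits 0..n_slots-1 track codes; bit n_slots (the largest) tracks null.
    Codes >= n_slots fall outside the tracked range and are silently lost.
    """
    null_bit = n_slots
    seen = 0
    for code in codes:
        if code is None:
            seen |= 1 << null_bit
        elif code < n_slots:
            seen |= 1 << code
    out = [i for i in range(n_slots) if (seen >> i) & 1]
    if (seen >> null_bit) & 1:
        out.append(None)
    return out


max_code = 0  # a single category, e.g. "a" -> code 0

# Range sized correctly: max_code + 1 slots
correct_dup = bitmask_unique([0, 0], max_code + 1)      # [0]
correct_null = bitmask_unique([0, None], max_code + 1)  # [0, None]

# Range sized one too small: the only code is dropped
broken_dup = bitmask_unique([0, 0], max_code)      # []
broken_null = bitmask_unique([0, None], max_code)  # [None]

print(correct_dup, correct_null, broken_dup, broken_null)
```

In this toy, the undersized range yields an empty result for `["a", "a"]` and a null-only result for `["a", None]`, which resembles the outputs reported above in shape. It does not reproduce every case in the thread.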

@ritchie46
Member

fixed by #20524
