Series unique on categorical results in incorrect empty result #20484
Comments
Any chance you could work on getting an example? |
Ok, I found a minimal repro:
import polars as pl

# local: ok
s = pl.Series(["a"], dtype=pl.Categorical)
s.extend(s).unique()
# shape: (1,)
# Series: '' [cat]
# [
# "a"
# ]

# global: fail
with pl.StringCache():
    s = pl.Series(["a"], dtype=pl.Categorical)
    s.extend(s).unique()
# shape: (0,)
# Series: '' [cat]
# [
# ]
|
Thanks! Also interesting:
with pl.StringCache():
    print(pl.Series(["a"], dtype=pl.Categorical).extend(pl.Series(["b"], dtype=pl.Categorical)).unique())
# shape: (2,)
# Series: '' [cat]
# [
# "a"
# null
# ]

with pl.StringCache():
    print(pl.Series(["a"], dtype=pl.Categorical).extend(pl.Series(["b", "c", "d"], dtype=pl.Categorical)).unique())
# shape: (2,)
# Series: '' [cat]
# [
# "a"
# "b"
# "c"
# null
# ]
|
I thought the issue might be when there are multiple chunks, but that's not necessarily the case:
import polars as pl

with pl.StringCache():
    pl.Series(["a", None], dtype=pl.Categorical).unique()
# shape: (1,)
# Series: '' [cat]
# [
# null
# ]

@coastalwhite this appears to stem from some issues in https://github.com/pola-rs/polars/blame/main/crates/polars-compute/src/unique/primitive.rs. I think some of the bit manipulation may have an off-by-one error somewhere, possibly due to the treatment of the null (largest) bit, but I'm not sure. You're much more familiar with this than I am; would you mind taking a look?
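To make the hypothesis above concrete, here is a toy model in plain Python of a range-based bitmask "unique" in which the largest bit is reserved for null. It is not taken from `polars-compute`; the function name `unique_via_bitmask` and the layout (one bit per value offset, null at index `span`) are assumptions chosen only to show where an off-by-one could creep in.

```python
def unique_via_bitmask(values):
    """Toy sketch: distinct values of a small-range list of ints/None.

    Assumed layout (hypothetical, not Polars' real one):
    bit i        -> value lo + i was seen, for i in [0, span)
    bit `span`   -> a null (None) was seen (the "largest" bit)
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Only nulls (or nothing at all) in the input.
        return [None] if values else []

    lo, hi = min(non_null), max(non_null)
    span = hi - lo + 1   # number of value bits; using hi - lo here would drop `hi`
    null_bit = span      # using span - 1 here would collide with the bit for `hi`

    seen = 0
    for v in values:
        bit = null_bit if v is None else v - lo
        seen |= 1 << bit

    # Decode value bits in ascending order, then the null bit.
    out = [lo + i for i in range(span) if (seen >> i) & 1]
    if (seen >> null_bit) & 1:
        out.append(None)
    return out
```

In this model, getting `span` or `null_bit` wrong by one is enough to lose the largest value or to report a null that was never in the input, which is the kind of symptom shown in the outputs above. Whether the real bug had this shape is not established by the thread.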
|
fixed by #20524 |
Checks
Reproducible example
This is a pretty pervasive bug in a codebase I'm working with, but I can't build a minimal reproducing example from scratch.
However, if anyone has tips on how to get a deeper representation of the data (i.e. the "raw" data), I will update the issue accordingly.
I'm not getting a panic, but with certain DataFrames that have categorical columns I'm having problems with
pl.col("myCatColumn").unique()
in Polars 1.18.
I don't have a reproducible example yet, but the bug is something like this:
results in
Same thing for:
resulting in
but showing the full series works:
Log output
No response
Issue description
Interestingly, slicing with a length of exactly one results in another empty result:
results in
and
There is no change for:
Finally:
results in the correct result:
Potentially important additional info:
pl.enable_string_cache()
Expected behavior
Unique on a series / column should produce the correct result (in this case a single value "v1").
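The expected behavior can be turned into a small order-insensitive check. The sketch below is plain Python and does not depend on Polars; the helper name `unique_matches_reference` is hypothetical. It compares a reported "unique" result against a reference built with a Python `set`, treating `None` as an ordinary value.

```python
def unique_matches_reference(values, unique_result):
    """Return True if `unique_result` holds exactly the distinct items of `values`.

    Order is ignored; None counts as a regular value. Duplicates in
    `unique_result` or missing/extra items make the check fail.
    """
    expected = set(values)
    got = list(unique_result)
    return len(got) == len(expected) and set(got) == expected


# The failing cases reported in this issue, expressed as plain lists:
print(unique_matches_reference(["a", "a"], []))            # empty result for ["a", "a"]
print(unique_matches_reference(["a", "b"], ["a", None]))   # "b" replaced by null
print(unique_matches_reference(["a", None], [None]))       # "a" dropped
```

With a Polars Series `s`, the same check could be run as `unique_matches_reference(s.to_list(), s.unique().to_list())`, assuming the Series is small enough to materialize as a Python list.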
Installed versions