-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not propagate nulls in clip
bounds
#13714
Comments
Strong opinion that None/null bounds should NOT propagate! Given
Feels suuuuper strange and unexpected. Also just tried some example and there seems to be quite some inconsistencies between pl.DataFrame(
{
"a": [1, 2, 3, 4],
"min": None,
"max": None,
}
).with_columns(
clip_none_none=pl.col("a").clip(None, None),
clip_col_none_none=pl.col("a").clip(
pl.col("min"),
pl.col("max"),
),
clip_min_none=pl.col("a").clip(2, None),
clip_col_min_none=pl.col("a").clip(
pl.col("min").fill_null(2),
pl.col("max"),
),
clip_none_max=pl.col("a").clip(None, 3),
clip_col_none_max=pl.col("a").clip(
pl.col("min"),
pl.col("max").fill_null(3),
),
clip_min_max=pl.col("a").clip(2, 3),
clip_col_min_max=pl.col("a").clip(
pl.col("min").fill_null(2),
pl.col("max").fill_null(3),
),
)
shape: (4, 11)
┌─────┬──────┬──────┬───────────┬───────────────┬──────────┬──────────────┬──────────┬──────────────┬─────────┬─────────────┐
│ a ┆ min ┆ max ┆ none_none ┆ col_none_none ┆ min_none ┆ col_min_none ┆ none_max ┆ col_none_max ┆ min_max ┆ col_min_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ null ┆ null ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╪═══════════╪═══════════════╪══════════╪══════════════╪══════════╪══════════════╪═════════╪═════════════╡
│ 1 ┆ null ┆ null ┆ 1 ┆ null ┆ 2 ┆ null ┆ 1 ┆ null ┆ 2 ┆ 2 │
│ 2 ┆ null ┆ null ┆ 2 ┆ null ┆ 2 ┆ null ┆ 2 ┆ null ┆ 2 ┆ 2 │
│ 3 ┆ null ┆ null ┆ 3 ┆ null ┆ 3 ┆ null ┆ 3 ┆ null ┆ 3 ┆ 3 │
│ 4 ┆ null ┆ null ┆ 4 ┆ null ┆ 4 ┆ null ┆ 3 ┆ null ┆ 3 ┆ 3 │
└─────┴──────┴──────┴───────────┴───────────────┴──────────┴──────────────┴──────────┴──────────────┴─────────┴─────────────┘ |
Not setting a bound is (currently) different from encountering |
My point is that there are inconsistencies when using a literal vs polars literal vs a column in the bounds pl.DataFrame({"x": [0], "min": None}).with_columns(
clip_col=pl.col("x").clip(lower_bound=pl.col("min")),
clip_lit=pl.col("x").clip(lower_bound=None),
clip_li2=pl.col("x").clip(lower_bound=pl.lit(None)),
)
shape: (1, 5)
┌─────┬──────┬──────────┬──────────┬───────────┐
│ x ┆ min ┆ clip_col ┆ clip_lit ┆ clip_lit2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ null ┆ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════════╪══════════╪═══════════╡
│ 0 ┆ null ┆ null ┆ 0 ┆ null │
└─────┴──────┴──────────┴──────────┴───────────┘ The three columns should imo definitely return the same value! |
I would expect lower_bound = None means the same thing as lower_bound = -Inf and upper_bound = None means upper_bound = Inf, just from the math point of view, the bound doesn't present means the variable is unbounded. |
Can we please continue this discussion or get more people involved if necessary? I see 2 problems here:
1) Inconsistentpl.DataFrame({"x": [1, 2, 3]}).with_columns(
plexpr_none=pl.col("x").clip(lower_bound=pl.lit(None)),
python_none=pl.col("x").clip(lower_bound=None),
)
shape: (3, 3)
┌─────┬─────────────┬─────────────┐
│ x ┆ plexpr_none ┆ python_none │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════════════╪═════════════╡
│ 1 ┆ null ┆ 1 │
│ 2 ┆ null ┆ 2 │
│ 3 ┆ null ┆ 3 │
└─────┴─────────────┴─────────────┘ These two variants should imo clearly lead to the same result, but they treat the input completely differently.
This is a big problem because it is inconsistent and super unintuitive from a user perspective, and is different from any other polars function I can think of! 2) Logical
|
However, I do understand the sentiment. |
@arnabanimesh In SQL you have Our consistent principle is that |
It seems it is not the case in mysql atleast: Link If we are doing this, we can have a |
yes! 😃 df = pd.DataFrame({'x': [5, 5, 5, 5]})
s = pd.Series([2, None, 10, None])
df.assign(clipped=df.clip(lower=s, axis=0))
x clipped
0 5 5
1 5 5
2 5 10
3 5 5 |
It is in Postgres and DuckDB. It does appear to have been standardized to propagate nulls in SQL 2023, but I doubt the wisdom of that.
I would not be opposed to this, although with a better name. |
I honestly don't understand how this is consistent and how to explain this to users. If we have
I would argue the user expects Not sure if I am weird here but I seriously don't understand 😨 😆 |
In Python, we often parse We do have a specialized |
If you were to do Overall I'm a bit more tempted to not add |
thanks for the explanation, good to know 😄 👍 |
I also don't think we should propagate nulls. And I also don't think it's worth a parameter. If you want to propagate null bounds you could use pl.when(pl.col("min").is_not_null() | pl.col("max").is_not_null()).then(pl.col("value").clip("min", "max")) So we should change the default behaviour as a breaking change. This will also make |
clip
bounds propagate nulls?clip
bounds
Checks
Reproducible example
Log output
Issue description
If the bound contains a null value, it is currently treated as "bound is unknown, so the result of the operation is unknown".
You could also argue that a null value means "no bound is specified, so the original value must be retained".
I'm not sure about this one - should this be changed?
Expected behavior
Preserve original value?
Installed versions
main
The text was updated successfully, but these errors were encountered: