Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not propagate nulls in clip bounds #13714

Closed
2 tasks done
stinodego opened this issue Jan 14, 2024 · 15 comments · Fixed by #14413
Closed
2 tasks done

Do not propagate nulls in clip bounds #13714

stinodego opened this issue Jan 14, 2024 · 15 comments · Fixed by #14413
Assignees
Labels
accepted Ready for implementation breaking Change that breaks backwards compatibility bug Something isn't working P-medium Priority: medium python Related to Python Polars
Milestone

Comments

@stinodego
Copy link
Member

stinodego commented Jan 14, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"a": [0, 1, 2], "min": [1, None, 1]})
result = df.select(pl.col("a").clip(pl.col("min")))
print(result)

Log output

shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ null │
│ 2    │
└──────┘

Issue description

If the bound contains a null value, it is currently treated as "bound is unknown, so the result of the operation is unknown".

You could also argue that a null value means "no bound is specified, so the original value must be retained".

I'm not sure about this one - should this be changed?

Expected behavior

Preserve original value?

shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 1    │
│ 2    │
└──────┘

Installed versions

main

@stinodego stinodego added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer labels Jan 14, 2024
@Julian-J-S
Copy link
Contributor

Strong opinion that None/null bounds should NOT propagate!

Given [1, 2, 3, 4]

  • clip(lower_bound=2, upper_bound=None) -> [2, 2, 3, 4] fixed lower & don't care about upper bound
  • clip(lower_bound=None, upper_bound=3) -> [1, 2, 3, 3] fixed upper & don't care about lower bound
  • clip(lower_bound=None, upper_bound=None) -> [null, null, null, null] I expect don't care about bounds but "drops" all values?

Feels suuuuper strange and unexpected.

Also just tried some example and there seems to be quite some inconsistencies between clip(min, max) and clip(pl.col('min'), pl.col('max')). They should probably lead to the same results?:

pl.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "min": None,
        "max": None,
    }
).with_columns(
    clip_none_none=pl.col("a").clip(None, None),
    clip_col_none_none=pl.col("a").clip(
        pl.col("min"),
        pl.col("max"),
    ),

    clip_min_none=pl.col("a").clip(2, None),
    clip_col_min_none=pl.col("a").clip(
        pl.col("min").fill_null(2),
        pl.col("max"),
    ),

    clip_none_max=pl.col("a").clip(None, 3),
    clip_col_none_max=pl.col("a").clip(
        pl.col("min"),
        pl.col("max").fill_null(3),
    ),

    clip_min_max=pl.col("a").clip(2, 3),
    clip_col_min_max=pl.col("a").clip(
        pl.col("min").fill_null(2),
        pl.col("max").fill_null(3),
    ),
)

shape: (4, 11)
┌─────┬──────┬──────┬───────────┬───────────────┬──────────┬──────────────┬──────────┬──────────────┬─────────┬─────────────┐
│ aminmaxnone_nonecol_none_nonemin_nonecol_min_nonenone_maxcol_none_maxmin_maxcol_min_max │
│ ---------------------------------         │
│ i64nullnulli64i64i64i64i64i64i64i64         │
╞═════╪══════╪══════╪═══════════╪═══════════════╪══════════╪══════════════╪══════════╪══════════════╪═════════╪═════════════╡
│ 1nullnull1null2null1null22           │
│ 2nullnull2null2null2null22           │
│ 3nullnull3null3null3null33           │
│ 4nullnull4null4null3null33           │
└─────┴──────┴──────┴───────────┴───────────────┴──────────┴──────────────┴──────────┴──────────────┴─────────┴─────────────┘

@stinodego
Copy link
Member Author

stinodego commented Jan 14, 2024

Also just tried some example and there seems to be quite some inconsistencies between clip(min, max) and clip(pl.col('min'), pl.col('max')). They should probably lead to the same results?:

Not setting a bound is (currently) different from encountering null in a bound column. clip(None, None) is a no-op and will give you back the original column.

@Julian-J-S
Copy link
Contributor

Julian-J-S commented Jan 14, 2024

My point is that there are inconsistencies when using a literal vs polars literal vs a column in the bounds

pl.DataFrame({"x": [0], "min": None}).with_columns(
    clip_col=pl.col("x").clip(lower_bound=pl.col("min")),
    clip_lit=pl.col("x").clip(lower_bound=None),
    clip_li2=pl.col("x").clip(lower_bound=pl.lit(None)),
)

shape: (1, 5)
┌─────┬──────┬──────────┬──────────┬───────────┐
│ xminclip_colclip_litclip_lit2 │
│ ---------------       │
│ i64nulli64i64i64       │
╞═════╪══════╪══════════╪══════════╪═══════════╡
│ 0nullnull0null      │
└─────┴──────┴──────────┴──────────┴───────────┘

The three columns should imo definitely return the same value!
None, pl.lit(None) and pl.col(x) which evaluates to None should lead to the same output imo.
All are using an "empty" lower bound

@Sage0614
Copy link

I would expect lower_bound = None means the same thing as lower_bound = -Inf and upper_bound = None means upper_bound = Inf, just from the math point of view, the bound doesn't present means the variable is unbounded.

@Julian-J-S
Copy link
Contributor

Can we please continue this discussion or get more people involved if necessary? I see 2 problems here:

  1. Inconsistent
  2. Logical

1) Inconsistent

pl.DataFrame({"x": [1, 2, 3]}).with_columns(
    plexpr_none=pl.col("x").clip(lower_bound=pl.lit(None)),
    python_none=pl.col("x").clip(lower_bound=None),
)
shape: (3, 3)
┌─────┬─────────────┬─────────────┐
│ xplexpr_nonepython_none │
│ ---------         │
│ i64i64i64         │
╞═════╪═════════════╪═════════════╡
│ 1null1           │
│ 2null2           │
│ 3null3           │
└─────┴─────────────┴─────────────┘

These two variants should imo clearly lead to the same result, but they treat the input completely differently.

  • expressions: None lower/upper bound -> result = null
  • python vars: None lower/upper bound -> result = no bound

This is a big problem because it is inconsistent and super unintuitive from a user perspective, and is different from any other polars function I can think of!

2) Logical

  • non-existent (null/none) bound should mean "open" / infinite. Why should a null bound "delete" the values?
  • there is no good solution if nulls propagate but you do not want that
    .clip(lower_bound=pl.col('min').fill_null(-999999)) which number to choose? python numbers have no bounds.

@arnabanimesh
Copy link
Contributor

min(null,1) should not be 1.

clip bounds should propagate nulls, otherwise it is performing actions which other users may not have asked for. clip should only deal with integers and floats. You can use other functions to handle nulls before or after clipping.

However, I do understand the sentiment. clip can have a strict flag which throws an error if there are nulls or nans in the column prompting the user to handle the situation.

@orlp
Copy link
Collaborator

orlp commented Jan 18, 2024

min(null,1) should not be 1.

@arnabanimesh In SQL you have LEAST and GREATEST functions, and they do in fact behave as LEAST(null, 1) == 1.


Our consistent principle is that null indicates that the data is simply missing. It should be treated as if it doesn't exist. I would thus argue that a null argument to clip indicates no bound. Thus I would say, clip bounds should not propagate nulls, only the main argument should.

@orlp orlp added breaking Change that breaks backwards compatibility P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Jan 18, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 18, 2024
@arnabanimesh
Copy link
Contributor

@arnabanimesh In SQL you have LEAST and GREATEST functions, and they do in fact behave as LEAST(null, 1) == 1.

It seems it is not the case in mysql atleast: Link

If we are doing this, we can have a clip_nulls flag in the clip function and have it default to False so that it is not a breaking change.

@Julian-J-S
Copy link
Contributor

Our consistent principle is that null indicates that the data is simply missing. It should be treated as if it doesn't exist. I would thus argue that a null argument to clip indicates no bound. Thus I would say, clip bounds should not propagate nulls, only the main argument should.

yes! 😃
Pandas also does it like this https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.clip.html
image

df = pd.DataFrame({'x': [5, 5, 5, 5]})
s = pd.Series([2, None, 10, None])
df.assign(clipped=df.clip(lower=s, axis=0))
   x  clipped
0  5        5
1  5        5
2  5       10
3  5        5

@orlp
Copy link
Collaborator

orlp commented Jan 18, 2024

It seems it is not the case in mysql atleast: Link

It is in Postgres and DuckDB. It does appear to have been standardized to propagate nulls in SQL 2023, but I doubt the wisdom of that.

If we are doing this, we can have a clip_nulls flag in the clip function and have it default to False so that it is not a breaking change.

I would not be opposed to this, although with a better name.

@Julian-J-S
Copy link
Contributor

If we are doing this, we can have a clip_nulls flag in the clip function and have it default to False so that it is not a breaking change.

I would not be opposed to this, although with a better name

I honestly don't understand how this is consistent and how to explain this to users.

If we have .clip(lower_bound=...)

  • None -> ignore the bound
  • pl.lit(None) or pl.col("min") -> propagate null

I would argue the user expects None & pl.lit(None) & columns with null to behave the same like in (all?!) other polars functions. If I pass in the value 4 somewhere I expect the same if I enter pl.lit(4) or pl.col("filled_with_4")

Not sure if I am weird here but I seriously don't understand 😨 😆

@stinodego
Copy link
Member Author

stinodego commented Jan 18, 2024

Not sure if I am weird here but I seriously don't understand

In Python, we often parse None input as 'input was not specified'. It is definitely not always the same as lit(None) which is an expression of data type Null.

We do have a specialized no_default input type which is used when None is a valid input value. Maybe it should be used here.

@orlp
Copy link
Collaborator

orlp commented Jan 18, 2024

In Python, we often parse None input as 'input was not specified'. It is definitely not always the same as lit(None) which is an expression of data type Null.

If you were to do clip(None, upper, propagate_nulls=True), then the output would be full-null. I do think indeed as @stinodego mentions we need the no_default argument here if we were to add propagate_nulls.

Overall I'm a bit more tempted to not add propagate_nulls than to add the no_default interface. Users can always use pl.when if necessary.

@Julian-J-S
Copy link
Contributor

In Python, we often parse None input as 'input was not specified'. It is definitely not always the same as lit(None) which is an expression of data type Null.

We do have a specialized no_default input type which is used when None is a valid input value. Maybe it should be used here.

thanks for the explanation, good to know 😄 👍

@stinodego
Copy link
Member Author

I also don't think we should propagate nulls. And I also don't think it's worth a parameter. If you want to propagate null bounds you could use when like:

pl.when(pl.col("min").is_not_null() | pl.col("max").is_not_null()).then(pl.col("value").clip("min", "max"))

So we should change the default behaviour as a breaking change. This will also make None input behave the same as pl.lit(None) so we don't have to change the input parsing.

@stinodego stinodego added this to the 1.0.0 milestone Jan 18, 2024
@stinodego stinodego changed the title Should clip bounds propagate nulls? Do not propagate nulls in clip bounds May 23, 2024
@stinodego stinodego self-assigned this May 23, 2024
@stinodego stinodego moved this from Ready to Blocked in Backlog May 26, 2024
@github-project-automation github-project-automation bot moved this from Blocked to Done in Backlog Jun 4, 2024
@c-peters c-peters added the accepted Ready for implementation label Jun 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation breaking Change that breaks backwards compatibility bug Something isn't working P-medium Priority: medium python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants