Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: Change default offset in group_by_dynamic from 'negative every' to 'zero' #16658

Merged
merged 1 commit into from
Jun 4, 2024

Conversation

MarcoGorelli
Copy link
Collaborator

@MarcoGorelli MarcoGorelli commented Jun 1, 2024

Background

Currently, the start of the first window in group_by_dynamic is determined as follows:

  1. take the earliest datapoint
  2. truncate it by every
  3. apply offset
  4. shift the window backwards by offset until the earliest datapoint is in or in front of it
  5. empty windows are skipped

The for typical case (where every == period), steps 5 and 3 cancel each other out, and the "offset defaults to negative every" just makes the logic harder to teach and understand. I think that the "negative every" default was originally introduced because at first there was no step 4). Now that step 4) exists, then defaulting to -every is no longer necessary to ensure that the earliest datapoint is included in at least one window.

Impact

For the "base case" (where period == every), this makes no difference

The only case when changing the default of offset makes a difference is if period is greater than every, or if closed='both'. And in that case, I'd say that the difference is for the better. I'll show an example - suppose you start with

import polars as pl
from datetime import date
df = pl.DataFrame({'ts': [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)], 'value': [1,2,3]})
df.group_by_dynamic('ts', every='1d', period='2d').agg('value')

You might expect that the windows will be:

  • [2020-01-01, 2020-01-03)
  • [2020-01-02, 2020-01-04)
  • [2020-01-03, 2020-01-05)

But actually, because offset defaults to "negative every", the result includes an extra

  • [2020-12-31, 2020-01-02)
    at the front, and it doesn't get skipped because period > every and so it contains the first datapoint.

Furthermore, if we look at the failing tests, then the new output matches the originally-reported expected output (when provided):

Summary

  • for most users, this PR changes nothing
  • for those affected, the change in this PR aligns things with users' original expectations

Example

Before:

>>> from datetime import date
>>> df = pl.DataFrame({
...     "ts": [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)],
...     "value": [1, 2, 3],
... })
>>> df.group_by_dynamic("ts", every="1d", period="2d").agg("value")
shape: (4, 2)
┌────────────┬───────────┐
│ ts         ┆ value     │
│ ---        ┆ ---       │
│ date       ┆ list[i64] │
╞════════════╪═══════════╡
│ 2019-12-31 ┆ [1]       │
│ 2020-01-01 ┆ [1, 2]    │
│ 2020-01-02 ┆ [2, 3]    │
│ 2020-01-03 ┆ [3]       │
└────────────┴───────────┘

After:

>>> df.group_by_dynamic("ts", every="1d", period="2d").agg("value")
shape: (3, 2)
┌────────────┬───────────┐
│ ts         ┆ value     │
│ ---        ┆ ---       │
│ date       ┆ list[i64] │
╞════════════╪═══════════╡
│ 2020-01-01 ┆ [1, 2]    │
│ 2020-01-02 ┆ [2, 3]    │
│ 2020-01-03 ┆ [3]       │
└────────────┴───────────┘

@github-actions github-actions bot added breaking Change that breaks backwards compatibility enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jun 1, 2024
@MarcoGorelli MarcoGorelli force-pushed the break-group-by-dynamic branch from 6d7f7c3 to 21ad10d Compare June 1, 2024 15:59
Copy link

codecov bot commented Jun 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.52%. Comparing base (b0845e3) to head (21ad10d).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16658      +/-   ##
==========================================
- Coverage   81.52%   81.52%   -0.01%     
==========================================
  Files        1414     1414              
  Lines      186398   186398              
  Branches     3014     3014              
==========================================
- Hits       151957   151953       -4     
- Misses      33911    33914       +3     
- Partials      530      531       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MarcoGorelli MarcoGorelli added the do not merge This pull requests should not be merged right now label Jun 1, 2024
@MarcoGorelli
Copy link
Collaborator Author

MarcoGorelli commented Jun 1, 2024

I just talked this over with an ml engineer, and it should be possible to deprecate this in such as way that it's not a sudden breaking change, and won't be too annoying to most users:

  • in the base case when someone uses group_by_dynamic and only specifies every, don't emit any warning
  • if closed='both' or closed='right' or period!=None, then emit a warning that the default of offset will change, and tell them to specify it to silence the warning

@Julian-J-S
Copy link
Contributor

yes, please! 😎

  • much more intuitive & easier to explain
  • probably almost always what people expect

@alexander-beedie
Copy link
Collaborator

I just talked this over with an ml engineer, and it should be possible to deprecate this in such as way that it's not a sudden breaking change, and won't be too annoying to most users:

  • in the base case when someone uses group_by_dynamic and only specifies every, don't emit any warning

  • if closed='both' or closed='right' or period!=None, then emit a warning that the default of offset will change, and tell them to specify it to silence the warning

Sounds good! 👌

@MarcoGorelli
Copy link
Collaborator Author

Thanks both for your feedback!

I was slightly torn on how to go about this one:

  • on the one hand, I'm all for following process and deprecation cycles
  • on the other hand, warnings that defaults are going to change are pretty annoying, and the current default does seem unexpected

Still though:

Special cases aren't special enough to break the rules.

so...will make a separate PR to start the deprecation cycle

@ritchie46
Copy link
Member

so...will make a separate PR to start the deprecation cycle

I think that if we want to do this, we just should now. Rather sooner, than later.

@MarcoGorelli MarcoGorelli marked this pull request as ready for review June 4, 2024 08:04
@MarcoGorelli MarcoGorelli added this to the 1.0.0 milestone Jun 4, 2024
@ritchie46 ritchie46 merged commit 420233b into pola-rs:main Jun 4, 2024
15 checks passed
@stinodego stinodego changed the title feat!: change default of offset in group_by_dynamic from "negative every" to "zero" feat!: Change default of offset in group_by_dynamic from "negative every" to "zero" Jun 11, 2024
@stinodego stinodego changed the title feat!: Change default of offset in group_by_dynamic from "negative every" to "zero" feat!: Change default offset in group_by_dynamic from "negative every" to "zero" Jun 12, 2024
@stinodego stinodego changed the title feat!: Change default offset in group_by_dynamic from "negative every" to "zero" feat!: Change default offset in group_by_dynamic from 'negative every' to 'zero' Jun 12, 2024
@stinodego stinodego removed the do not merge This pull requests should not be merged right now label Jun 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Change that breaks backwards compatibility enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants