rolling_mean returns the wrong value when there are too many rows #21358
thanks @Jack0Chan - to expedite resolution, could you make a reproducible example please? https://matthewrocklin.com/minimal-bug-reports.html
Having downloaded your data, it seems that the change happens between taking the last 3,619 rows and the last 3,618 rows - and that's where there's a large negative number:

```
In [107]: df[-3618:]
Out[107]:
shape: (3_618, 1)
┌───────────────┐
│ debug_data    │
│ ---           │
│ f64           │
╞═══════════════╡
│ -33802.276428 │
│ 0.704235      │
│ 0.701813      │
│ 0.701813      │
│ 0.701813      │
│ …             │
│ 0.817196      │
│ 0.817196      │
│ 0.817196      │
│ 0.811929      │
│ 0.815339      │
└───────────────┘
```

You're correct that this shouldn't influence the results - the rolling mean for the last row should only depend on the last 100 elements.

Also, this reproduces even without
@MarcoGorelli I recognize this error as being the same one from #21099 (I think). The issue is that the rolling sum/mean kernels operate in O(n) time by maintaining the sum of the current window, incrementing that sum by new values entering the window and decrementing it by values leaving the window at each iteration. The result is that floating point errors accumulate throughout the entire process. This is usually fine, except when we see huge changes in magnitude, as is the case here.

polars/crates/polars-arrow/src/legacy/kernels/rolling/nulls/sum.rs, lines 57 to 91 at fb28c07
Line 77 is where the issue is. This also happens in the no-nulls kernel. Also, in this particular case where you noticed the big change in magnitude from index -3618 to -3619, the magnitudes after the pow(3) are even larger:

```
>>> idx = 6381
>>> s[idx:idx+3]
shape: (3,)
Series: 'debug_data' [f64]
[
    -34321.028632
    -33802.276428
    0.704235
]
>>> s.pow(3)[idx:idx+3]
shape: (3,)
Series: 'debug_data' [f64]
[
    -4.0428e13
    -3.8622e13
    0.349262
]
```

I'm not sure what a good fix here is.
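To illustrate the diagnosis, here is a standalone Python sketch of the running-sum idea (not the actual polars kernel): an add/subtract rolling mean drifts away from a freshly recomputed window mean once a value of the magnitude seen above has passed through the window. The data here is synthetic; only the outlier magnitude is taken from the output above.

```python
import numpy as np

window = 100
rng = np.random.default_rng(0)

# Mostly small values, with one huge-magnitude outlier in the middle,
# similar to the ~ -3.86e13 value produced by pow(3) above.
x = rng.uniform(0.5, 1.0, 10_000)
x[5_000] = -3.8622e13

# Rolling mean via a running sum: add the value entering the window,
# subtract the value leaving it - the O(n) sliding-window approach.
running = 0.0
rolling_last = 0.0
for i, v in enumerate(x):
    running += v
    if i >= window:
        running -= x[i - window]
    rolling_last = running / min(i + 1, window)

# Rolling mean of the last window recomputed directly from the values.
direct_last = x[-window:].mean()

# The two disagree well beyond f64 round-off, even though the outlier
# left the window thousands of iterations ago: the rounding error it
# caused is still baked into the running sum.
print(rolling_last, direct_last)
```

The same mechanism explains why the reported result depends on how many rows precede the last window, while a direct recomputation of that window does not.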
thanks @mcrumiller! Removing "high prio" then - given the size of the values, floating point errors seem expected?
Someone with more experience like @orlp or @ritchie46 should verify my diagnosis, and whether it's reasonable/acceptable for the rolling implementation to differ somewhat substantially from the direct calculation in the face of floating point issues. This is the second time in a short while that people have posted issues concerning precision with the rolling calculation. If it continues to be a point of contention for users it might be worth addressing. One option may be to include a heuristic in the update function that triggers
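For illustration only - this is not the polars implementation, nor necessarily the heuristic being suggested above - one option in that spirit would be to rebuild the window sum from scratch whenever the value leaving the window dwarfs what remains, since that is exactly the point at which the surviving low-order bits are suspect. A minimal Python sketch:

```python
import numpy as np

def rolling_mean_with_recompute(x, window, ratio=1e6):
    """Running-sum rolling mean that falls back to a full recomputation
    of the window sum when precision was likely lost (sketch only:
    ignores nulls, min_samples, weights, etc.)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty(len(x))
    running = 0.0
    for i, v in enumerate(x):
        leaving = x[i - window] if i >= window else 0.0
        running += v - leaving
        # Heuristic trigger: the value that just left the window was far
        # larger than what is left in the sum, so rebuild the sum exactly
        # from the raw values still inside the window.
        if abs(leaving) > ratio * max(abs(running), np.finfo(np.float64).tiny):
            running = float(x[max(0, i - window + 1):i + 1].sum())
        out[i] = running / min(i + 1, window)
    return out
```

Compensated (Kahan/Neumaier) summation in the update step is another standard way to keep the accumulated error bounded, at the cost of a few extra floating point operations per element.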
I have run into a similar problem, where the error in the last rows of the rolling mean/sum result becomes far too large.
thanks @p12r34c56 - indeed, pandas does return the expected result:

```
In [18]: df = pd.read_csv('debug_data.csv')

In [19]: df['debug_data'].pow(3).rolling(100, min_periods=20).mean().tail()
Out[19]:
9567    0.602615
9568    0.601556
9569    0.600425
9570    0.599189
9571    0.598021
Name: debug_data, dtype: float64
```

duckdb:

```
In [25]: duckdb.sql("""
...: with cte as (from df select *, row_number() over () -1 as index)
...: from cte
...: select
...: case when (count(debug_data) over rolling_window)>=20
...: then mean(pow(debug_data, 3)) over rolling_window
...: else null
...: end as debug_data
...: window rolling_window as (order by index rows between 99 preceding and current row)
...: order by index
...: """).pl()
Out[25]:
shape: (9_572, 1)
┌────────────┐
│ debug_data │
│ ---        │
│ f64        │
╞════════════╡
│ null       │
│ null       │
│ null       │
│ null       │
│ null       │
│ …          │
│ 0.602615   │
│ 0.601556   │
│ 0.600425   │
│ 0.599189   │
│ 0.598021   │
└────────────┘
```
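For comparison, the polars expression on the same data would look roughly like the following. The window of 100 with a minimum of 20 samples mirrors the pandas and duckdb queries above; the minimum-samples keyword has been renamed across polars versions, so treat this as a sketch rather than a version-exact call.

```python
import polars as pl

df = pl.read_csv("debug_data.csv")

out = df.select(
    pl.col("debug_data")
    .pow(3)
    # `min_samples` in recent polars releases; older releases call the
    # same parameter `min_periods`.
    .rolling_mean(window_size=100, min_samples=20)
)
print(out.tail())
```

Per the thread above, the tail of this result can differ noticeably from the 0.598021 reference produced by pandas and duckdb because of the accumulated error in the sliding-window kernel.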
Checks
Reproducible example
debug_data.csv
Log output
Issue description
When I call rolling_mean after pow(3) (Exp1), I get a value significantly different from the direct calculation (Exp3) if the number of rows is too large. I used 5_299_200 rows, and I got -4122.788527 instead of the correct value 0.598021. In fact, I got different wrong values for different numbers of rows. Changing pow(3) to pow(2), polars performs as expected.
Expected behavior
Exp1 should have the same result as Exp3.
Installed versions