rolling_mean returns the wrong value when there are too many rows #21358

Closed
2 tasks done
Jack0Chan opened this issue Feb 20, 2025 · 7 comments · Fixed by #21413
Labels
A-rolling, accepted (Ready for implementation), bug (Something isn't working), python (Related to Python Polars)

Comments

@Jack0Chan

Jack0Chan commented Feb 20, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

debug_data.csv

import polars as pl

# 10_000 rows
df = pl.read_csv('/home/guangyao/work/bug_of_polars_rolling/debug_data.csv')

# Exp1: pow3 and rolling mean
rolling_version_wrong = df.select(pl.all().pow(3).rolling_mean(window_size=100, min_samples=20))
print(rolling_version_wrong[-1].item()) # get 0.806879

# Exp2: slice [-1000:], pow3 and rolling mean
rolling_version_correct = df[-1000:].select(pl.all().pow(3).rolling_mean(window_size=100, min_samples=20))
print(rolling_version_correct[-1].item()) # get 0.598021

# Exp3: slice the last 100 rows, pow3 and mean
direct_version = df[-100:].select(pl.all().pow(3).mean())
print(direct_version.item()) # get 0.598021

# Exp4: remove pow3, all get the same value 0.840279 at the last row
rolling_version1 = df.select(pl.all().rolling_mean(window_size=100, min_samples=20, center=False))
rolling_version2 = df[-1000:].select(pl.all().rolling_mean(window_size=100, min_samples=20))
direct_version = df[-100:].select(pl.all().mean())

Log output

0.8068792806886891
0.5980209629812755
0.598020962981293

Issue description

  1. When calculating rolling_mean after pow(3) (Exp1), I get a value significantly different from the direct calculation (Exp3) if the number of rows is too large.
  2. Originally there were 5_299_200 rows, and I got -4122.788527 instead of the correct value 0.598021. In fact, I got different wrong values for different numbers of rows.
  3. If I change pow(3) to pow(2), Polars performs as expected.

Expected behavior

Exp1 should have the same result as Exp3.

Installed versions

--------Version info---------
Polars:              1.19.0
Index type:          UInt32
Platform:            Linux-6.2.0-26-generic-x86_64-with-glibc2.35
Python:              3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                1.35.53
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
google.auth          2.38.0
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.3
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           2.0.32
torch                2.6.0+cu124
xlsx2csv             <not installed>
xlsxwriter           3.2.2
@Jack0Chan added the bug, needs triage, python labels on Feb 20, 2025
@MarcoGorelli
Collaborator

thanks @Jack0Chan - to expedite resolution, could you make a minimal reproducible example please? https://matthewrocklin.com/minimal-bug-reports.html

@MarcoGorelli
Collaborator

MarcoGorelli commented Feb 20, 2025

Having downloaded your data, it seems that the change happens between:

  • df[-3618:].select(pl.all().pow(3).rolling_mean(window_size=100, min_samples=20)).tail(1) # incorrect
  • df[-3617:].select(pl.all().pow(3).rolling_mean(window_size=100, min_samples=20)).tail(1) # correct

And that's where there's a large negative number:

In [107]: df[-3618:]
Out[107]:
shape: (3_618, 1)
┌───────────────┐
│ debug_data    │
│ ---           │
│ f64           │
╞═══════════════╡
│ -33802.276428 │
│ 0.704235      │
│ 0.701813      │
│ 0.701813      │
│ 0.701813      │
│ …             │
│ 0.817196      │
│ 0.817196      │
│ 0.817196      │
│ 0.811929      │
│ 0.815339      │
└───────────────┘

You're correct that this shouldn't influence the results - the rolling mean for the last row should only depend on the last 100 elements.

Also, this reproduces even without min_samples=20.

@MarcoGorelli added the P-high, A-rolling labels and removed the needs triage label on Feb 20, 2025
@mcrumiller
Contributor

mcrumiller commented Feb 20, 2025

@MarcoGorelli I recognize this error as being the same one from #21099 (I think). The issue is that the rolling sum/mean kernels operate in O(n) time by maintaining the sum of the current window, incrementing it by values entering the window and decrementing it by values leaving the window on each iteration. The result is that floating point errors accumulate over the entire process. This is usually fine, except when we see huge changes in magnitude, as is the case here.

unsafe fn update(&mut self, start: usize, end: usize) -> Option<T> {
    // if we exceed the end, we have a completely new window
    // so we recompute
    let recompute_sum = if start >= self.last_end {
        true
    } else {
        // remove elements that should leave the window
        let mut recompute_sum = false;
        for idx in self.last_start..start {
            // SAFETY:
            // we are in bounds
            let valid = self.validity.get_bit_unchecked(idx);
            if valid {
                let leaving_value = self.slice.get_unchecked(idx);
                // if the leaving value is nan we need to recompute the window
                if T::is_float() && !leaving_value.is_finite() {
                    recompute_sum = true;
                    break;
                }
                self.sum = self.sum.map(|v| v - *leaving_value)
            } else {
                // null value leaving the window
                self.null_count -= 1;
                // self.sum is None and the leaving value is None
                // if the entering value is valid, we might get a new sum.
                if self.sum.is_none() {
                    recompute_sum = true;
                    break;
                }
            }
        }
        recompute_sum
    };

The subtraction of the leaving value from self.sum (line 77 in the source) is where the issue is; the same thing happens in the no-nulls kernel.
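
To see why this matters, here is a minimal Python sketch of the effect (not the actual kernel; the magnitude is exaggerated so a short series makes it obvious): a running window sum that adds entering values and subtracts leaving ones keeps the damage done by a huge value long after it has left the window, while recomputing the window directly does not.

import random

random.seed(0)
# one huge value followed by ordinary values, similar in spirit to the cubed data above
values = [1e16] + [random.uniform(0.5, 0.9) for _ in range(199)]
window = 100

# running-sum approach, as in the kernel: add entering values, subtract leaving values
running = sum(values[:window])
for i in range(window, len(values)):
    running += values[i]            # value entering the window
    running -= values[i - window]   # value leaving the window

# direct recomputation of the last window
direct = sum(values[-window:])

print(running / window)  # nowhere near 0.7: the small values were absorbed while 1e16 sat in the sum
print(direct / window)   # ~0.7, the expected mean of the last 100 values

Only non-finite leaving values trigger a recompute in the kernel (the is_finite check above), so a large but finite value like the ~-3.9e13 cubed values in this dataset silently poisons the running sum.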

Also--in this particular case, where you noticed the big change in magnitude from index -3618 to -3619, after the pow(3) that change is orders of magnitude larger:

>>> s = df["debug_data"]  # (definition assumed; the debug_data column as a Series)
>>> idx = 6381
>>> s[idx:idx+3]
shape: (3,)
Series: 'debug_data' [f64]
[
        -34321.028632
        -33802.276428
        0.704235
]
>>> s.pow(3)[idx:idx+3]
Series: 'debug_data' [f64]
[
        -4.0428e13
        -3.8622e13
        0.349262
]

I'm not sure what a good fix is here.

@MarcoGorelli removed the P-high label on Feb 20, 2025
@MarcoGorelli
Collaborator

thanks @mcrumiller !

Removing "high prio" then; given the size of the values, floating point errors seem expected?

@mcrumiller
Contributor

mcrumiller commented Feb 20, 2025

Someone with more experience like @orlp or @ritchie46 should verify my diagnosis, and whether it's reasonable/acceptable for the rolling implementation to differ somewhat substantially from the direct calculation in the face of floating point issues.

This is the second time in a short while that people have posted issues concerning precision in the rolling calculations. If it continues to be a point of contention for users, it might be worth addressing. One option may be to include a heuristic in the update function that triggers recompute_sum = true when the leaving_value and the current sum differ enough in magnitude that these issues may arise.
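
A rough sketch of such a heuristic, written here in Python for illustration (the Rust kernel would do the equivalent check before subtracting the leaving value; the threshold is an arbitrary assumption, not anything Polars currently exposes):

import math

# Hypothetical helper, not part of Polars: decide whether to rebuild the window
# sum from scratch instead of subtracting the leaving value from the running sum.
def should_recompute(leaving_value: float, current_sum: float, rel_threshold: float = 1e7) -> bool:
    if not math.isfinite(leaving_value):
        return True  # the kernel already recomputes for nan/inf
    remaining = current_sum - leaving_value
    # if the leaving value dwarfs what remains in the window, most digits of
    # `remaining` are accumulated rounding error, so a recompute is warranted
    return abs(leaving_value) > rel_threshold * abs(remaining)

For the data here (a leaving value around -3.9e13 against a remaining window sum of roughly 60-80) this would trigger a recompute, while for well-scaled data it would essentially never fire, so the kernel stays O(n) in practice at a cost of O(window_size) per recompute.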

@p12r34c56

I have run into a similar problem where the error in the last rows of a rolling mean/sum result becomes far too large.
For reference, pandas uses Kahan summation for its rolling.sum/mean aggregations, which looks to me like an easy fix with minimal performance impact, though I am not sure exactly how rolling mean is implemented in Polars.
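
For context, Kahan (compensated) summation carries a separate correction term so that low-order bits lost when adding a small value to a large accumulator are recovered on later updates. A minimal Python sketch of a rolling sum built on it (illustrative only; this is neither pandas' nor Polars' actual implementation):

def rolling_sum_kahan(values, window):
    total, comp = 0.0, 0.0  # running sum and its compensation term

    def add(x):
        nonlocal total, comp
        y = x - comp            # apply the correction carried over from previous steps
        t = total + y           # big + small: low-order bits of y may be lost here
        comp = (t - total) - y  # capture what was lost, to apply on the next update
        total = t

    out = []
    for i, v in enumerate(values):
        add(v)                        # value entering the window
        if i >= window:
            add(-values[i - window])  # value leaving the window
        out.append(total if i >= window - 1 else None)
    return out

Dividing each entry by the window size gives the rolling mean; the extra cost is a few additions per element, which is presumably the "minimal performance impact" mentioned above.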

@MarcoGorelli
Collaborator

thanks @p12r34c56 - indeed, pandas does return 0.598021:

In [18]: df = pd.read_csv('debug_data.csv')

In [19]: df['debug_data'].pow(3).rolling(100, min_periods=20).mean().tail()
Out[19]:
9567    0.602615
9568    0.601556
9569    0.600425
9570    0.599189
9571    0.598021
Name: debug_data, dtype: float64

duckdb:

In [25]: duckdb.sql("""
    ...: with cte as (from df select *, row_number() over () -1 as index)
    ...: from cte
    ...: select
    ...:     case when (count(debug_data) over rolling_window)>=20
    ...:          then mean(pow(debug_data, 3)) over rolling_window
    ...:          else null
    ...:          end as debug_data
    ...: window rolling_window as (order by index rows between 99 preceding and current row)
    ...: order by index
    ...: """).pl()
Out[25]:
shape: (9_572, 1)
┌────────────┐
│ debug_data │
│ ---        │
│ f64        │
╞════════════╡
│ null       │
│ null       │
│ null       │
│ null       │
│ null       │
│ …          │
│ 0.602615   │
│ 0.601556   │
│ 0.600425   │
│ 0.599189   │
│ 0.598021   │
└────────────┘
