rolling_means gives negative values from polars for some reason #11146

kennyzli · 2023-09-16T00:05:33Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I have other examples with larger data sets, but this is what remains after stripping out most of the code, and it reproduces with a very small data set. I can't tell why rolling_mean gives negative values when we use it to compute the mean of our training data, especially when there are leading 0s and trailing 0s.

        import polars as pl
        data_frame = pl.DataFrame(
            {
                'value': [
                    0,
                    290.57,
                    107,
                    172,
                    124.25,
                    304,
                    379.5,
                    347.35,
                    1516.41,
                    386.12,
                    226.5,
                    294.62,
                    125.5,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                ]
            }
        )
        result_df = data_frame.with_columns(
            pl.col('value').rolling_mean(window_size=8, min_periods=1).alias('test_col')
        )
        assert (result_df.filter(pl.col('test_col') < 0).shape[0] == 0)

Log output

The assertion fails because the column contains a negative value. Printing the filtered data frame shows the following:

result_df.filter(pl.col('test_col') < 0)
shape: (1, 2)
┌───────┬─────────────┐
│ value ┆ test_col    │
│ ---   ┆ ---         │
│ f64   ┆ f64         │
╞═══════╪═════════════╡
│ 0.0   ┆ -5.6843e-14 │
└───────┴─────────────┘

Issue description

The rolling_mean method gives negative values for some special data, especially when there are leading 0s and trailing 0s.

Expected behavior

all 0 or positive values

Installed versions

Python 3.8.10 (default, Nov  4 2022, 16:37:45)
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.show_versions()
--------Version info---------
Polars:      0.18.4
Index type:  UInt32
Platform:    macOS-13.5-arm64-arm-64bit
Python:      3.8.10 (default, Nov  4 2022, 16:37:45)
[Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
numpy:       1.22.4
pandas:      2.0.3
pyarrow:     12.0.1
connectorx:  0.3.1
deltalake:   0.8.1
fsspec:      2022.7.1
matplotlib:  <not installed>
xlsx2csv:    0.8.1
xlsxwriter:  3.1.0


reswqa · 2023-09-16T17:27:53Z

A simpler reproduction:

pl.Series( [1.51, 1, 0, 0]).rolling_sum(window_size=2, min_periods=1)

Series: '' [f64]
[
	1.51
	2.51
	1.0
	-2.2204e-16
]

This may be related to the accuracy of floating-point calculations.

Julian-J-S · 2023-09-18T09:31:15Z

interesting "bug", good catch!

My first thought is of course the classic floating point problem.
BUT 0 (zero) has an exact representation, which makes this more interesting.

Probably some performance optimization that uses previous calculations in following ones.
This will cause floating point inaccuracies to be carried throughout all following calculations.

ser = pl.Series(
    [
        0,
        1,
        1.51,  # no exact binary representation! (nearest f64: 1.51000000000000000888...; 1.5099999904632568359375 is the nearest f32)
        1,
        0,
        0,
    ]
).rolling_mean(window_size=2, min_periods=1)
[
	0.0
	0.5
	1.255
	1.255
	0.5
	-1.1102e-16
]

for val in ser:
    print(f'{val:.20f}')
0.00000000000000000000  # correct
0.50000000000000000000  # correct
1.25499999999999989342  # floating point error
1.25499999999999989342  # floating point error
0.49999999999999988898  # NOT correct (1 + 0) / 2 (should be precise)
-0.00000000000000011102  # NOT correct (0 + 0) / 2 (should be precise)

Is this an example where the performance improvements justify the small loss in accuracy?
I would love to know whether this is a performance optimization and how much it gains (or is it maybe an actual bug?)
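The digits quoted in the code comment above can be checked with the standard library alone. The following is a minimal sketch (the variable name `as_f32` is only illustrative): it shows that 1.5099999904632568359375 is the nearest Float32 to 1.51, while the Series above is Float64 by default and therefore stores a different, much closer value.

```python
import struct
from decimal import Decimal

# Round-trip 1.51 through a 4-byte IEEE 754 float to get the nearest Float32.
as_f32 = struct.unpack("f", struct.pack("f", 1.51))[0]

print(format(as_f32, ".22f"))  # 1.5099999904632568359375
print(Decimal(1.51))           # exact decimal expansion of the Float64 value
```

Either way, 1.51 is not exactly representable, so any sum that includes it is rounded.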

ritchie46 · 2023-09-19T08:11:25Z

I am inclined to say this may be expected from floating point arithmetic. We could, however, look into whether we can improve numerical stability without paying (too much) in performance.

Bidek56 · 2023-09-20T12:51:14Z

This is how floating point arithmetic works. :) This is normal.
If you don't like the trailing digits at the end, define the series using pl.Float32, but you will lose precision.

Julian-J-S · 2023-09-20T13:26:00Z

> This is how floating point arithmetic works. :) This is normal. If you don't like the trailing digits at the end, define the series using pl.Float32, but you will lose precision.

This is not entirely true! ;)

(0.0 + 0.0) / 2.0 is exactly 0.0 using floats
(0.0 + 1.0) / 2.0 is exactly 0.5 using floats

I assume the inaccuracy comes from some kind of optimization.
Let's assume your window size is 100: you don't need to calculate the sum of 100 values every time and divide by 100. Instead, on every row you can subtract the value that leaves the window and add the new one. But does this necessarily lead to inaccuracies??

I am asking because pandas does not have this "problem"!

import pandas as pd

ser = pd.Series(
    data=[
        0,
        1,
        1.51,  # no exact binary representation! (nearest f64: 1.51000000000000000888...; 1.5099999904632568359375 is the nearest f32)
        1,
        0,
        0,
    ]
)

ser.rolling(window=2).mean().apply(lambda x: f'{x:.20f}')
0                       nan
1    0.50000000000000000000
2    1.25499999999999989342
3    1.25499999999999989342
4    0.50000000000000000000
5    0.00000000000000000000
dtype: object
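The optimization guessed at above (keep one running total, add the entering value, subtract the leaving one) can be sketched in plain Python. This is only an assumption about the mechanism, not the actual Polars kernel, and both function names are hypothetical. It does reproduce the values reported earlier in the thread for `[1.51, 1, 0, 0]`: the rounding error made when computing 1.51 + 1.0 stays in the running total after 1.51 and 1.0 are subtracted again.

```python
def rolling_sum_recompute(values, window):
    """Sum each window from scratch (no state carried between rows)."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out.append(sum(values[start:i + 1]))
    return out


def rolling_sum_running(values, window):
    """Keep one running total: add the entering value, subtract the leaving one."""
    out = []
    total = 0.0
    for i, x in enumerate(values):
        total += x
        if i >= window:
            total -= values[i - window]
        out.append(total)
    return out


data = [1.51, 1.0, 0.0, 0.0]
recomputed = rolling_sum_recompute(data, 2)
running = rolling_sum_running(data, 2)

print(recomputed[-1])  # 0.0
print(running[-1])     # -2.220446049250313e-16
```

So the running update does not have to be wrong in general, but once a rounded sum enters the total, later subtractions cannot remove the rounding error, even when every value in the current window is exactly 0.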

Bidek56 · 2023-09-20T13:49:03Z

It's probably related to float32 vs. float64. When I defined the series using pl.Float32, I get results similar to Pandas.

kennyzli · 2023-09-20T22:38:17Z

I don't think float32 matters. Here is the test:

>>> data_frame.with_columns( pl.col('value').cast(pl.Float32).rolling_mean(window_size=8, min_periods=1).alias('test_col'))
shape: (21, 2)
┌────────┬────────────┐
│ value  ┆ test_col   │
│ ---    ┆ ---        │
│ f64    ┆ f32        │
╞════════╪════════════╡
│ 0.0    ┆ 0.0        │
│ 290.57 ┆ 145.285004 │
│ 107.0  ┆ 132.523331 │
│ 172.0  ┆ 142.392502 │
│ …      ┆ …          │
│ 0.0    ┆ 80.827484  │
│ 0.0    ┆ 52.514984  │
│ 0.0    ┆ 15.687485  │
│ 0.0    ┆ -0.000015  │
└────────┴────────────┘

>>> data_frame.with_columns( pl.col('value').rolling_mean(window_size=8, min_periods=1).cast(pl.Float32).alias('test_col'))
shape: (21, 2)
┌────────┬─────────────┐
│ value  ┆ test_col    │
│ ---    ┆ ---         │
│ f64    ┆ f32         │
╞════════╪═════════════╡
│ 0.0    ┆ 0.0         │
│ 290.57 ┆ 145.285004  │
│ 107.0  ┆ 132.523331  │
│ 172.0  ┆ 142.392502  │
│ …      ┆ …           │
│ 0.0    ┆ 80.827499   │
│ 0.0    ┆ 52.514999   │
│ 0.0    ┆ 15.6875     │
│ 0.0    ┆ -5.6843e-14 │
└────────┴─────────────┘

Bidek56 · 2023-09-21T00:48:23Z

This code works fine for me with pl.Float32 on Polars 0.19.3 on macOS:

ser = pl.Series("foo",
    [
        0,
        1,
        1.51,  # no precise float representation! (actual value: 1.5099999904632568359375)
        1,
        0,
        0,
    ], pl.Float32
).rolling_mean(window_size=2, min_periods=1)

I get:

shape: (6,)
Series: 'foo' [f32]
[
	0.0
	0.5
	1.255
	1.255
	0.5
	0.0
]

oilandwater · 2023-11-17T10:01:46Z

The same thing happens when I use rolling_sum.
This can cause errors when you have to do additional calculations with the results of rolling_sum or rolling_mean.
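On versions that still show this behaviour, downstream code that requires non-negative input (a square root or a logarithm, say) could snap values that are within a tolerance of zero to exactly 0.0 before using them. The sketch below is one possible workaround, not something proposed in this thread: the helper name `snap_near_zero` and the tolerance `1e-9` are assumptions, and the tolerance has to be chosen relative to the scale of the data.

```python
import math


def snap_near_zero(values, abs_tol=1e-9):
    """Replace values within abs_tol of zero by 0.0; leave all others untouched."""
    return [0.0 if math.isclose(v, 0.0, abs_tol=abs_tol) else v for v in values]


# Tail of the rolling_mean column from the original report.
rolled = [80.827499, 52.514999, 15.6875, -5.6843e-14]
cleaned = snap_near_zero(rolled)

print(cleaned)  # [80.827499, 52.514999, 15.6875, 0.0]
```

This hides the symptom only; a genuinely negative mean larger than the tolerance is left as it is.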

kennyzli added the bug (Something isn't working) and python (Related to Python Polars) labels Sep 16, 2023

stinodego added the needs triage (Awaiting prioritization by a maintainer) label Jan 13, 2024

ritchie46 mentioned this issue Feb 23, 2025

fix: Use Kahan summation for rolling sum kernels. Fix numerical stability issues #21413

Merged

ritchie46 closed this as completed in #21413 Feb 23, 2025
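The fix referenced above names Kahan (compensated) summation. As a rough illustration of that general technique, and explicitly not the code merged in #21413, the sketch below keeps a second variable holding the low-order bits lost by each addition and feeds them back into the next one. The function name `rolling_sum_kahan` is hypothetical, and removing a value is modelled here simply as adding its negation.

```python
def rolling_sum_kahan(values, window):
    """Rolling sum with a Kahan-compensated running total (illustrative only)."""
    out = []
    total = 0.0
    comp = 0.0  # low-order bits lost by previous additions

    def add(x):
        nonlocal total, comp
        y = x - comp            # apply the correction carried from earlier steps
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was lost (negated)
        total = t

    for i, x in enumerate(values):
        if i >= window:
            add(-values[i - window])  # value leaving the window
        add(x)                        # value entering the window
        out.append(total)
    return out


result = rolling_sum_kahan([1.51, 1.0, 0.0, 0.0], 2)
print(result[-1])  # 0.0
```

For the reproduction from this thread the last window now sums to exactly 0.0. Compensated summation reduces accumulated error but is not a guarantee of exact results for every input.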

c-peters assigned ritchie46 Feb 24, 2025

c-peters added the accepted (Ready for implementation) label Feb 24, 2025

c-peters added this to Backlog Feb 24, 2025

c-peters moved this to Done in Backlog Feb 24, 2025
