rolling_means gives negative values from polars for some reason #11146

Closed
2 tasks done
kennyzli opened this issue Sep 16, 2023 · 9 comments · Fixed by #21413
Labels: accepted (ready for implementation), bug (something isn't working), needs triage (awaiting prioritization by a maintainer), python (related to Python Polars)

Comments

@kennyzli

kennyzli commented Sep 16, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I have other examples with larger data sets, but this is what remains after stripping out most of the code and reducing the data to a very small set. I can't tell why rolling_mean gives negative values when we use it to calculate the mean of our training data, especially when there are leading 0s and trailing 0s.

        import polars as pl
        data_frame = pl.DataFrame(
            {
                'value': [
                    0,
                    290.57,
                    107,
                    172,
                    124.25,
                    304,
                    379.5,
                    347.35,
                    1516.41,
                    386.12,
                    226.5,
                    294.62,
                    125.5,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                ]
            }
        )
        result_df = data_frame.with_columns(
            pl.col('value').rolling_mean(window_size=8, min_periods=1).alias('test_col')
        )
        assert (result_df.filter(pl.col('test_col') < 0).shape[0] == 0)

Log output

The assertion fails because the result contains a negative value. Printing the filtered data frame shows the following:

result_df.filter(pl.col('test_col') < 0)
shape: (1, 2)
┌───────┬─────────────┐
│ value ┆ test_col    │
│ ---   ┆ ---         │
│ f64   ┆ f64         │
╞═══════╪═════════════╡
│ 0.0   ┆ -5.6843e-14 │
└───────┴─────────────┘

Issue description

The rolling_mean method gives negative values for some special data cases, especially when there are leading 0s and trailing 0s.

Expected behavior

All values should be 0 or positive.

Installed versions

Python 3.8.10 (default, Nov  4 2022, 16:37:45)
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.show_versions()
--------Version info---------
Polars:      0.18.4
Index type:  UInt32
Platform:    macOS-13.5-arm64-arm-64bit
Python:      3.8.10 (default, Nov  4 2022, 16:37:45)
[Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
numpy:       1.22.4
pandas:      2.0.3
pyarrow:     12.0.1
connectorx:  0.3.1
deltalake:   0.8.1
fsspec:      2022.7.1
matplotlib:  <not installed>
xlsx2csv:    0.8.1
xlsxwriter:  3.1.0
@kennyzli added the bug and python labels on Sep 16, 2023
@reswqa
Collaborator

reswqa commented Sep 16, 2023

A simpler reproduction:

pl.Series( [1.51, 1, 0, 0]).rolling_sum(window_size=2, min_periods=1)
Series: '' [f64]
[
	1.51
	2.51
	1.0
	-2.2204e-16
]

This may be related to the accuracy of floating-point calculations.
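
One plausible mechanism, sketched in plain Python rather than taken from the Polars source: if the rolling sum is updated incrementally (add the entering value, subtract the leaving one) instead of being recomputed for every window, the rounding error from `1.51 + 1` stays in the running total after `1.51` has left the window. The function name `rolling_sum_incremental` is hypothetical and only illustrates the idea; `min_periods=1` behaviour is assumed.

```python
def rolling_sum_incremental(values, window):
    """Rolling sum that reuses the previous total (assumed update scheme)."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v                        # value entering the window
        if i >= window:
            running -= values[i - window]   # value leaving the window
        out.append(running)
    return out


result = rolling_sum_incremental([1.51, 1.0, 0.0, 0.0], window=2)
print(result)
# The last window is [0.0, 0.0], yet the running total ends up slightly
# below zero because the earlier rounding error was never removed.
print(result[-1] < 0)
```

With this input the final element comes out as `-2**-52`, which matches the `-2.2204e-16` shown above.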

@Julian-J-S
Contributor

interesting "bug", good catch!

My first thought is, of course, the classic floating-point problem.
BUT 0 (zero) has an exact representation, which makes this more interesting.

Probably some performance optimization reuses previous calculations in the following ones.
That would cause floating-point inaccuracies to be carried through all subsequent calculations.

ser = pl.Series(
    [
        0,
        1,
        1.51,  # no exact float representation! (stored as float64: about 1.5100000000000000089)
        1,
        0,
        0,
    ]
).rolling_mean(window_size=2, min_periods=1)
[
	0.0
	0.5
	1.255
	1.255
	0.5
	-1.1102e-16
]

for val in ser:
    print(f'{val:.20f}')
0.00000000000000000000  # correct
0.50000000000000000000  # correct
1.25499999999999989342  # floating point error
1.25499999999999989342  # floating point error
0.49999999999999988898  # NOT correct (1 + 0) / 2 (should be precise)
-0.00000000000000011102  # NOT correct (0 + 0) / 2 (should be precise)

Is this a case where the performance improvement justifies the small loss in accuracy?
I would love to know whether this is a performance optimization and how much it improves things (or whether it is an actual bug).
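
The exact value stored for `1.51` depends on the dtype, which matters when comparing float64 and float32 results in this thread. A small standard-library sketch for checking both (the helper names `exact_f64` and `exact_f32` are made up for illustration):

```python
import struct
from decimal import Decimal


def exact_f64(x: float) -> Decimal:
    """Exact decimal expansion of x as stored in a float64 (Python float)."""
    return Decimal(x)


def exact_f32(x: float) -> Decimal:
    """Exact decimal expansion of x after rounding to float32."""
    (rounded,) = struct.unpack("f", struct.pack("f", x))
    return Decimal(rounded)


print(exact_f64(1.51))  # slightly above 1.51
print(exact_f32(1.51))  # slightly below 1.51
```

The float32 rounding lands just below 1.51 and the float64 rounding just above it, so the two dtypes accumulate different errors in a rolling computation.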

@ritchie46
Member

I am inclined to say this may be expected from floating point arithmetic. We could however take a look if we can improve numerical stability without paying (too much) in performance.

@Bidek56
Contributor

Bidek56 commented Sep 20, 2023

This is how floating point arithmetic works. :) This is normal.
If you don't like the trailing numbers at the end, define the series using pl.Float32, but you will lose precision.

@Julian-J-S
Contributor

“This is how floating point arithmetic works. :) This is normal. If you don't like the trailing numbers at the end, define the series using pl.Float32, but you will lose precision.”

This is not entirely true! ;)

  • (0.0 + 0.0) / 2.0 is exactly 0.0 using floats
  • (0.0 + 1.0) / 2.0 is exactly 0.5 using floats
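
The two bullets can be checked by recomputing each window from scratch, which is the straightforward (and slower) way to get a rolling mean. This is a plain-Python sketch with a hypothetical helper name, assuming `min_periods=1` semantics; it is not how any particular library implements it.

```python
def rolling_mean_recomputed(values, window):
    """Rolling mean where every window is summed independently."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


means = rolling_mean_recomputed([0.0, 1.0, 1.51, 1.0, 0.0, 0.0], window=2)
for m in means:
    print(f"{m:.20f}")
```

Because no state is shared between windows, the last two results are exactly `0.5` and `0.0`.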

I assume the inaccuracy comes from some kind of optimization.
Let's assume your window size is 100: you don't need to calculate the sum of 100 values every time and divide by 100. Instead, on every row you can subtract the value that leaves the window and add the new one. But does this necessarily lead to inaccuracies?
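
An incremental update does not have to lose accuracy. One common mitigation is compensated (Kahan) summation, which tracks the rounding error of each addition and feeds it back into the next one. The sketch below is an illustration under that assumption, with a hypothetical function name; it is not claimed to be what Polars or pandas actually do.

```python
def rolling_sum_kahan(values, window):
    """Incremental rolling sum with Kahan compensation (illustrative)."""
    out = []
    total = 0.0
    comp = 0.0  # running estimate of the error accumulated in `total`

    def add(x):
        nonlocal total, comp
        y = x - comp            # correct the incoming value by the known error
        t = total + y
        comp = (t - total) - y  # error introduced by this addition
        total = t

    for i, v in enumerate(values):
        add(v)                        # value entering the window
        if i >= window:
            add(-values[i - window])  # value leaving the window
        out.append(total)
    return out


print(rolling_sum_kahan([1.51, 1.0, 0.0, 0.0], window=2))
```

For this particular input the last two sums come out as exactly `1.0` and `0.0`, so the update stays O(1) per row without the negative residue. Compensation reduces the error in general but is not a guarantee of exact results for every input.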

I am asking because pandas does not have this "problem"!

import pandas as pd

ser = pd.Series(
    data=[
        0,
        1,
        1.51,  # no exact float representation! (stored as float64: about 1.5100000000000000089)
        1,
        0,
        0,
    ]
)

ser.rolling(window=2).mean().apply(lambda x: f'{x:.20f}')
0                       nan
1    0.50000000000000000000
2    1.25499999999999989342
3    1.25499999999999989342
4    0.50000000000000000000
5    0.00000000000000000000
dtype: object

@Bidek56
Contributor

Bidek56 commented Sep 20, 2023

It's probably related to float32 vs. float64. When I define the series using pl.Float32, I get results similar to pandas.

@kennyzli
Author

kennyzli commented Sep 20, 2023

I don't think float32 matters. Here is the test:

>>> data_frame.with_columns( pl.col('value').cast(pl.Float32).rolling_mean(window_size=8, min_periods=1).alias('test_col'))
shape: (21, 2)
┌────────┬────────────┐
│ value  ┆ test_col   │
│ ---    ┆ ---        │
│ f64    ┆ f32        │
╞════════╪════════════╡
│ 0.0    ┆ 0.0        │
│ 290.57 ┆ 145.285004 │
│ 107.0  ┆ 132.523331 │
│ 172.0  ┆ 142.392502 │
│ …      ┆ …          │
│ 0.0    ┆ 80.827484  │
│ 0.0    ┆ 52.514984  │
│ 0.0    ┆ 15.687485  │
│ 0.0    ┆ -0.000015  │
└────────┴────────────┘

>>> data_frame.with_columns( pl.col('value').rolling_mean(window_size=8, min_periods=1).cast(pl.Float32).alias('test_col'))
shape: (21, 2)
┌────────┬─────────────┐
│ value  ┆ test_col    │
│ ---    ┆ ---         │
│ f64    ┆ f32         │
╞════════╪═════════════╡
│ 0.0    ┆ 0.0         │
│ 290.57 ┆ 145.285004  │
│ 107.0  ┆ 132.523331  │
│ 172.0  ┆ 142.392502  │
│ …      ┆ …           │
│ 0.0    ┆ 80.827499   │
│ 0.0    ┆ 52.514999   │
│ 0.0    ┆ 15.6875     │
│ 0.0    ┆ -5.6843e-14 │
└────────┴─────────────┘

@Bidek56
Contributor

Bidek56 commented Sep 21, 2023

This code works fine for me with pl.Float32 on Polars 0.19.3 on macOS:

ser = pl.Series("foo",
    [
        0,
        1,
        1.51,  # no precise float representation! (actual value: 1.5099999904632568359375)
        1,
        0,
        0,
    ], pl.Float32
).rolling_mean(window_size=2, min_periods=1)

I get:

shape: (6,)
Series: 'foo' [f32]
[
	0.0
	0.5
	1.255
	1.255
	0.5
	0.0
]

@oilandwater

The same thing happens when I use rolling_sum.
This can cause errors when you have to do additional calculations with the results of rolling_sum or rolling_mean.
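
As a concrete illustration of such a downstream failure, taking a square root of a rolling result that should be zero but is `-2.2204e-16` raises an error in plain Python. A defensive sketch follows; the helper name `safe_sqrt` and the tolerance `1e-9` are assumptions chosen for the example, not values from this thread.

```python
import math


def safe_sqrt(x, tol=1e-9):
    """Square root that treats tiny negative round-off as zero.

    `tol` is an assumed threshold: values below -tol are considered
    genuinely negative and still raise.
    """
    if x < -tol:
        raise ValueError(f"value {x!r} is negative beyond tolerance {tol!r}")
    return math.sqrt(max(x, 0.0))


residue = -2.2204e-16  # magnitude of the rolling_sum residue reported above
try:
    math.sqrt(residue)
except ValueError as exc:
    print("plain sqrt fails:", exc)

print(safe_sqrt(residue))
```

Clamping hides the symptom only; whether it is acceptable depends on how large a legitimate negative value could be in the data.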

@stinodego added the needs triage label on Jan 13, 2024
@c-peters added the accepted label on Feb 24, 2025
@c-peters added this to Backlog on Feb 24, 2025
@c-peters moved this to Done in Backlog on Feb 24, 2025