resample becomes non-deterministic, depending on DateTimeIndex values #28675

Closed
haeusser opened this issue Sep 30, 2019 · 6 comments · Fixed by #31809

haeusser commented Sep 30, 2019

Minimal Example

import datetime as dt
import numpy as np
import pandas as pd


def np_to_df(data, start_time):
    index = pd.DatetimeIndex(
        [start_time + dt.timedelta(milliseconds=t) for t in range(len(data))])
    df = pd.DataFrame(data, index=index)
    return df


# generate sample data
data = np.sin(np.arange(1000) / 30)

# create DataFrames with DateTimeIndices
df_1 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 41))
df_2 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 42))

# print difference before resampling
print("error_1-2:", np.mean(np.abs(df_1.values - df_2.values)))

# resample
df_1 = df_1.resample("19ms").mean()
df_2 = df_2.resample("19ms").mean()

# print difference after resampling
print("error_1-2:", np.mean(np.abs(df_1.values - df_2.values)))

Output:

error_1-2: 0.0
error_1-2: 0.04119868246404099

Problem description

When you give the exact same data to the resample function, it becomes non-deterministic if the DateTimeIndex has differing values - even though the frequency is the same.

Expected Output

The values of the two DataFrames should be exactly the same.

Output of pd.show_versions()

  • commit : None
  • python : 3.6.8.final.0
  • python-bits : 64
  • OS : Linux
  • OS-release : 4.15.0-51-generic
  • machine : x86_64
  • processor : x86_64
  • byteorder : little
  • LC_ALL : None
  • LANG : C.UTF-8
  • LOCALE : en_US.UTF-8
  • pandas : 0.25.1
  • numpy : 1.17.2
  • pytz : 2019.2
  • dateutil : 2.8.0
  • pip : 9.0.1
  • setuptools : 41.0.1
  • Cython : None
  • pytest : 4.4.0
  • hypothesis : None
  • sphinx : None
  • blosc : None
  • feather : None
  • xlsxwriter : None
  • lxml.etree : 4.3.3
  • html5lib : 0.999999999
  • pymysql : None
  • psycopg2 : 2.7.7 (dt dec pq3 ext lo64)
  • jinja2 : 2.10.1
  • IPython : 7.1.1
  • pandas_datareader: None
  • bs4 : None
  • bottleneck : None
  • fastparquet : None
  • gcsfs : None
  • lxml.etree : 4.3.3
  • matplotlib : 3.1.1
  • numexpr : None
  • odfpy : None
  • openpyxl : None
  • pandas_gbq : None
  • pyarrow : None
  • pytables : None
  • s3fs : None
  • scipy : 1.3.1
  • sqlalchemy : 1.3.7
  • tables : None
  • xarray : None
  • xlrd : None
  • xlwt : None
  • xlsxwriter : None

Happy about any help, @jreback ?

Author

haeusser commented Oct 2, 2019

PUSH - this issue is causing serious problems for me. Happy about any feedback.

Contributor

jreback commented Oct 2, 2019

> PUSH - this issue is causing serious problems for me. Happy about any feedback.

your issue has been open for 3 days

we have 3000, when / if someone has a chance they can look

Contributor

nrebena commented Oct 3, 2019

If you look at df_1 and df_2 after resampling, you'll see that one starts at 2019-09-30 09:40:59.984 and the other at 2019-09-30 09:41:59.986. Resampling does not start at the first value of the series, but at the start of the day. So when resampling, you are not necessarily grouping the data the way you think you are (in your specific case it would work if the resampling period were a divisor of one minute).
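
The alignment described above can be checked with plain datetime arithmetic. The following is a minimal sketch, not pandas' actual implementation: `first_bin_edge` is a hypothetical helper, and it assumes the bins are laid out in fixed 19 ms steps counted from midnight of the first sample's day.

```python
import datetime as dt


def first_bin_edge(first_ts, period):
    """Left edge of the bin holding first_ts, with bins anchored at midnight."""
    midnight = dt.datetime.combine(first_ts.date(), dt.time())
    # How far first_ts lies past the most recent bin edge.
    offset = (first_ts - midnight) % period
    return first_ts - offset


period = dt.timedelta(milliseconds=19)
start_1 = dt.datetime(2019, 9, 30, 9, 41)
start_2 = dt.datetime(2019, 9, 30, 9, 42)

edge_1 = first_bin_edge(start_1, period)
edge_2 = first_bin_edge(start_2, period)

print(edge_1.time(), start_1 - edge_1)  # 09:40:59.984000 0:00:00.016000
print(edge_2.time(), start_2 - edge_2)  # 09:41:59.986000 0:00:00.014000
```

The first sample of df_1 sits 16 ms into its bin while the first sample of df_2 sits 14 ms into its bin, so identical values end up grouped differently.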

Maybe what you want to do is bin the values every 19 samples. This would give what you expect. Or look at the base option of resample.

But resampling is pretty deterministic.
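
The suggestion to bin the values every 19 samples could be sketched as follows. This is an illustration under assumptions, not code from the thread: `mean_every_n_samples` is a hypothetical helper that groups rows purely by position, so the timestamps play no role in the grouping.

```python
import datetime as dt

import numpy as np
import pandas as pd


def mean_every_n_samples(df, n):
    """Average consecutive blocks of n rows, ignoring the index values."""
    block_ids = np.arange(len(df)) // n  # 0 for rows 0..n-1, 1 for the next n, ...
    return df.groupby(block_ids).mean()


data = np.sin(np.arange(1000) / 30)
index_1 = pd.date_range(dt.datetime(2019, 9, 30, 9, 41), periods=len(data), freq="ms")
index_2 = pd.date_range(dt.datetime(2019, 9, 30, 9, 42), periods=len(data), freq="ms")

binned_1 = mean_every_n_samples(pd.DataFrame(data, index=index_1), 19)
binned_2 = mean_every_n_samples(pd.DataFrame(data, index=index_2), 19)

print("error_1-2:", np.mean(np.abs(binned_1.values - binned_2.values)))
```

Note that the result is indexed by block number rather than by timestamp, and this only matches a 19 ms resample when the samples are evenly spaced at 1 ms with no gaps.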

Author

haeusser commented Oct 4, 2019 via email

Contributor

nrebena commented Oct 4, 2019

The data are definitely binned in the same way, no matter the first value. You can think of the bins as a fixed discretization of the time axis: they do not depend on the values, which is how it should be.

Depending on what you really want to do, you could also look at pandas.cut and define your own bins with a DatetimeIndex or something similar.
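
One possible reading of the pandas.cut suggestion is sketched below. It is an assumption about what was meant, not code from the thread: `mean_in_own_bins` is a hypothetical helper that builds 19 ms bin edges starting at each frame's own first timestamp and averages the rows that fall into each bin.

```python
import datetime as dt

import numpy as np
import pandas as pd


def mean_in_own_bins(df, width="19ms"):
    """Average df over left-closed bins of the given width, starting at df.index[0]."""
    step = pd.Timedelta(width)
    # Number of bins needed to cover the span from first to last timestamp.
    n_bins = int((df.index[-1] - df.index[0]) / step) + 1
    edges = pd.date_range(start=df.index[0], periods=n_bins + 1, freq=step)
    bins = pd.cut(df.index, bins=edges, right=False)
    return df.groupby(bins, observed=True).mean()


data = np.sin(np.arange(1000) / 30)
index_1 = pd.date_range(dt.datetime(2019, 9, 30, 9, 41), periods=len(data), freq="ms")
index_2 = pd.date_range(dt.datetime(2019, 9, 30, 9, 42), periods=len(data), freq="ms")

cut_1 = mean_in_own_bins(pd.DataFrame(data, index=index_1))
cut_2 = mean_in_own_bins(pd.DataFrame(data, index=index_2))

print("error_1-2:", np.mean(np.abs(cut_1.values - cut_2.values)))
```

Because the edges are derived from each frame's first timestamp, both frames are split into the same groups of rows even though their timestamps differ by a minute.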

jbrockmendel added the Resample label Oct 16, 2019

Member

hasB4K commented Apr 11, 2020

#31809 should help fix this. I have reproduced your code using the new origin argument:

import numpy as np
import pandas as pd
import datetime as dt

def np_to_df(data, start_time):
    index = pd.DatetimeIndex(
        [start_time + dt.timedelta(milliseconds=t) for t in range(len(data))])
    df = pd.DataFrame(data, index=index)
    return df

data = np.sin(np.arange(1000) / 30)

df_1 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 41))
df_2 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 42))

print("error_1-2:", np.mean(np.abs(df_1.values - df_2.values)))

df_1_resampled = df_1.resample("19ms", origin="start").mean()
df_2_resampled = df_2.resample("19ms", origin="start").mean()

print("error_1-2:", np.mean(np.abs(df_1_resampled.values - df_2_resampled.values)))

Output:

error_1-2: 0.0
error_1-2: 0.0

EDIT: use start option on origin.
