DataFrame.interpolate() extrapolates over trailing missing data #8000

Closed
grahamjeffries opened this issue Aug 12, 2014 · 17 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@grahamjeffries
Contributor

See also the discussion at StackOverflow.

Linear interpolation on a series with missing data at the end of the array will overwrite trailing missing values with the last non-missing value. In effect, the function extrapolates rather than strictly interpolating.

Example:

import pandas as pd
import numpy as np

a = pd.Series([np.nan, 1, np.nan, 3, np.nan])
a.interpolate()

Yields (note the extrapolated 3 at index 4):

0   NaN
1     1
2     2
3     3
4     3
dtype: float64

not

0   NaN
1     1
2     2
3     3
4   NaN
dtype: float64

I believe the fix is something along the lines of changing lines 1545:1546 in core/common.py from

result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid], yvalues[firstIndex:][valid])

to

result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid], yvalues[firstIndex:][valid], np.nan, np.nan)
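
To illustrate what the two extra arguments do, here is a minimal standalone sketch using only NumPy. The fourth and fifth positional arguments of np.interp are left and right, the values returned for points outside the range of known x coordinates. The array and variable names below are made up for the illustration and are not the ones in core/common.py.

```python
import numpy as np

# Illustrative data only: NaN marks the missing points.
y = np.array([np.nan, 1.0, np.nan, 3.0, np.nan])
inds = np.arange(len(y))
valid = ~np.isnan(y)

# Default: points outside the known range are clamped to the edge values.
default = np.interp(inds, inds[valid], y[valid])

# With left/right set to NaN, points outside the known range stay missing.
strict = np.interp(inds, inds[valid], y[valid], left=np.nan, right=np.nan)

print(default)  # [1. 1. 2. 3. 3.]
print(strict)   # [nan  1.  2.  3. nan]
```

Note that the pandas code slices from firstIndex before calling np.interp, so there only the trailing end is affected; in this standalone sketch both ends are.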
@jreback
Contributor

jreback commented Aug 12, 2014

pls show a complete reproducible example (e.g. copy-pastable code).

Then, if you would like to do a pull request, that would be great. These examples serve as the basis for a test, which should fail without a fix and pass after.

@TomAugspurger
Contributor

Traveling back today. I can take a look this weekend.

I'd like to see what the behavior was before I refactored this stuff.

@jreback
Contributor

jreback commented Sep 9, 2014

@TomAugspurger can you circle back on this?

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014
@TomAugspurger
Contributor

OK, so this is the same behavior as back in 0.11 before I refactored all the interpolate stuff.

>>> pd.__version__
>>> s = pd.Series([np.nan, 1, np.nan, 3, np.nan])
>>> s
0   NaN
1     1
2   NaN
3     3
4   NaN
dtype: float64
>>> s.interpolate()
0   NaN
1     1
2     2
3     3
4     3
dtype: float64

I'll look into adding an argument to handle the NaNs before and after. The default will have to stay the same for now, I think. Possibly switch to the "correct" default of not extrapolating later on.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@Jezzamonn

is there a work around for now?

@jluttine

@Jezzamonn One workaround solution: http://stackoverflow.com/questions/25255496/dataframe-interpolate-extrapolates-over-trailing-missing-data/33390872#33390872
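
For readers who cannot follow the link, one possible workaround is sketched below. It is an assumption on my part and not necessarily the approach in the linked answer: interpolate as usual, then mask out every position that has no valid value at or after it in the original series. The names s, filled, and result are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, 3, np.nan])

# Default interpolation also fills the trailing NaN with the last valid value.
filled = s.interpolate()

# s.bfill() stays NaN only where no valid value follows, i.e. at the
# trailing missing positions; use that as a mask to blank them out again.
result = filled.where(s.bfill().notna())

print(result)
```

Leading NaNs are untouched either way, since the default interpolation does not fill them.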

@cancan101
Contributor

Any updates on this?

@jreback
Contributor

jreback commented Mar 30, 2016

@cancan101 there is a closed PR (not merged) #8010 / #8013 which I believe was almost there. If you want to rebase and see where it is, that would be great.

@jorisvandenbossche
Member

Given that the filling of the trailing values does not follow the specified method, but just forward fills, I think we could consider this as a bug. However, of course, still a bug that people could rely upon, so not sure whether we should just change the behaviour.

@relonger

This is definitely a bug. All new pandas users will find this behaviour confusing and error-prone (as I just did). If there is code that relies on this bug, that means there is a bug in that code too. You should fix it.
Interpolate means interpolate, not extrapolate in any way.

@jreback
Contributor

jreback commented Nov 12, 2017

> You should fix it.

@relonger welcome to have a PR for this.

this PR actually does provide for this option: #16513

welcome to have a look at it, seems stalled.

@willweil
Contributor

Just curious if there are any updates on this issue? As of pandas 0.20.3 this is still puzzling behaviour. See StackOverflow.

@jreback
Contributor

jreback commented Feb 20, 2019

see
35812ea

might be able to close this issue

@willweil
Contributor

willweil commented Feb 20, 2019

@jreback Thanks for the link. But I just tried one of the test examples in commit 35812ea and I didn't get the expected result as in the test:

>>> pd.__version__
'0.20.3'
>>> s = pd.Series([np.nan, np.nan, 3, np.nan, np.nan, np.nan, 7, np.nan, np.nan])
>>> s
0    NaN
1    NaN
2    3.0
3    NaN
4    NaN
5    NaN
6    7.0
7    NaN
8    NaN
dtype: float64
>>> s.interpolate(method='linear', limit_area='inside')
0    NaN
1    NaN
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    7.0
8    7.0
dtype: float64

Any ideas? Should I try a newer version of pandas?

EDIT:
Also tried a newer version of pandas, '0.22.0', but still didn't get the expected results. The pandas documentation says limit_area is a new feature in version 0.21.0+. Any ideas?

>>> pd.__version__
'0.22.0'

@willweil
Contributor

@jreback UPDATE: limit_area works as expected in pandas 0.23.0+, but not in 0.21.0 or 0.22.0. Maybe the pandas documentation has a typo, as it marks limit_area as "New in version 0.21.0."?
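
For reference, a minimal sketch of the behaviour described above, assuming pandas 0.23.0 or newer (where limit_area is reported to work). It reuses the series from the earlier test example; the variable name inside is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 3, np.nan, np.nan, np.nan, 7, np.nan, np.nan])

# limit_area="inside" fills only NaNs surrounded by valid values,
# so leading and trailing NaNs are left alone (no extrapolation).
inside = s.interpolate(method="linear", limit_area="inside")

print(inside)
```

On such a version the gap between 3 and 7 is filled linearly while the two leading and two trailing NaNs remain, which is the strict interpolation originally asked for in this issue.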

@jreback
Contributor

jreback commented Feb 20, 2019

yeah it looks like a typo; this change is in 0.23

would love a PR to update!

willweil added a commit to willweil/pandas that referenced this issue Feb 22, 2019
…terpolate'] which is the docstring for pandas.core.resample.Resampler.interpolate, pandas.DataFrame.interpolate, pandas.Series.interpolate, and pandas.Panel.interpolate. Reference can be found at pandas-dev#8000
willweil added a commit to willweil/pandas that referenced this issue Feb 22, 2019
…data about the limit_area keyword argument in interpolate(). The reference can be found at pandas-dev#8000 (comment).
@simonjayhawkins
Member

> yeah it looks like a typo; this change is in 0.23
>
> would love a PR to update!

xref #25418
