Intermittent data shortens predict_insample() date range #718
Comments
Hi @tg2k! The input dataframe is expected to be balanced: it has a complete set of observations (rows) between the first and last dates for each time series for the given frequency. Even with this workaround, the model will not be accurate, as it expects a complete input with information for all the dates. The best solution is to balance your data beforehand, completing the missing rows. You can impute the missing data with an appropriate imputation method for your task. Alternatively, you can fill them with 0s and add the column Let me know if this helps!
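As a rough illustration of the balancing step suggested above, the sketch below fills in the missing rows of a long-format frame with pandas. It assumes the usual `unique_id` / `ds` / `y` column layout; the helper name `balance_panel` and the indicator column name `observed` are hypothetical and chosen only for this example, and filling with 0 is just one of the options mentioned (an imputation method may suit the task better).

```python
import pandas as pd

def balance_panel(df, freq, id_col="unique_id", time_col="ds", target_col="y"):
    """Insert rows for missing timestamps so each series is contiguous at `freq`.

    Hypothetical helper for illustration; not part of neuralforecast.
    """
    parts = []
    for uid, group in df.groupby(id_col):
        # Every timestamp the series should have between its first and last date.
        full_index = pd.date_range(group[time_col].min(), group[time_col].max(), freq=freq)
        filled = group.set_index(time_col).reindex(full_index)
        filled.index.name = time_col
        # Hypothetical indicator column: 1 for original rows, 0 for filled-in rows.
        filled["observed"] = filled.index.isin(group[time_col]).astype(int)
        # Fill the gaps with 0s (swap in an imputation method if more appropriate).
        filled[target_col] = filled[target_col].fillna(0.0)
        filled[id_col] = uid
        parts.append(filled.reset_index())
    return pd.concat(parts, ignore_index=True)

# Daily series with 2024-01-03 and 2024-01-04 missing.
df = pd.DataFrame({
    "unique_id": ["A", "A", "A"],
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
    "y": [3.0, 1.0, 2.0],
})
balanced = balance_panel(df, freq="D")
```

After this step the frame has one row per day for each series, so the number of rows matches the span between the first and last dates.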
Following the second very helpful recommendation from @cchallu fixed the following exception, and I really appreciate their help. But I think the expectation could be communicated more clearly, and perhaps some kind of check and warning could be fed back to the user during training? I found it frustrating that the error was so vague and occurred well after I had fit the model.
I am getting the same bug.
What happened + What you expected to happen
It seems that intermittent data causes issues with `predict_insample()`. If any intervals are missing all data, then the problem becomes clear: the range of dates returned will be shortened based on how densely populated the date range is.
A cursory reading of https://nixtla.github.io/neuralforecast/examples/intermittentdata.html suggests that this scenario should work, though various pages like https://nixtla.github.io/neuralforecast/examples/getting_started_complete.html#evaluate-the-models-performance say otherwise. It may be helpful for the former page to have a note that the data must still be contiguous overall even if some of it is sparse.
The rest applies only if this is something to address in neuralforecast.
I believe the problems start with how `self.last_dates` is set:
There is no corresponding `self.first_dates`, which could possibly be used to interpolate dates on the expected frequency (`self.freq`). When `predict_insample()` is called, it calls `_insample_dates()` with a `len_series` value that is then used with `_cv_dates()` to produce a date range via `pd.date_range()` using the end date, with a number of periods based on the size of the data set rather than on its actual range.
The `_prepare_fit()` method also has a comment:
From the Git blame, perhaps it's talking about #348 / #354?
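The effect described above can be reproduced with pandas alone. This is a simplified sketch of the pattern (an end date plus a row count fed to `pd.date_range()`), not the library's actual code; the variable names and sample dates are made up for illustration.

```python
import pandas as pd

# Observed timestamps for one series: daily frequency, with gaps.
observed = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-10"])

# Range rebuilt from the last date and the number of rows.
# With 4 rows, this only reaches back 4 days from the end date.
from_count = pd.date_range(end=observed.max(), periods=len(observed), freq="D")

# Range rebuilt from the first and last dates, which covers the true span.
from_range = pd.date_range(start=observed.min(), end=observed.max(), freq="D")

print(from_count[0].date(), len(from_count))
print(from_range[0].date(), len(from_range))
```

The count-based range starts at 2024-01-07 rather than 2024-01-01, so the sparser the data, the more the reconstructed range is shortened. Tracking a first date per series, as suggested above, would be one way to recover the full span.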
One workaround is to convert from using dates in the `ds` column to using ints. I've included code to demonstrate the issue, with code blocks marked off that both trigger the issue and also work around it.
Versions / Dependencies
1.6.1
Reproduction script
Issue Severity
Low: It annoys or frustrates me.