id column in the result of make_forecasting_frame have only (id, ?) as identifier #1077

Closed
heib6xinyu opened this issue Jun 7, 2024 · 4 comments

heib6xinyu commented Jun 7, 2024

The problem: the make_forecasting_frame function returns a frame and a y whose id column has the format (id, 1), (id, 2), rather than what the documentation describes ("For identifying every subsequence, tsfresh uses the time stamp of the point that will be predicted together with the old identifier as “id”.").

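import numpy as np
import pandas as pd
from tsfresh.utilities.dataframe_functions import make_forecasting_frame

# Step 1: Create Dummy Data
np.random.seed(42)  # For reproducibility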

# Create a date range
date_range = pd.date_range(start='2023-01-01', periods=100, freq='D')

# Create dummy product IDs
product_ids = ['P001', 'P002', 'P003']

# Generate dummy data
data = []
for product_id in product_ids:
    for date in date_range:
        data.append([product_id, date, np.random.uniform(10, 100)])

# Create a DataFrame
df = pd.DataFrame(data, columns=['id', 'timestamp', 'price'])

# Create the forecasting frame
df_forecasting, y = make_forecasting_frame(df['price'], kind="price", max_timeshift=5, rolling_direction=1)

# Display the first few rows of the forecasting frame
print("\nForecasting Frame:")
print(df_forecasting.head())
print("\nTarget Variable (y):")
print(y.head())

You should be able to reproduce my problem with the code above.

Rolling: 100%|██████████| 300/300 [00:02<00:00, 132.49it/s]

Forecasting Frame:
        id  time      value   kind
1  (id, 1)     0  43.708611  price
3  (id, 2)     0  43.708611  price
4  (id, 2)     1  95.564288  price
6  (id, 3)     0  43.708611  price
7  (id, 3)     1  95.564288  price

Target Variable (y):
(id, 1)    95.564288
(id, 2)    75.879455
(id, 3)    63.879264
(id, 4)    24.041678
(id, 5)    24.039507
Name: value, dtype: float64

As you can see, the identifiers are only (id, time), so I have no way to tell which product the data actually belongs to.

Anything else we need to know?: I was hoping the result would look like (P001, 1), ..., (P003, 4), but I don't know how to get that.

Environment:

  • Python version: python 3.10
  • Operating System: windows 10
  • tsfresh version: 0.20.1
  • Install method (conda, pip, source): pip install tsfresh
@heib6xinyu heib6xinyu added the bug label Jun 7, 2024
@heib6xinyu heib6xinyu changed the title make_forecasting_frame frame and y shape not align. Sorry this is a wrong issue Jun 7, 2024
@heib6xinyu heib6xinyu changed the title Sorry this is a wrong issue id column in the result of make_forecasting_frame Jun 7, 2024
@heib6xinyu heib6xinyu changed the title id column in the result of make_forecasting_frame id column in the result of make_forecasting_frame have only (id, ?) as identifier Jun 7, 2024
@heib6xinyu
Author

I looked through the source code and found that the 'id' value is hard-coded there. I copied the function to my local machine and modified it to accept an id argument that replaces the hard-coded 'id'. It works for me now.

@nils-braun
Collaborator

Hi @heib6xinyu,
yes, your observation is correct. The input to the make_forecasting_frame function is a single time series (technically a pandas Series, not a DataFrame), so it is not meant to be used for multiple time series (e.g. the input data cannot even have an id column, because it is not a DataFrame).
The sentence you quote from the docs ("For identifying every subsequence, tsfresh uses the time stamp of the point that will be predicted together with the old identifier as “id”.") actually refers to the more general (and more powerful) roll_time_series function (sorry if this is not clear, happy for any PR to fix this!).

As you already looked into the code, you have probably seen that the make_forecasting_frame function is just forwarding to the roll_time_series function and I would also recommend using this for anything more "complex". The make_forecasting_frame function is really just a convenience function for one single use-case :)

@heib6xinyu
Author

Yes, roll_time_series is definitely more powerful. I just like make_forecasting_frame because it automatically produces a matching y. For some specific use cases (like my project with multiple time series and features), I simply modify make_forecasting_frame locally, so I can call it in a loop grouped by id and feature and then concat the results. Definitely less efficient, but easier for my lazy self lol.

@heib6xinyu heib6xinyu reopened this Jun 8, 2024
@heib6xinyu
Author

Oh oops, I accidentally hit the reopen button on my tiny cell phone screen, don't mind me. Sorry for the trouble.
