Aggregate fails with time column #222

Closed
znicholls opened this issue Apr 22, 2019 · 8 comments · Fixed by #236
Labels: bug, question, datetime (Compatibility with `datetime` time format)

Comments

@znicholls (Collaborator)

Describe the bug

If you create an IamDataFrame with datetime time columns and call its aggregate method, you get a KeyError. This should work the same as it does for a dataframe with integer year columns.

>>> from datetime import datetime
>>> 
>>> import pandas as pd
>>> 
>>> from pyam import IamDataFrame
>>> 
>>> eg_df = pd.DataFrame([
...     ['model_a', 'scen_a', 'World', 'Primary Energy|Gas', 'EJ/y', 1, 6.],
...     ['model_a', 'scen_a', 'World', 'Primary Energy|Coal', 'EJ/y', 0.5, 3],
...     ['model_a', 'scen_b', 'World', 'Primary Energy|Gas', 'EJ/y', 2, 7],
... ],
...     columns=['model', 'scenario', 'region', 'variable', 'unit', 2005, 2010],
... )
>>> year_df = IamDataFrame(data=eg_df)
>>> year_df.timeseries()
                                                  2005  2010
model   scenario region variable            unit            
model_a scen_a   World  Primary Energy|Coal EJ/y   0.5   3.0
                        Primary Energy|Gas  EJ/y   1.0   6.0
        scen_b   World  Primary Energy|Gas  EJ/y   2.0   7.0
>>> 
>>> year_df.aggregate('Primary Energy', append=True)
>>> year_df.timeseries()
                                                  2005  2010
model   scenario region variable            unit            
model_a scen_a   World  Primary Energy      EJ/y   1.5   9.0
                        Primary Energy|Coal EJ/y   0.5   3.0
                        Primary Energy|Gas  EJ/y   1.0   6.0
        scen_b   World  Primary Energy      EJ/y   2.0   7.0
                        Primary Energy|Gas  EJ/y   2.0   7.0
>>> 
>>> dt_df = IamDataFrame(data=eg_df.rename(
...     {2005: datetime(2005, 1, 1), 2010: datetime(2010, 6, 1)},
...     axis="columns"
... ))
>>> dt_df.timeseries()
                                                  2005-01-01  2010-06-01
model   scenario region variable            unit                        
model_a scen_a   World  Primary Energy|Coal EJ/y         0.5         3.0
                        Primary Energy|Gas  EJ/y         1.0         6.0
        scen_b   World  Primary Energy|Gas  EJ/y         2.0         7.0
>>> 
>>> dt_df.aggregate('Primary Energy', append=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zebedeenicholls/Documents/AGCEC/Misc/pyam-zn-fork/pyam/core.py", line 700, in aggregate
    df_components = _aggregate_by_variables(self.data, components, units)
  File "/Users/zebedeenicholls/Documents/AGCEC/Misc/pyam-zn-fork/pyam/core.py", line 1319, in _aggregate_by_variables
    return df.groupby(GROUP_IDX).sum()['value']
  File "/Users/zebedeenicholls/miniconda3/envs/pyam-zn-fork/lib/python3.6/site-packages/pandas/core/generic.py", line 7632, in groupby
    observed=observed, **kwargs)
  File "/Users/zebedeenicholls/miniconda3/envs/pyam-zn-fork/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2110, in groupby
    return klass(obj, by, **kwds)
  File "/Users/zebedeenicholls/miniconda3/envs/pyam-zn-fork/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 360, in __init__
    mutated=self.mutated)
  File "/Users/zebedeenicholls/miniconda3/envs/pyam-zn-fork/lib/python3.6/site-packages/pandas/core/groupby/grouper.py", line 578, in _get_grouper
    raise KeyError(gpr)
KeyError: 'year'

Failing Test

Change meta_df to test_df in all tests in tests/test_feature_aggregate.py.
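A hedged sketch of what that change looks like, assuming (as in pyam's test suite) that test_df is a fixture parametrised over both integer-year and datetime time columns, while meta_df carries years only; the test name here is illustrative:

```python
def test_aggregate_with_time_column(test_df):  # previously used the meta_df fixture
    # with the datetime parametrisation this currently raises KeyError: 'year'
    test_df.aggregate('Primary Energy', append=True)
    assert len(test_df.filter(variable='Primary Energy').data) > 0
```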

System (please complete the following information):

  • OS: macOS
  • Python and pyam version: Python 3.6.7, 0.1.2+53.gb0e6562.dirty

Additional context

The failure occurs because the definition of GROUP_IDX is static: line 1319 of pyam/core.py always looks for a year column, which doesn't exist in a datetime-based dataframe. I think there are two possible solutions here:

  1. As was done with LONG_IDX, make GROUP_IDX an attribute of IamDataFrame. It would make sense to also move YEAR_IDX, REGION_IDX and SORT_IDX at the same time in order to avoid the same failure in other places. (I would suggest moving all indexes to attributes of IamDataFrame for consistency; a sketch follows this list.)
  2. Drop year support and always use time columns. Given earlier discussions I think this is unlikely to happen, but this bug is another place where it would be a sensible solution. For completeness, dropping year support has a couple of extra arguments in its favour: it is always possible to recover year information from a datetime, and it would make the meaning of the time column uniform (always a datetime). That would also force users with other time information to put it in better-named columns (e.g. 'Summer evening' would have to go in a season-time-of-day column to be more explicit); it would be annoying to filter time by year only to have it explode because time is full of strings.
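A minimal sketch of option 1, with illustrative names only (pyam's actual GROUP_IDX columns may differ): instead of a static module-level constant, the groupby index is derived from whichever time column the data carries.

```python
def group_idx(data):
    """Return the groupby index for long-format data that has either
    a 'year' (int) or 'time' (datetime) column."""
    time_col = 'time' if 'time' in data.columns else 'year'
    return ['model', 'scenario', 'region', time_col]

# the failing line in _aggregate_by_variables would then read, e.g.:
#   return df.groupby(group_idx(df)).sum()['value']
```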
znicholls added the bug, question and datetime (Compatibility with `datetime` time format) labels on Apr 22, 2019
@znicholls (Collaborator, Author)

@gidden @danielhuppmann I'm happy to write an implementation once we know which solution (of the above or something different again) fits best with other developments.

@gidden (Member) commented Apr 22, 2019

Hi @znicholls, I am personally OK with option 2, especially with a simple helper function to switch the time column back to integer years (sketched below). We could further make this a configurable option in the plotting config files (i.e. how to show the time domain in plots) to completely recover the existing behavior.
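A hedged sketch of such a helper on the long-format data (the name and signature are illustrative, not an existing pyam API):

```python
import pandas as pd

def swap_time_for_year(data):
    """Replace a datetime 'time' column with an integer 'year' column
    in a long-format dataframe, e.g. for year-based plotting."""
    out = data.copy()
    out['year'] = pd.to_datetime(out['time']).dt.year
    return out.drop(columns='time')
```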

I agree that barring that, option 1 is also fine.

@danielhuppmann thoughts?

@danielhuppmann (Member)

I'm hesitant about option 2 because I fear a rabbit hole of unforeseen additional tweaks required to keep the current use cases fully operational.

Re the static index lists, I wouldn't just migrate all of them, but would first review whether they are really necessary. They were added haphazardly during the hot development phase, and some of them could probably be refactored or merged away.

Re replacing meta_df by test_df, that would be a worthy goal in itself, because meta_df is now used for many tests that don't relate to the meta table.

@znicholls (Collaborator, Author)

OK, shall I have a go at option 1 then, and clean out these static index lists as much as I can?

> Re replacing meta_df by test_df, that would be a worthy goal in itself, because meta_df is now used for many tests that don't relate to the meta table.

In all tests or just this one test?

@danielhuppmann (Member)

Thanks @znicholls for volunteering - but let's tackle this one step at a time. First see if you can remove the bifurcation between year as int vs. time as datetime; let's discuss that in a PR, and then replace the static index lists and tests one at a time.

@znicholls (Collaborator, Author)

> First see if you can remove the bifurcation between year as int vs. time as datetime

What does this mean? Do you mean replace meta_df by test_df in the tests and see what happens?

@danielhuppmann (Member)

Apologies, I was mixing up options 1 and 2 (thinking you wanted to have a go at refactoring to use datetime throughout)...

Please go ahead and clean up the index lists and switch tests from meta_df to test_df step by step, and issue a PR when you think it makes sense to review.

@znicholls (Collaborator, Author)

Got it, cheers.
