Aggregate fails with time column #222

Closed
znicholls opened this issue Apr 22, 2019 · 8 comments · Fixed by #236
Labels: bug, question, datetime (Compatibility with `datetime` time format)

Comments

@znicholls (Collaborator)

Describe the bug

If you create an IamDataFrame with datetime time columns and call its aggregate method, you get a KeyError. This should work the same as it does for a dataframe with integer year columns.

>>> from datetime import datetime
>>> 
>>> import pandas as pd
>>> 
>>> from pyam import IamDataFrame
>>> 
>>> eg_df = pd.DataFrame([
...     ['model_a', 'scen_a', 'World', 'Primary Energy|Gas', 'EJ/y', 1, 6.],
...     ['model_a', 'scen_a', 'World', 'Primary Energy|Coal', 'EJ/y', 0.5, 3],
...     ['model_a', 'scen_b', 'World', 'Primary Energy|Gas', 'EJ/y', 2, 7],
... ],
...     columns=['model', 'scenario', 'region', 'variable', 'unit', 2005, 2010],
... )
>>> year_df = IamDataFrame(data=eg_df)
>>> year_df.timeseries()
                                                  2005  2010
model   scenario region variable            unit            
model_a scen_a   World  Primary Energy|Coal EJ/y   0.5   3.0
                        Primary Energy|Gas  EJ/y   1.0   6.0
        scen_b   World  Primary Energy|Gas  EJ/y   2.0   7.0
>>> 
>>> year_df.aggregate('Primary Energy', append=True)
>>> year_df.timeseries()
                                                  2005  2010
model   scenario region variable            unit            
model_a scen_a   World  Primary Energy      EJ/y   1.5   9.0
                        Primary Energy|Coal EJ/y   0.5   3.0
                        Primary Energy|Gas  EJ/y   1.0   6.0
        scen_b   World  Primary Energy      EJ/y   2.0   7.0
                        Primary Energy|Gas  EJ/y   2.0   7.0
>>> 
>>> dt_df = IamDataFrame(data=eg_df.rename(
...     {2005: datetime(2005, 1, 1), 2010: datetime(2010, 6, 1)},
...     axis="columns"
... ))
>>> dt_df.timeseries()
                                                  2005-01-01  2010-06-01
model   scenario region variable            unit                        
model_a scen_a   World  Primary Energy|Coal EJ/y         0.5         3.0
                        Primary Energy|Gas  EJ/y         1.0         6.0
        scen_b   World  Primary Energy|Gas  EJ/y         2.0         7.0
>>> 
>>> dt_df.aggregate('Primary Energy', append=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zebedeenicholls/Documents/AGCEC/Misc/pyam-zn-fork/pyam/core.py", line 700, in aggregate
    df_components = _aggregate_by_variables(self.data, components, units)
  File "/Users/zebedeenicholls/Documents/AGCEC/Misc/pyam-zn-fork/pyam/core.py", line 1319, in _aggregate_by_variables
    return df.groupby(GROUP_IDX).sum()['value']
  File "/Users/zebedeenicholls/miniconda3/envs/pyam-zn-fork/lib/python3.6/site-packages/pandas/core/generic.py", line 7632, in groupby
    observed=observed, **kwargs)
  File "/Users/zebedeenicholls/miniconda3/envs/pyam-zn-fork/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2110, in groupby
    return klass(obj, by, **kwds)
  File "/Users/zebedeenicholls/miniconda3/envs/pyam-zn-fork/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 360, in __init__
    mutated=self.mutated)
  File "/Users/zebedeenicholls/miniconda3/envs/pyam-zn-fork/lib/python3.6/site-packages/pandas/core/groupby/grouper.py", line 578, in _get_grouper
    raise KeyError(gpr)
KeyError: 'year'

Failing Test

Change meta_df to test_df in all tests in tests/test_feature_aggregate.py.
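A hedged sketch of what that change looks like, assuming (as in pyam's test suite) that test_df is a fixture parametrised over both integer-year and datetime time columns, while meta_df carries years only; the test name here is illustrative:

```python
def test_aggregate_with_time_column(test_df):  # previously used the meta_df fixture
    # with the datetime parametrisation this currently raises KeyError: 'year'
    test_df.aggregate('Primary Energy', append=True)
    assert len(test_df.filter(variable='Primary Energy').data) > 0
```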

System (please complete the following information):

  • OS: macOS
  • Python and pyam version: Python 3.6.7, 0.1.2+53.gb0e6562.dirty

Additional context

The failure occurs because the definition of GROUP_IDX is static: line 1319 of pyam/core.py always looks for a year column, which doesn't exist in a datetime-based dataframe. I think there are two possible solutions here:

  1. As was done with LONG_IDX, make GROUP_IDX an attribute of IamDataFrame. It would make sense to also move YEAR_IDX, REGION_IDX and SORT_IDX at the same time in order to avoid the same failure in other places. (I would suggest moving all indexes to attributes of IamDataFrame for consistency; a sketch follows this list.)
  2. Drop year support and always use time columns. Given earlier discussions I think this is unlikely to happen, but this bug is another place where it would be a sensible solution. For completeness, dropping year support has a couple of extra arguments in its favour: it is always possible to recover year information from a datetime, and it would make the meaning of the time column uniform (always a datetime). That would also force users with other time information to put it in better-named columns (e.g. 'Summer evening' would have to go in a season-time-of-day column to be more explicit); it would be annoying to filter time by year only to have it explode because time is full of strings.
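A minimal sketch of option 1, with illustrative names only (pyam's actual GROUP_IDX columns may differ): instead of a static module-level constant, the groupby index is derived from whichever time column the data carries.

```python
def group_idx(data):
    """Return the groupby index for long-format data that has either
    a 'year' (int) or 'time' (datetime) column."""
    time_col = 'time' if 'time' in data.columns else 'year'
    return ['model', 'scenario', 'region', time_col]

# the failing line in _aggregate_by_variables would then read, e.g.:
#   return df.groupby(group_idx(df)).sum()['value']
```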
znicholls added the bug, question and datetime (Compatibility with `datetime` time format) labels on Apr 22, 2019
@znicholls (Collaborator, Author)

@gidden @danielhuppmann I'm happy to write an implementation once we know which solution (of the above or something different again) fits best with other developments.

@gidden (Member) commented Apr 22, 2019

Hi @znicholls, I am personally OK with option 2, especially with a simple helper function to switch the time column back to integer years (sketched below). We could further make this a configurable option in the plotting config files (i.e. how to show the time domain in plots) to completely recover the existing behavior.
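A hedged sketch of such a helper on the long-format data (the name and signature are illustrative, not an existing pyam API):

```python
import pandas as pd

def swap_time_for_year(data):
    """Replace a datetime 'time' column with an integer 'year' column
    in a long-format dataframe, e.g. for year-based plotting."""
    out = data.copy()
    out['year'] = pd.to_datetime(out['time']).dt.year
    return out.drop(columns='time')
```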

I agree that barring that, option 1 is also fine.

@danielhuppmann thoughts?

@danielhuppmann (Member)

I'm hesitant about option 2 because I fear a rabbit hole of unforeseen additional tweaks required to keep the current use cases fully operational.

Re the static index lists, I wouldn't just migrate all of them, but would first review whether they are really necessary. They were added haphazardly during the hot development phase, and some of them could probably be refactored or merged away.

Re replacing meta_df by test_df, that would be a worthy goal in itself, because meta_df is now used for many tests that don't relate to the meta table.

@znicholls (Collaborator, Author)

OK, shall I have a go at option 1 then, and clean out these static index lists as much as I can?

> Re replacing meta_df by test_df, that would be a worthy goal in itself, because meta_df is now used for many tests that don't relate to the meta table.

In all tests or just this one test?

@danielhuppmann (Member)

Thanks @znicholls for volunteering - but let's tackle this one step at a time. First see if you can remove the bifurcation between year as int vs. time as datetime; let's discuss that in a PR, and then replace the static index lists and tests one at a time.

@znicholls (Collaborator, Author)

> First see if you can remove the bifurcation between year as int vs. time as datetime

What does this mean? Do you mean replace meta_df by test_df in the tests and see what happens?

@danielhuppmann (Member)

Apologies, I was mixing up options 1 and 2 (thinking you wanted to have a go at refactoring to use datetime throughout)...

Please go ahead and clean up the index lists and switch tests from meta_df to test_df step by step, and issue a PR when you think it makes sense to review.

@znicholls (Collaborator, Author)

Got it, cheers.
