Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify time handling in multi-model statistics preprocessor #685

Merged
merged 40 commits into from
Jan 14, 2021

Conversation

Peter9192
Copy link
Contributor

@Peter9192 Peter9192 commented Jun 24, 2020

This PR simplifies the way time is handled in the multimodel statistics preprocessor. This should make it easier to understand what the code does, and hence to debug it.

These changes are inspired by some recent unexpected behavior when cubes with different calendars were passed to the function (#665). The first important change, therefore, is to fix the calendar for the output cubes to a standard format. Simultaneously, I added a function that makes all input cubes also use this same standard calendar.

Having consistent time arrays make time indexing much easier. Previously, complex mappings were needed to align the indexes representing the same time points in different cubes. Now, we can just get any time from any cube, without worrying too much about the index. Furthermore, I aligned the code between 'assemble overlap data' and 'assemble full data' to the extent that they could be merged into a single function.

Tasks

  • Create an issue: discussed here
  • This pull request has a descriptive title that can be used in a changelog
  • Add unit tests
  • If writing a new/modified preprocessor function, please update the documentation Not necessary
  • Circle/CI tests pass. Status can be seen below your pull request. If the tests are failing, click the link to find out why.
  • Codacy code quality checks pass. Status can be seen below your pull request. If there is an error, click the link to find out why. If you suspect Codacy may be wrong, please ask by commenting.

@Peter9192 Peter9192 marked this pull request as ready for review June 24, 2020 15:26
@Peter9192 Peter9192 requested a review from valeriupredoi June 24, 2020 15:26
@Peter9192
Copy link
Contributor Author

@valeriupredoi whenever you have time (I'm not in a rush)

@valeriupredoi
Copy link
Contributor

hey @Peter9192 great stuff mate! I am sorry I took a while to review, been on leave until today. A few questions:

  • you are now forcing time points with units of days since 1850-01-01... from eg new_times = reduce(np.intersect1d, time_spans) for overlap; _set_common_calendar() is taking care of aligning all cubes time points with correct units (days from 1850...) and standard calendar - this is great! But I would rename the function to something like _regrid_time_points_on_common_axis() or something like that since the func does much more than setting a common calendar;
  • the reason why I wanted to use the actual regrid_time() from _time was that that one was doing a bit more than resetting time points - setting time bounds (which the current output of MM will not have) and correcting for the aux coords too (which shouldn't be a problem here I'm guessing); if you just add a guess bounds call I think it should be fine.

Anyways two small nicks, it's a very nice optimization you did imho! 🍺

Copy link
Contributor

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice job, Peter 🍺

@valeriupredoi
Copy link
Contributor

one more thing - can you pls add the info in the documentation that describes the time regridding within the module?

@Peter9192
Copy link
Contributor Author

Thanks @valeriupredoi !

  • the reason why I wanted to use the actual regrid_time()

Yeah, I wanted to use that as well. But then it got messy, since regrid_time doesn't change the calendar, so then we'd be modifying the calendar here and the actual time points there (and it's just 2 lines...). At first I copied all the stuff about aux-coords as well, but then I noticed this never makes it to _put_in_cube() anyway, so I left it out again. I reinstated the guess_bounds 🍺

@valeriupredoi
Copy link
Contributor

cool! ready to go, @mattiarighi could you give it a test please mate 🍺

@valeriupredoi
Copy link
Contributor

@Peter9192 could you pls fix the conflict from the new master mate, as much as I like the new changes I would feel comfortable to merge only after @mattiarighi has tested with his magic perfmetrics standards 🍺

@mattiarighi
Copy link
Contributor

merge only after @mattiarighi has tested with his magic perfmetrics standards

Doing this right now... ⌛

@mattiarighi
Copy link
Contributor

It looks good to me. 👍

@Peter9192
Copy link
Contributor Author

Peter9192 commented Jan 7, 2021

Okay I had a first look at the monthly data tests only. These are my findings:

  • FAILED tests/sample_data/multimodel_statistics/test_multimodel.py::test_multimodel_regression_month[full]
    • Issue 1: Data are different. It is accurate to 7 digits, but not to 8. Could be similar to what's discussed above. Changing the statistic from 'mean' to 'median' as suggested by Mattia solves the issue on the data.
    • Issue 2: Coordinates are different. This is largely on purpose:
      • This PR adds bounds (as discussed here)
      • This PR sets the time to the 15th of each month instead of the first, consistent with regrid_time.
      • The new cube is missing the 'var_name' attribute in the time coord though, I added it back.
  • FAILED tests/sample_data/multimodel_statistics/test_multimodel.py::test_multimodel_regression_month[overlap]
    • Issue here is coords being different:
      • Different calendars (this PR resets all calendars to standard, old uses the first cube it encounters)
      • Different time points (this PR sets everything to the 15th, old ended up every 16th on 00 or 12 utc)
      • Different bounds (this should be addressed in Better time bounds in regrid_time preprocessor #803)
      • Lazy data (this PR has no chuncksizes attribute, old code did). This is ok (see here).

The new regression test for the daily statistics of a 365-day calendar
failed because some days had time points at 00:00 and others at 12:00.

Since we're planning to move time handling to regrid time altogether,
this behaviour is consistent with regrid_time (see issue #744).
@Peter9192
Copy link
Contributor Author

Differences in time coordinates in daily regression tests:

  • Time points: daily reference data for the regression tests have time points at 12:00, whereas this PR now resets them to 00:00, consistent with _regrid_time. I propose to keep the current behaviour and update the reference cubes. Address this as part of Move temporal 'regridding' from multi_model_statistics preprocessor to regrid_time preprocessor #744 if another decision is preferred.
  • Time attributus: daily reference data for the regression tests sometimes have legacy attributes like _ChunkSizes and time_origin. This PR doesn't reproduce that as the time coordinate is reconstructed from the ground up in all cases, instead of sometimes using one of the input cubes' time coord. I propose to keep the current behaviour and update the reference cubes.
  • Long name on time coordinate: in the reference it is sometimes "time", sometimes "Time axis". New behaviour: "time". I propose to update the reference cubes.

Then there are some differences in the actual data (only for span=full) that I still need to look into.

@Peter9192
Copy link
Contributor Author

Then there are some differences in the actual data (only for span=full) that I still need to look into.

This is due to a bug in the current implementation (on which the reference data was based). I made an issue #937. Reference data should be updated.

The reference data contains some metadata that is redundant or inconsistent.
After updating the reference data, this can be reverted.
@Peter9192
Copy link
Contributor Author

Peter9192 commented Jan 12, 2021

* Issue 1: Data are different. It is accurate to 7 digits, but not to 8. Could be similar to what's [discussed above](https://github.com/ESMValGroup/ESMValCore/pull/685#issuecomment-698253145). Changing the statistic from 'mean' to 'median' as suggested by Mattia solves the issue on the data.

I also found out where this difference comes from: in the current master branch, the data is passed over from the original cubes into a new array here:

new_arr = np.ma.empty([len(cubes)] + list(new_shape[1:]))

The incoming data is float32, whereas the new data type is float64. Consequently, the statistics are calculated on the float64 data, introducing an unrealistically high precision. Later on the dtype is converted again to float32, but at that point, the damage is already done.

I've created a new PR #941 to correct this change and to update the reference data. That way we can confirm that the current PR doesn't break the preprocessor.

@Peter9192
Copy link
Contributor Author

Peter9192 commented Jan 12, 2021

Okay, so I think I've found all the quirks that surfaced after the introduction of the new regression tests. In summary:

To do:

@stefsmeets can you check my diagnosis and approve the proposed solution(s)?

@Peter9192
Copy link
Contributor Author

Okay, so now you can see that the only remaining failures are the daily sample data tests:

FAILED tests/sample_data/multimodel_statistics/test_multimodel.py::test_multimodel_regression_day[full-365_day]
FAILED tests/sample_data/multimodel_statistics/test_multimodel.py::test_multimodel_regression_day[full-gregorian]
FAILED tests/sample_data/multimodel_statistics/test_multimodel.py::test_multimodel_regression_day[full-proleptic_gregorian]

This is because of #937. If you look at the test output, you can see that the shapes are inconsistent (2 (old) vs 62 (new) days in 2 months).

I will now revert the coordinate check. This will introduce more failures due to different time attributes etc. as discussed above.

@Peter9192
Copy link
Contributor Author

Peter9192 commented Jan 13, 2021

Indeed, now there are 8 failures. E.g. AssertionError: assert DimCoord(arra...r_name='time') == DimCoord(arra...unkSizes': 1}) due the differences explained here.

Next I've update the reference data so that:

  • The daily files are fixed
  • The time points and bounds are consistent with regrid time
  • The time coordinate attributes are cleaner and consistent

@Peter9192
Copy link
Contributor Author

Peter9192 commented Jan 13, 2021

All tests pass now. @stefsmeets and/or @nielsdrost can you do a final check?

Copy link
Contributor

@stefsmeets stefsmeets left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Co-authored-by: Stef Smeets <stefsmeets@users.noreply.github.com>
@stefsmeets stefsmeets self-requested a review January 14, 2021 09:15
Copy link
Contributor

@stefsmeets stefsmeets left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, nice work!

@nielsdrost nielsdrost merged commit ec6aa6f into master Jan 14, 2021
@nielsdrost nielsdrost deleted the mmstats_simplify_time branch January 14, 2021 09:33
@nielsdrost
Copy link
Member

Good work @Peter9192 and @stefsmeets and @bouweandela and @valeriupredoi !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working preprocessor Related to the preprocessor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants