Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct AveragedTimeInterval to use actuations #3721

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

liuchihl
Copy link

This is a cleanup version of the closed PR #3717

@hdrake
Copy link
Contributor

hdrake commented Aug 21, 2024

@liuchihl, thanks for cleaning up these changes by separating them from the background flux PR—it's much clearer now.

Consolidating @glwagner and @navidcy's earlier comments, it seems there are three things that need to be done before this can be merged:

  1. Review the the existing OutputWriter tests and verify that they still pass with the new implementation in this PR
  2. Create a new, more rigorous, test that is capable of flagging the bizarre behavior you found in your issue but (hopefully) now passes thanks to the changes in this branch.
  3. Add some warnings to let users know that TimeInterval and AveragedTimeInterval (and probably other diagnostic schedules) are currently broken and give incorrect results after picking up from a checkpoint whenever the checkpoint interval is not an integer multiple of the scheduled time interval.

@glwagner
Copy link
Member

glwagner commented Aug 22, 2024

Add some warnings to let users know that TimeInterval and AveragedTimeInterval (and probably other diagnostic schedules) are currently broken and give incorrect results after picking up from a checkpoint whenever the checkpoint interval is not an integer multiple of the scheduled time interval.

Can we make it so this warning gets thrown only if one is using a Checkpointer? I think the majority of simulations do not use a Checkpointer so the warning would be irrelevant in most cases. Maybe we should put the warning within the checkpointer constructor.

Checkpointing certainly needs love. I think it's only used for barebones stuff right now, not complicated simulations. To be fully featured we have to somehow have a system for checkpointing all callbacks. It's not just AveragedTimeInterval that would have a problem.

@hdrake
Copy link
Contributor

hdrake commented Aug 22, 2024

I think the majority of simulations do not use a Checkpointer so the warning would be irrelevant in most cases.

I don't get this. How are people not using a Checkpointer? Is no one else limited by HPC wall times or running long simulations? It seems like one of the most fundamental capabilities of any time-stepped numerical model.

But yes, this warning should only be issued if both OutputWriter and Checkpointers are being used and if the checkpointer interval is not an integer multiple of the OutputWriter interval.

@glwagner
Copy link
Member

glwagner commented Aug 22, 2024

I think the majority of simulations do not use a Checkpointer so the warning would be irrelevant in most cases.

I don't get this. How are people not using a Checkpointer? Is no one else limited by HPC wall times or running long simulations? It seems like one of the most fundamental capabilities of any time-stepped numerical model.

But yes, this warning should only be issued if a Checkpointer is used (and maybe also when a simulation picks up from an existing Checkpoint).

I think for sophisticated research Checkpointing is common, but for simpler classroom and LES applications the checkpointer is used less. After all, probably the most simulations are actually run in our examples on CI -- and there are no examples with a checkpointer! (It would be nice to change that)

I can't speak for others, but for boundary layer parameterization work the LES typically run in less than 24 hours of wall time. We also only utilize very simple diagnostics, like the horizontally-averaged solution at the final time step. So in those rare cases that we need a checkpointer (I have used a handful of times) barebones checkpointing is sufficient.

Of course we are currently working on building a OMIP simulation and that will require much longer runs, so we will definitely need more sophisticated checkpointing very soon.

@simone-silvestri and @tomchor might have more to add. Or @sandreza, what do you use for the neverworld work?

I'm not saying we don't want to develop this, I'm just providing some context about why this hasn't been resolved / developed yet.

In an ideal world the simulations would run fast enough that we wouldn't need checkpointing, after all 😄

@tomchor
Copy link
Collaborator

tomchor commented Aug 22, 2024

I think for sophisticated research Checkpointing is common, but for simpler classroom and LES applications the checkpointer is used less. After all, probably the most simulations are actually run in our examples on CI -- and there are no examples with a checkpointer! (It would be nice to change that)

Agreed!

I can't speak for others, but for boundary layer parameterization work the LES typically run in less than 24 hours of wall time. We also only utilize very simple diagnostics, like the horizontally-averaged solution at the final time step. So in those rare cases that we need a checkpointer (I have used a handful of times) barebones checkpointing is sufficient.

Of course we are currently working on building a OMIP simulation and that will require much longer runs, so we will definitely need more sophisticated checkpointing very soon.

@simone-silvestri and @tomchor might have more to add. Or @sandreza, what do you use for the neverworld work?

For context, 100% of my simulations have used checkpoints. As far as I know, 100% of the simulations from others in my group also use checkpoints. The only exceptions for my case are very early scripts still in the development phase, and still with very coarse grids. As soon as I try to get more serious with it, I need checkpoints. So this a PR I'm very much looking forward to seeing merged ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants