Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add restart reproducibility test into repro CI #30

Merged
merged 1 commit into from
Mar 26, 2024

Conversation

jo-basevi
Copy link
Collaborator

@jo-basevi jo-basevi commented Mar 22, 2024

Enable restart reproducibility checksum checks - as detailed in #11

I've added the marker checksum to the restart checksum tests, added a model-specific checksum comparison method (as format of checksums may different depending on the model).

For access-om2 model, it is basically doing the same check as before, i.e any field and checksum value in the 2-day model run is in either of the first or second 1 day model runs.

I have manually tested it with a 1deg_jra55_ryf configuration where I’ve just updated the executables paths to point to the corresponding pre-release locations with the fix - config.yaml used. Currently it's failing the test_bit_repro_historical test - so the checksums (historical-3hr-checksum.json) are different to checksums saved on the release configuration branch, is that to be expected?

Also have noticed the tests are taking ages to run - as it's mostly waiting in queues - wondered whether its worth dropping the qsub walltime and memory for short tests. But that might not be the best idea as different configurations may need varying resources..

With adding the checksum marker, this pytest will be run as part of the reproducibility checks. It outputs some checksum files if it fails, I haven't added them to the checksum output sub-directory as that is rsynced to github. Should it be rsynced to github? I don't think it should be added to the testing/checksum on each configuration branch. Another todo, is maybe update the reproducibility fail messages in the CI as the test can fail in either the historical checksums or the restart test.

@aidanheerdegen
Copy link
Member

I have manually tested it with a 1deg_jra55_ryf configuration where I’ve just updated the executables paths to point to the corresponding pre-release locations with the fix - config.yaml used. Currently it's failing the test_bit_repro_historical test - so the checksums (historical-3hr-checksum.json) are different to checksums saved on the release configuration branch, is that to be expected?

No. Given that the checksum was generated automatically by the workflow @CodeGat created I'm surprised. It should be the same.

https://github.com/ACCESS-NRI/access-om2-configs/blob/release-1deg_jra55_ryf/testing/checksum/historical-3hr-checksum.json

However it does sounds like the restart reproducibility is working, is that right?

Also have noticed the tests are taking ages to run - as it's mostly waiting in queues - wondered whether its worth dropping the qsub walltime and memory for short tests. But that might not be the best idea as different configurations may need varying resources..

That is a good suggestion. I don't know if it's possible to do a way that it might be possible to accomodate different requirements.

Another possibility is to use an express queue. If the resources used are pretty modest we might not care too much, but in the long run if we do a lot of these tests it might add up. So reducing resource requests is the best approach in the first instance.

With adding the checksum marker, this pytest will be run as part of the reproducibility checks. It outputs some checksum files if it fails, I haven't added them to the checksum output sub-directory as that is rsynced to github. Should it be rsynced to github?

I don't think so. A user can always go and find the files on gadi if the details of the failure are important to investigate.

Another todo, is maybe update the reproducibility fail messages in the CI as the test can fail in either the historical checksums or the restart test.

Good idea. @CodeGat might have some ideas about that.

Copy link
Member

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @jo-basevi!

As time is tight I think we should merge and leave improving throughput of the tests to a later date.

@access-hive-bot
Copy link

This pull request has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/access-om2-bit-repro-testing/1960/1

@jo-basevi jo-basevi deleted the 11-Add-restart-reproducibility-tests branch May 13, 2024 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants