-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up "orog" program (mtnlm7_oclsm.f) to eliminate segmentation faults on MACOS #545
Clean up "orog" program (mtnlm7_oclsm.f) to eliminate segmentation faults on MACOS #545
Conversation
Can we get a unit or consistency check for this? |
@edwardhartnett I don't see any documentation on creating consistency checks or unit tests for this repository, can you point me in the right direction? |
@GeorgeGayno-NOAA can you please help with testing for this code? |
@mkavulich I can help test. Did you open an issue for this work? |
The "grid_gen" consistency test creates (./reg_tests/grid_gen) several model grids. This orog code is run as one component of this test. So running the consistency tests would be a good first step. We should expect no differences, right? |
I ran the consistency tests on WCOSS-Dell. All passed. |
@mkavulich That code also has a lot of unused variables (I use "-warn unused" for Intel). Feel free to remove them if you want. |
! integers | ||
allocate (JST(JM),JEN(JM),KPDS(200),KGDS(200),numi(jm)) | ||
allocate (lonsperlat(jm/2)) | ||
allocate (IST(IM,jm),IEN(IM,jm),ZSLMX(2700,1350)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IST and IEN can be deallocated after the call to MAKEOA
|
||
deallocate (GICE) | ||
|
||
allocate (OCLSM(IM,JM),SLMI(IM,JM)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OCLSM does not appear to be deallocated.
@@ -914,6 +921,11 @@ SUBROUTINE TERSUB(IMN,JMN,IM,JM,NM,NR,NF0,NF1,NW,EFAC,BLAT, | |||
C | |||
C COMPUTE MOUNTAIN DATA : OA OL | |||
C | |||
allocate (IWORK(IM,JM,4)) | |||
allocate (OA(IM,JM,4),OL(IM,JM,4),HPRIME(IM,JM,14)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these deallocated? Line 925
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the deleted response, I began replying and realized I was wrong but hit "comment" by accident.
These are used through the end of the subroutine and so deallocating them there saves no memory. I also did not bother deallocating smaller 2d and 1d arrays; I believe the fortran standard does allow this as variables are deallocated once they are out of scope (such as at the end of a subroutine). But for completeness and consistency I can change all allocatable variables to be explicitly deallocated.
@GeorgeGayno-NOAA Thanks for the comments and follow-up about testing; I will run the grid_gen consistency check on Hera and get back to you, but no difference in result should be expected. |
@GeorgeGayno-NOAA I made some of your suggested updates to remove unused variables and deallocate all allocated variables. Consistency checks on Hera passed for the latest version (/scratch2/BMC/det/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS/reg_tests/grid_gen). Let me know if there are any other concerns. |
Hi @GeorgeGayno-NOAA, @mkavulich, this PR looks to have been sufficiently reviewed and tested, passing the consistency checks on (at least) WCOSS and Hera. Are there any other changes or tests required? |
Let me run the consistency checks on Orion and Jet. We formally support those machines as well and WCOSS and Hera. |
Consistency checks pass on all supported machines. |
@mkavulich Your branch is out-of-date with 'develop'. |
… I have no idea why this fixes the segfaults, but it does
… results in a 30% reduction in memory footprint!
…se assignment of one array to another out of a loop for faster execution
…usly where ZAVG should have been initialized from glob.
2e82e0c
to
daf0744
Compare
@GeorgeGayno-NOAA I have updated my branch to be even with develop; the consistency checks still pass for Hera (/scratch2/BMC/det/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS_new/reg_tests/grid_gen) I can run them on Jet and Orion as well but I do not have WCOSS access. I notice that the automated github actions tests are failing due to a timeout in sfc_climo_gen (which I have not modified); is this a known problem? |
Sometimes that test seems to hang. I will rerun it. |
Yes, please run on Orion and Jet. I will run on WCOSS. |
@mkavulich Did your merge from 'develop' go OK? Something odd seems to have happened. |
@GeorgeGayno-NOAA I rebased my changes rather than merging them, could that have caused some issues for you? I'm running the checks on Orion and Jet now. |
I had cloned your fork and checked out your branch. When you made your update, I did a 'git remote update' and 'git pull' on your branch. And I was prompted to do a merge and commit. It was odd. I re-cloned your fork to do my testing. 'develop' was updated again this morning, so you will need to do another merge. |
Okay, hopefully last needed update has been applied. Tests successful for Orion (/work/noaa/gsd-fv3-test/kavulich/UFS/workdir/UFS_UTILS_PR_545/UFS_UTILS/reg_tests/grid_gen) Jet (/mnt/lfs4/HFIP/dtc-hurr/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS_update2/reg_tests/grid_gen) and Hera (/scratch2/BMC/det/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS_update2/reg_tests/grid_gen). |
WCOSS is back online. I will test there. |
Ran the "grid_gen" consistency tests (using 8c98729) on WCOSS-Cray and WCOSS-Dell. All passed. |
Problem
When adapting the develop branch of the ufs-srweather-app to the MACOS platform, I found that the orog program would consistently exhibit segmentation faults for unknown reasons with my compiler (GNU 9.4.0 homebrew). We have seen this before for certain older fortran programs using GNU compilers when declaring variables as a size of a variable value in certain instances (e.g. #245). I was able to determine this was occurring during the variable declaration/allocation step in the "TERSUB" subroutine.
Solution
Given the impressively discombobulated state of that code, I decided it was best to reorganize the variable declaration section, change all variable-sized declared arrays to allocatable arrays, and allocate/deallocate those arrays as needed. This fixed the segmentation fault issue for MACOS, and had the added benefit of reducing the peak memory usage of this program by ~30%!
I also removed some unnecessary looping through indices for the assignment of values in matrices; I expected this would speed up the code but it ended up having no effect. Still is better coding practice so I left those changes in.
Testing
From what I can tell, there is no regression test for this program yet; I assume this is because this program is only needed for regional applications. I ran several end-to-end tests on Hera and Cheyenne (intel compilers) to ensure bit-for-bit results, and this was confirmed for a wide variety of tests. Let me know if there are any specific cases/domains/settings that reviewers would like me to test.
Fixes #550