Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up "orog" program (mtnlm7_oclsm.f) to eliminate segmentation faults on MACOS #545

Merged
merged 18 commits into from
Jul 23, 2021

Conversation

mkavulich
Copy link
Contributor

@mkavulich mkavulich commented Jul 2, 2021

Problem

When adapting the develop branch of the ufs-srweather-app to the MACOS platform, I found that the orog program would consistently exhibit segmentation faults for unknown reasons with my compiler (GNU 9.4.0 homebrew). We have seen this before for certain older fortran programs using GNU compilers when declaring variables as a size of a variable value in certain instances (e.g. #245). I was able to determine this was occurring during the variable declaration/allocation step in the "TERSUB" subroutine.

Solution

Given the impressively discombobulated state of that code, I decided it was best to reorganize the variable declaration section, change all variable-sized declared arrays to allocatable arrays, and allocate/deallocate those arrays as needed. This fixed the segmentation fault issue for MACOS, and had the added benefit of reducing the peak memory usage of this program by ~30%!

I also removed some unnecessary looping through indices for the assignment of values in matrices; I expected this would speed up the code but it ended up having no effect. Still is better coding practice so I left those changes in.

Testing

From what I can tell, there is no regression test for this program yet; I assume this is because this program is only needed for regional applications. I ran several end-to-end tests on Hera and Cheyenne (intel compilers) to ensure bit-for-bit results, and this was confirmed for a wide variety of tests. Let me know if there are any specific cases/domains/settings that reviewers would like me to test.

Fixes #550

@edwardhartnett
Copy link
Collaborator

Can we get a unit or consistency check for this?

@mkavulich
Copy link
Contributor Author

@edwardhartnett I don't see any documentation on creating consistency checks or unit tests for this repository, can you point me in the right direction?

@edwardhartnett
Copy link
Collaborator

@GeorgeGayno-NOAA can you please help with testing for this code?

@GeorgeGayno-NOAA
Copy link
Collaborator

@mkavulich I can help test. Did you open an issue for this work?

@GeorgeGayno-NOAA
Copy link
Collaborator

@edwardhartnett I don't see any documentation on creating consistency checks or unit tests for this repository, can you point me in the right direction?

The "grid_gen" consistency test creates (./reg_tests/grid_gen) several model grids. This orog code is run as one component of this test. So running the consistency tests would be a good first step. We should expect no differences, right?

@GeorgeGayno-NOAA
Copy link
Collaborator

@edwardhartnett I don't see any documentation on creating consistency checks or unit tests for this repository, can you point me in the right direction?

The "grid_gen" consistency test creates (./reg_tests/grid_gen) several model grids. This orog code is run as one component of this test. So running the consistency tests would be a good first step. We should expect no differences, right?

I ran the consistency tests on WCOSS-Dell. All passed.

@GeorgeGayno-NOAA
Copy link
Collaborator

@mkavulich That code also has a lot of unused variables (I use "-warn unused" for Intel). Feel free to remove them if you want.

! integers
allocate (JST(JM),JEN(JM),KPDS(200),KGDS(200),numi(jm))
allocate (lonsperlat(jm/2))
allocate (IST(IM,jm),IEN(IM,jm),ZSLMX(2700,1350))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IST and IEN can be deallocated after the call to MAKEOA


deallocate (GICE)

allocate (OCLSM(IM,JM),SLMI(IM,JM))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OCLSM does not appear to be deallocated.

@@ -914,6 +921,11 @@ SUBROUTINE TERSUB(IMN,JMN,IM,JM,NM,NR,NF0,NF1,NW,EFAC,BLAT,
C
C COMPUTE MOUNTAIN DATA : OA OL
C
allocate (IWORK(IM,JM,4))
allocate (OA(IM,JM,4),OL(IM,JM,4),HPRIME(IM,JM,14))
Copy link
Collaborator

@GeorgeGayno-NOAA GeorgeGayno-NOAA Jul 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these deallocated? Line 925

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the deleted response, I began replying and realized I was wrong but hit "comment" by accident.

These are used through the end of the subroutine and so deallocating them there saves no memory. I also did not bother deallocating smaller 2d and 1d arrays; I believe the fortran standard does allow this as variables are deallocated once they are out of scope (such as at the end of a subroutine). But for completeness and consistency I can change all allocatable variables to be explicitly deallocated.

@mkavulich
Copy link
Contributor Author

@GeorgeGayno-NOAA Thanks for the comments and follow-up about testing; I will run the grid_gen consistency check on Hera and get back to you, but no difference in result should be expected.

@mkavulich
Copy link
Contributor Author

@GeorgeGayno-NOAA I made some of your suggested updates to remove unused variables and deallocate all allocated variables. Consistency checks on Hera passed for the latest version (/scratch2/BMC/det/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS/reg_tests/grid_gen). Let me know if there are any other concerns.

@JeffBeck-NOAA
Copy link
Collaborator

Hi @GeorgeGayno-NOAA, @mkavulich, this PR looks to have been sufficiently reviewed and tested, passing the consistency checks on (at least) WCOSS and Hera. Are there any other changes or tests required?

@GeorgeGayno-NOAA
Copy link
Collaborator

Hi @GeorgeGayno-NOAA, @mkavulich, this PR looks to have been sufficiently reviewed and tested, passing the consistency checks on (at least) WCOSS and Hera. Are there any other changes or tests required?

Let me run the consistency checks on Orion and Jet. We formally support those machines as well and WCOSS and Hera.

@GeorgeGayno-NOAA
Copy link
Collaborator

Consistency checks pass on all supported machines.

@GeorgeGayno-NOAA
Copy link
Collaborator

@mkavulich Your branch is out-of-date with 'develop'.

@mkavulich
Copy link
Contributor Author

@GeorgeGayno-NOAA I have updated my branch to be even with develop; the consistency checks still pass for Hera (/scratch2/BMC/det/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS_new/reg_tests/grid_gen) I can run them on Jet and Orion as well but I do not have WCOSS access.

I notice that the automated github actions tests are failing due to a timeout in sfc_climo_gen (which I have not modified); is this a known problem?

@GeorgeGayno-NOAA
Copy link
Collaborator

@GeorgeGayno-NOAA I have updated my branch to be even with develop; the consistency checks still pass for Hera (/scratch2/BMC/det/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS_new/reg_tests/grid_gen) I can run them on Jet and Orion as well but I do not have WCOSS access.

I notice that the automated github actions tests are failing due to a timeout in sfc_climo_gen (which I have not modified); is this a known problem?

Sometimes that test seems to hang. I will rerun it.

@GeorgeGayno-NOAA
Copy link
Collaborator

@GeorgeGayno-NOAA I have updated my branch to be even with develop; the consistency checks still pass for Hera (/scratch2/BMC/det/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS_new/reg_tests/grid_gen) I can run them on Jet and Orion as well but I do not have WCOSS access.

I notice that the automated github actions tests are failing due to a timeout in sfc_climo_gen (which I have not modified); is this a known problem?

Yes, please run on Orion and Jet. I will run on WCOSS.

@GeorgeGayno-NOAA
Copy link
Collaborator

@mkavulich Did your merge from 'develop' go OK? Something odd seems to have happened.

@mkavulich
Copy link
Contributor Author

@GeorgeGayno-NOAA I rebased my changes rather than merging them, could that have caused some issues for you?

I'm running the checks on Orion and Jet now.

@GeorgeGayno-NOAA
Copy link
Collaborator

@GeorgeGayno-NOAA I rebased my changes rather than merging them, could that have caused some issues for you?

I'm running the checks on Orion and Jet now.

I had cloned your fork and checked out your branch. When you made your update, I did a 'git remote update' and 'git pull' on your branch. And I was prompted to do a merge and commit. It was odd. I re-cloned your fork to do my testing.

'develop' was updated again this morning, so you will need to do another merge.

@mkavulich
Copy link
Contributor Author

Okay, hopefully last needed update has been applied. Tests successful for Orion (/work/noaa/gsd-fv3-test/kavulich/UFS/workdir/UFS_UTILS_PR_545/UFS_UTILS/reg_tests/grid_gen) Jet (/mnt/lfs4/HFIP/dtc-hurr/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS_update2/reg_tests/grid_gen) and Hera (/scratch2/BMC/det/kavulich/workdir/UFS_UTILS_PR_545/UFS_UTILS_update2/reg_tests/grid_gen).

@GeorgeGayno-NOAA
Copy link
Collaborator

WCOSS is back online. I will test there.

@GeorgeGayno-NOAA
Copy link
Collaborator

Ran the "grid_gen" consistency tests (using 8c98729) on WCOSS-Cray and WCOSS-Dell. All passed.

@GeorgeGayno-NOAA GeorgeGayno-NOAA merged commit 0f5f3cb into ufs-community:develop Jul 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"orog" program seg faults on MACOS
4 participants