Building hpc-stack-nco branch in ubuntu 20.04 container with gcc 9 and mpich 3.3.2 #71

Closed
MinsukJi-NOAA opened this issue Nov 17, 2020 · 31 comments · Fixed by #104

Comments

@MinsukJi-NOAA
Contributor

Describe the bug
Using build_stack.sh fails to install hpc-stack-nco in ubuntu 20.04 container

To Reproduce

  1. (terminal 1) git clone --branch feature/hpc-stack-nco https://github.com/NOAA-EMC/hpc-stack
  2. (terminal 2) docker run --rm -it --name hpc-container noaaemc/ubuntu-base:v1
  3. (terminal 1) cd hpc-stack
  4. (terminal 1) docker cp . hpc-container:/home/builder/hpc-stack
  5. (terminal 2) cd hpc-stack
  6. (terminal 2) export HPC_MPI=mpich/3.3.2 &&
    ./build_stack.sh -p /home/builder/opt -c config/config_custom.sh -y config/stack_custom.yaml

Build fails with "[[: not found" error message.

Change all lib/build_*.sh from #!/bin/sh to #!/bin/bash
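For context, `[[ ... ]]` is a bash extension. Under dash, which Ubuntu uses as /bin/sh, it is looked up as a command named `[[`, which produces the "not found" message. Besides changing the shebang, the test can be written in POSIX form. A minimal sketch, using a hypothetical variable `MODE` and helper `is_flat` that are not taken from lib/build_*.sh:

```shell
#!/bin/sh
# Hypothetical illustration; MODE and is_flat are not names from hpc-stack.
MODE="${MODE:-flat}"

# Bash-only form (fails under dash with "[[: not found"):
#   if [[ "$MODE" == "flat" ]]; then ...

# POSIX form, accepted by dash and bash alike:
is_flat() {
    [ "$1" = "flat" ]
}

if is_flat "$MODE"; then
    echo "flat install"
else
    echo "hierarchical install"
fi
```

Switching the shebang to #!/bin/bash is the smaller change when the scripts already rely on other bash features.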

Repeat above steps and build fails with:

-- Could NOT find NetCDF_Fortran (missing: NetCDF_Fortran_LIBRARY NetCDF_Fortran_INCLUDE_DIR)
CMake Error at src/flib/CMakeLists.txt:275 (message):
Must have PnetCDF and/or NetCDF Fortran libraries
-- Configuring incomplete, errors occurred!
See also "/home/builder/hpc-stack/pkg/pio-2.5.1/build/CMakeFiles/CMakeOutput.log".
BUILD FAIL! Lib: pio Error:1

System:
Ubuntu 20.04, gcc 9, mpich 3.3.2

@MinsukJi-NOAA added the bug (Something isn't working) label Nov 17, 2020

@climbfuji
Contributor

Maybe because ubuntu is using dash instead of bash as default shell? Just a wild guess.

@aerorahul
Contributor

Just to be clear, this is an issue in the branch feature/hpc-stack-nco and not with develop.

@MinsukJi-NOAA
Contributor Author

MinsukJi-NOAA commented Nov 17, 2020 via email

@climbfuji
Contributor

I usually relink /usr/bin/sh to /usr/bin/bash - can you try?
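A rough sketch of that check and relink, assuming a container where /usr/bin/sh is a symlink (as on stock Ubuntu 20.04). The helper name `sh_target` is made up for this example, and the relink itself needs root, so it is left commented out:

```shell
# Print the binary that "sh" ultimately resolves to (dash on stock Ubuntu).
sh_target() {
    readlink -f "$(command -v sh)"
}
sh_target

# Relink sh to bash as suggested above; requires root privileges.
# ln -sf /usr/bin/bash /usr/bin/sh
```

Relinking changes the shell for every script in the container, so it is best done in the image build rather than by hand in a running container.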

@MinsukJi-NOAA
Contributor Author

That should work. But the second issue still remains: missing: NetCDF_Fortran_LIBRARY NetCDF_Fortran_INCLUDE_DIR

@aerorahul
Contributor

Why is this not raised in the CI which uses ubuntu 20.04?

@climbfuji
Contributor

> Why is this not raised in the CI which uses ubuntu 20.04?

It's possible that the CI version does this relinking, too - specifically to address the dash shortcomings. I remember having seen something like this in the past.

@aerorahul
Contributor

No such special treatment in the hpc-stack workflow.
https://github.com/NOAA-EMC/hpc-stack/blob/develop/.github/workflows/build_ubuntu.yaml

@aerorahul
Contributor

I am concerned that a branch that did not pass CI or any kind of review went into the UFS PR.
I am more concerned that the UFS will be susceptible to issues until this is resolved and the stack will be the one to be blamed.

@MinsukJi-NOAA
Contributor Author

MinsukJi-NOAA commented Nov 17, 2020

Current CI is using a container that does not use hpc-stack.
I tested a container with hpc-stack develop branch, and it worked fine.
The hpc-stack-nco branch, which takes care of the UPP, does not build, however.

@aerorahul
Contributor

Exactly my point. hpc-stack-nco branch was used in the UFS PR that just went in.

@MinsukJi-NOAA
Contributor Author

MinsukJi-NOAA commented Nov 17, 2020

If hpc-stack-nco was built on NOAA machines successfully, that's fine, as far as ufs-weather-model is concerned, I think. I am just encountering an issue trying to build hpc-stack-nco branch in a container.

@climbfuji
Contributor

I am not sure if any of this is helpful to solve the issues we have right now. I'll try something different.

Minsuk, these variables are usually set by FindNetCDF.cmake. Do you have the ominous export NCO_V=false in the stack config file (e.g. config/config_hera.sh)? Can you compare your config file with one of the existing ones where the stack built successfully?
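Before comparing config files, it may help to confirm that NetCDF-Fortran was actually installed under the stack prefix. The sketch below assumes the find module needs the Fortran module file netcdf.mod and a libnetcdff library, which is typical but not confirmed for this FindNetCDF.cmake. The helper name `check_netcdf_fortran` is made up for illustration, and the default prefix is the -p argument from the reproduction steps:

```shell
# PREFIX matches the -p argument used in the reproduction steps.
PREFIX="${PREFIX:-/home/builder/opt}"

# Report whether a NetCDF-Fortran module file and library exist under a prefix.
# Assumption: the find module needs netcdf.mod and libnetcdff.* to succeed.
check_netcdf_fortran() {
    prefix="$1"
    found_mod=$(find "$prefix" -name 'netcdf.mod' 2>/dev/null | head -n 1)
    found_lib=$(find "$prefix" -name 'libnetcdff.*' 2>/dev/null | head -n 1)
    echo "module:  ${found_mod:-missing}"
    echo "library: ${found_lib:-missing}"
    [ -n "$found_mod" ] && [ -n "$found_lib" ]
}

check_netcdf_fortran "$PREFIX" || echo "NetCDF-Fortran not found under $PREFIX"
```

If both files are present, the problem is more likely in how the search paths reach CMake than in the NetCDF build itself.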

@MinsukJi-NOAA
Contributor Author

diff hpc-stack/config/config_custom.sh hpc-nco/config/config_custom.sh
7a8
> export NCO_V=false

@MinsukJi-NOAA
Contributor Author

That didn't do it. Same error without export NCO_V=false

@edwardhartnett
Contributor

This must be resolved before #72

@MinsukJi-NOAA how do we reproduce this error on a regular ubuntu workstation?

@MinsukJi-NOAA
Contributor Author

@edwardhartnett, the 'steps to reproduce' above assume you have docker installed on your ubuntu workstation.

@edwardhartnett
Contributor

Can we do a 1.1 release of hpc-stack without resolving this issue? Or must this issue be resolved before the release?

@kgerheiser would like to do a release tomorrow or the next day...

@MinsukJi-NOAA
Contributor Author

@edwardhartnett, this issue is with the feature/hpc-stack-nco branch, and not with the develop branch. So, if #72 is not merging feature/hpc-stack-nco into develop, I don't see why this needs to be resolved before the 1.1 release.

@edwardhartnett
Contributor

OK, hpc-stack-nco is indeed going to be merged. So does this need to be resolved before the imminent 1.1.0 release?

@aerorahul
Contributor

I don't see how it will be merged in its current state.
There are multiple "features" in this branch that need to be their own PRs.

We can discuss this later today.

@edwardhartnett
Contributor

We believe the develop branch will work in the ubuntu container. Can you try this, please?

@MinsukJi-NOAA
Contributor Author

The develop branch worked in ubuntu container when I tested it two days ago. I was trying to use hpc-stack-nco branch because it has upp in it (which the weather model is using now).

If I use the develop branch, and specify upp in config files, would that work?

@kgerheiser
Contributor

UPP has been added to the develop branch

@MinsukJi-NOAA
Contributor Author

That's great! I will test the latest weather model in an ubuntu 20.04 container, and report the outcome.

@MinsukJi-NOAA
Contributor Author

MinsukJi-NOAA commented Nov 20, 2020

I verified that this works by installing hpc-stack/develop in ubuntu 20.04 container and running ufs-weather-model restart unit test.

@MinsukJi-NOAA
Contributor Author

Latest develop branch failed to build in ubuntu 20.04. Error messages are:

[ 39%] Building Fortran object sorc/ncep_post.fd/CMakeFiles/upp.dir/AVIATION.f.o
cd /home/builder/hpc-stack/pkg/upp-upp_v10.0.0/build/sorc/ncep_post.fd && /usr/bin/gfortran  -I/home/builder/opt/include_4 -I/home/builder/opt/include -I/usr/include/x86_64-linux-gnu/mpich  -g -fbacktrace -ffree-form -ffree-line-length-none -fconvert=big-endian -O3 -Jinclude   -fopenmp -c /home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/AVIATION.f -o CMakeFiles/upp.dir/AVIATION.f.o
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:106:15:

  106 |       allocate(QC_BL(im,jsta_2l:jend_2u,lm))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:202:15:

  202 |       allocate(hail_maxhailcast(im,jsta_2l:jend_2u))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:231:15:

  231 |       allocate(shdmin(im,jsta_2l:jend_2u))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:232:15:

  232 |       allocate(shdmax(im,jsta_2l:jend_2u))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:233:15:

  233 |       allocate(lai(im,jsta_2l:jend_2u))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:275:15:

  275 |       allocate(rainc_bucket1(im,jsta_2l:jend_2u))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:277:15:

  277 |       allocate(rainnc_bucket1(im,jsta_2l:jend_2u))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:279:15:

  279 |       allocate(pcp_bucket1(im,jsta_2l:jend_2u))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:281:15:

  281 |       allocate(snow_bucket1(im,jsta_2l:jend_2u))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/sorc/ncep_post.fd/ALLOCATE_ALL.f:283:15:

  283 |       allocate(graup_bucket1(im,jsta_2l:jend_2u))
      |               1
Error: Allocate-object at (1) is neither a data pointer nor an allocatable variable
make[2]: *** [sorc/ncep_post.fd/CMakeFiles/upp.dir/build.make:76: sorc/ncep_post.fd/CMakeFiles/upp.dir/ALLOCATE_ALL.f.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/build'
make[1]: *** [CMakeFiles/Makefile2:133: sorc/ncep_post.fd/CMakeFiles/upp.dir/all] Error 2
make[1]: Leaving directory '/home/builder/hpc-stack/pkg/upp-upp_v10.0.0/build'
make: *** [Makefile:130: all] Error 2
BUILD FAIL!  NCEPlib: upp Error:2

I did not write down which version passed my test on 11/20/2020, but I am guessing either 10a8431 or 941250a. I can find out which commit is the culprit if it is helpful.

@kgerheiser
Contributor

kgerheiser commented Nov 23, 2020

The issue is that it's building two UPPs, the one with the commit dcea26 and UPP v10.0.0.

They have the same Fortran module names (vrbls2d, etc), and the build of UPP v10.0.0 is picking up decea26's module and not its own.

The solution is to disable building nceppost.

Maybe we should disable that in the stack file that you are using? Is it stack_ufs_weather_ci.yaml? Do you need both of them?

@MinsukJi-NOAA
Contributor Author

MinsukJi-NOAA commented Nov 23, 2020

Thank you, @kgerheiser. No, I don't think so. I am building with nceppost disabled in stack_ufs_weather_ci.yaml right now. I will report back.

@kgerheiser
Contributor

It works fine when you build in a hierarchical structure with lmod, but when installing in a flat structure, like in a container, the names conflict.
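To see whether a flat install has this kind of clash, one option is to list module files whose names appear in more than one include directory. This is only a sketch: the helper name `duplicate_modules` is made up, and the two directories are taken from the -I flags in the build log above, which may or may not be where the conflicting modules actually live.

```shell
# List Fortran module files (*.mod) whose names occur in more than one
# of the given include directories.
duplicate_modules() {
    for dir in "$@"; do
        # Emit the bare file name of each module in this directory.
        for f in "$dir"/*.mod; do
            if [ -e "$f" ]; then
                basename "$f"
            fi
        done | sort -u
    done | sort | uniq -d
}

# Directories assumed from the -I flags in the failing compile line.
duplicate_modules /home/builder/opt/include /home/builder/opt/include_4
```

Any name printed is a module that the compiler could resolve from the wrong package, depending on the order of the -I flags.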

@MinsukJi-NOAA
Contributor Author

I was able to build the latest develop branch with nceppost disabled in stack_ufs_weather_ci.yaml.
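For anyone repeating this, a quick way to confirm which packages a stack file will build is sketched below. It assumes the stack file has top-level package keys with an indented `build:` entry (for example `nceppost:` followed by `  build: NO`). That layout is an assumption here and should be checked against the actual file. The helper name `build_setting` and the `STACK_FILE` variable are made up for this example.

```shell
# Assumed path, relative to the hpc-stack checkout.
STACK_FILE="${STACK_FILE:-config/stack_ufs_weather_ci.yaml}"

# Print the value of the "build:" entry under a top-level package key.
# Assumption: packages are top-level keys with an indented "build:" line.
build_setting() {
    awk -v pkg="$1" '
        /^[^ \t#]/ { in_pkg = ($0 ~ ("^" pkg ":")) }   # entering a top-level key
        in_pkg && $1 == "build:" { print $2; exit }
    ' "$2"
}

if [ -f "$STACK_FILE" ]; then
    echo "nceppost build: $(build_setting nceppost "$STACK_FILE")"
    echo "upp build:      $(build_setting upp "$STACK_FILE")"
fi
```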


5 participants