
Cloud setup


NOAA Cloud

NOAA RDHPCS Cloud access is via a gateway developed by Parallel Works. Currently the underlying operating system on Azure is RHEL7/CentOS 7, which reaches end of life in June 2024. "There is no plan for OS upgrade until 09/30/2023. We have suggested an OS version upgrade to RL8, which is being reviewed by NOAA RDHPCS." Once there is an OS update, some of the setup below will break (since it is dependent on CentOS 7).

Contents

  1. Log into Parallel works
  2. Create a snapshot
  3. Create a Resource
  4. Add code in User Bootstrap section of resource
  5. Create a compute partition (to run the model)
  6. Boot the resource
  7. Add/Run Rstudio workflow from marketplace
  8. Using the Parallel Works IDE
  9. Clone the neus-atlantis repo in /model
  10. Run Atlantis

Log into Parallel works

Parallel Works requires a CAC card and an account on the cloud. To request an account you'll need to log in to AIM, the Account Information Management System.

Create a snapshot

A snapshot results in a running container that is used to spin up your resource. By including code in the snapshot you avoid having to install software on the resource once it is up and running.

  • Click on yourName->account->cloud snapshots

  • Create a new snapshot

    • Type: Microsoft Azure
    • Snapshot Region: eastus
    • Base Image: Rocky 8
    • Name: Give the snapshot a name
  • Add the following code in the Build script field

sudo dnf -y install netcdf-devel
sudo dnf -y install nco
# install podman and pull images
sudo dnf -y install runc
sudo dnf -y install podman --allowerasing;
sudo podman login --username=andybeet --password=secretKey docker.io;
sudo podman pull docker.io/andybeet/atlantis:6536;
sudo podman pull docker.io/andybeet/atlantis:6665;
sudo podman pull docker.io/andybeet/atlantis:6681;

# required for some other R packages
sudo dnf -y install fontconfig-devel
sudo dnf -y install proj-devel
sudo dnf -y install gdal-devel
sudo dnf -y install geos-devel
sudo dnf -y install harfbuzz-devel
sudo dnf -y install fribidi-devel
sudo dnf -y install libjpeg-devel

# install packages ahead of time
sudo /usr/bin/Rscript -e "install.packages('remotes', repos='http://cran.rstudio.com/')";
sudo /usr/bin/Rscript -e "remotes::install_github('NOAA-EDAB/atlantisprocessing')";
sudo /usr/bin/Rscript -e "remotes::install_github('NOAA-EDAB/atlantisdiagnostics')";

This script installs netcdf, nco, podman, and several other development libraries. It then logs in to Dockerhub and pulls the Atlantis images for models 6536, 6665, and 6681, and pre-installs the remotes, atlantisprocessing, and atlantisdiagnostics R packages. Note: you'll need to replace secretKey with your own Dockerhub credentials.

  • Click "Save snapshot config"

  • Click "Provision snapshot" (this will create the image)

  • Confirm it built successfully. This can take up to an hour
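
Once a resource built from this snapshot is up and running, a quick check from the IDE terminal that the images were baked in (a minimal sketch, assuming podman installed cleanly during the build) is:

    # list the images pulled during the snapshot build
    sudo podman images
    # docker.io/andybeet/atlantis should appear with tags 6536, 6665, and 6681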

Create a resource

  • Click on the Resource tab

  • Click "+Add Resource"

  • Give your Resource a name and select "Azure Slurm V2"

  • Alternatively you can duplicate a resource already listed and then change the configurations (recommended)

We recommend using a low performance machine as your Controller resource:

  • Standard_D4_v3 (4 CPUs, 16 GB Memory, amd64)

Add Managed Disk

  • Click "Disks" on the left sidebar

  • Click "+Add Storage"

  • Name the disk, select "Azure Disk" from the icons on the right, and click "Add Storage" on the right

  • Select "1" for Zone, "Standard LRS" for Type, set the appropriate file size for your task, and save changes

  • To save costs the disk will be turned on/off only as needed

Add code in User Bootstrap

The User Bootstrap text field (found in the resource properties) is a place to add Linux command line operations that will be executed after the cluster is booted up.

        #echo "Install R packages .... Moved to snapshot";
        # Rprofile contains Rstudio start up  instructions to add path to spatial libraries
        cp /model/.Rprofile /home/Andrew.Beet/.Rprofile

This will copy a startup .Rprofile script (this allows R to use the underlying spatial Linux libraries proj, geos, and gdal from RStudio) from /model (the permanent storage location) to your user profile. You can also copy any other content you like.

You will need to create the .Rprofile file and save it into your /model folder if you are not part of the neus-atlantis group.

    # For gdal, geos and proj to work we prefix PKG_CONFIG_PATH and PATH 
    # https://support.posit.co/hc/en-us/articles/10784536440855?_gl=1*2zieqw*_ga*MTkyNDQzNDM0LjE2NTU0ODMyMTk.*_ga_2C0WZ1JHG0*MTY4NTU1OTEwMi4zLjEuMTY4NTU1OTExNS4wLjAuMA..*_ga_8PLL5FXR9M*MTY4NTU1OTEwMi4zLjEuMTY4NTU1OTExNS4wLjAuMA..

    # prepend the locations of the spatial libraries so R packages can find them
    temp_pkg_path <- Sys.getenv("PKG_CONFIG_PATH")
    new_pkg_path <- "/usr/gdal34/lib/pkgconfig:/usr/geos311/lib64/pkgconfig:/usr/proj81/lib/pkgconfig"

    if (nzchar(temp_pkg_path)) {
      Sys.setenv(PKG_CONFIG_PATH = paste0(new_pkg_path, ":", temp_pkg_path))
    } else {
      Sys.setenv(PKG_CONFIG_PATH = new_pkg_path)
    }

    temp_path <- Sys.getenv("PATH")
    new_path <- "/usr/gdal34/bin:/usr/geos311/bin:/usr/proj81/bin"

    if (nzchar(temp_path)) {
      Sys.setenv(PATH = paste0(new_path, ":", temp_path))
    } else {
      Sys.setenv(PATH = new_path)
    }
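
As a quick sanity check from the IDE terminal, you can confirm the spatial tools are discoverable (a sketch, assuming the same install locations as the .Rprofile above and that the usual *-config helpers were included with those builds):

    # prepend the same paths the .Rprofile uses
    export PATH=/usr/gdal34/bin:/usr/geos311/bin:/usr/proj81/bin:$PATH
    export PKG_CONFIG_PATH=/usr/gdal34/lib/pkgconfig:/usr/geos311/lib64/pkgconfig:/usr/proj81/lib/pkgconfig:$PKG_CONFIG_PATH

    gdal-config --version            # GDAL version, e.g. 3.4.x
    geos-config --version            # GEOS version
    pkg-config --modversion proj     # PROJ version (assumes a proj.pc file is provided)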

Create a compute partition

We separate the main "controller" machine from the cluster of machines that will do all of the computation. This cluster is configured as a separate partition, which we will call the compute partition.

In the resource properties

  • Click "+ partition"

  • Select Instance Type (This is the type of machine that will be booted up)

  • Max Nodes (This is the number of these machines that you want available)

  • Select "Local Disk NFS Export" under controller settings

Note: You do not get charged for any of these machines unless you send a job to the SLURM resource manager. At that point a node will be booted up and the job run. After the job has finished, the node will remain idle for a short time before being shut down.

Depending on the size of the job submitted to SLURM we recommend any of the following types:

  • Standard_D4_v3 (4 CPUs, 16 GB Memory, amd64)
  • Standard_HC44rs (44 vCPUs, 352 GB Memory, amd64)
  • Standard_HB120rs_v2 (120 vCPUs, 456 GB Memory, amd64)

Attach filesystems

All filesystems should be attached before booting the cluster to ensure they're present, though bucket/NFS attachments can still be changed afterwards.

  • Click on the resource definition

  • Click "Add filesystem", then select "Model" with the mount "/model"

  • Click "Add filesystem", then select "atlantisarchive" with the mount "/atlantisarchive"

  • Under "Attached Disks", click "Add Attached Disks", then select your managed disk with the mount "/atlantisdisk"

Boot the resource

On the compute tab

  • Click the big button. The resource will boot up. This can take 5-15 mins

Add and Run Rstudio workflow from marketplace

You will need to first get the workflow from the "marketplace", then apply it to the active resource.

  • Click your username -> marketplace

  • Click "Add Parallel workflow" button on the Rstudio workflow.

  • Click the Workflows tab and click the "heart" icon to favorite this workflow. It will now appear on your compute tab.

  • Click the compute tab and click the workflow.

  • You now need to configure this workflow to run on your active resource.

  • Click the small cloud icon (bottom right)

  • From the Default resource dropdown, select the resource name you just booted. Now hit X to close the window.

  • Click the "Execute" button

  • From the compute tab you should see the RSTUDIO_CLOUD workflow in a running status. After a minute or so click the blue eye icon to launch Rstudio.

NOTE: RStudio is launched inside a Gnome desktop environment. Within this environment is an application called tracker, which indexes all files to enable fast searching. For some reason the tracker database gets very large (100s of GB) when running many SLURM jobs. Because RStudio is launched via a workflow after the cluster has booted, we cannot add a command during boot up to disable the tracker, so each user must kill the tracker once RStudio is running.

  • On the IDE command line (not in RStudio) type: tracker daemon --kill. This should stop all indexing while you have the cluster up. This will need to be repeated EVERY time you boot a cluster and initiate the RStudio workflow.
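
A minimal sketch of that step, run from the IDE terminal (the tracker status subcommand is an assumption and may vary by tracker version):

    # stop the file indexer for this session
    tracker daemon --kill
    # optionally confirm nothing is still indexing (subcommand may differ by version)
    tracker status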

Using the IDE

Once the resource has fully booted, the "power" button next to the resource will turn bright green and a line (bright green) will appear under the resource. You will also see an ip address.

  • Click the ip address (this will copy it)

  • Click the "terminal" icon next to your username to open up the IDE

  • On the command line type ssh, paste the ip address, and hit return

You are now logged into the resource and can use any Linux command to explore the environment

Clone the neus-atlantis repo

The neus-atlantis dev branch is stored in the /model directory, which is permanent storage and is accessible to anyone in the atlantis cloud team. It can then be copied to your personal environment in the User Bootstrap script if desired.

If the dev branch is updated on GitHub, it will need to be pulled again.

     cd /model
     sudo git clone -b dev_branch --single-branch https://github.com/NOAA-EDAB/neus-atlantis.git;

This clones a single branch of the neus-atlantis repo into the /model/neus-atlantis folder. In this example it clones the dev_branch.
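
If the dev branch has moved on GitHub since the clone, the copy in /model can be updated rather than re-cloned (a minimal sketch):

    cd /model/neus-atlantis
    sudo git pull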

Run atlantis

Podman cannot be used on CentOS 7 in combination with SLURM. The version bundled with CentOS 7 (v1.6.4) cannot be successfully updated to a more recent version. Because of this there are issues relating to the queueing of jobs dependent on podman: when a resource becomes available, podman jobs all attempt to start at the same time and conflict with each other, and some jobs terminate. However this is unlikely to be an issue on Ubuntu, where podman can be updated to a version in which this bug has been resolved. Note: the version of Docker bundled with CentOS 7 is also dated (1.13.1) and updating to a newer version is cumbersome and has not been attempted. Like podman, this should not be an issue under Ubuntu.

Podman

  • You'll need to use sudo for all podman commands, eg sudo podman images

  • sudo podman run --rm -d --name scenario1 --mount "type=bind,src=/model/READ-EDAB_neus-atlantis/currentVersion/,dst=/app/model" --mount "type=bind,src=/atlantisdisk/First.Last/out1,dst=/app/model/output/" atlantis:6536

  • You'll need to have created the output folder for the run (in this case mkdir /atlantisdisk/First.Last/out1)

  • To create multiple folders at once, mkdir out{1..15} will create the directories out1 through out15

  • Unless you save output in the /model folder it will not persist after the cluster is shut down
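
To keep an eye on a detached run started with the podman command above, the standard podman commands apply (a sketch):

    # list running containers (scenario1 should appear while the run is active)
    sudo podman ps
    # follow the model's console output
    sudo podman logs -f scenario1
    # stop the run early if needed
    sudo podman stop scenario1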

Singularity

Pre v6554

Another container option is Singularity. This is the preferred option on CentOS 7. It is also supported on NOAA's HPC and cloud mounted volumes. Docker/Podman builds an image from a Dockerfile; Singularity builds an image from a recipe file. A Singularity image has a .sif extension and can be stored and shared like any other file.

Below are the contents of the Singularity recipe file. This will need to be built. Note the path to the source code under the %files section: you will need to have the source code (version 6536 in this example) residing in this location if you want to build the image yourself. Save the contents in a file called Singularity.

    Bootstrap: docker
    From: debian:buster

    %help
      Atlantis v6536 model

    %labels
      Author andrew.beet@noaa.gov
  
    %files
      /model/atlantisCode/v6536/atlantis /app/atlantis
      /model/atlantisCode/v6536/svn /app/.svn

    %post
      apt-get update && apt-get install -yq build-essential autoconf libnetcdf-dev libxml2-dev libproj-dev subversion valgrind dos2unix nano
      cd /app/atlantis
      aclocal && autoheader && autoconf && automake -a && ./configure && make && make install
      mkdir /app/model
  
    %runscript
      cd /app/model 
      ./RunAtlantis.sh
  
    %startscript
     cd /app/model
     ./RunAtlantis.sh

Build the image from the Singularity recipe on the command line. The following will create a singularity image called atlantis6536.sif

sudo singularity build atlantis6536.sif Singularity

You will find the atlantis6536.sif file and the Singularity recipe file in the /model/atlantisCode folder

  • sudo singularity run --bind /model/READ-EDAB_neus-atlantis/currentVersion:/app/model,/atlantisdisk/First.Last/out1:/app/model/output /model/atlantisCode/atlantis6536.sif (By default this will run RunAtlantis.sh found in currentVersion)

  • sudo singularity exec --bind /model/READ-EDAB_neus-atlantis/currentVersion:/app/model,/atlantisdisk/First.Last/out1:/app/model/output /model/atlantisCode/atlantis6536.sif /app/model/RunAtlantis2.sh (this will run an alternative shell script mounted to the container from currentVersion)

    NOTE: singularity run vs singularity exec. singularity run runs the default script baked into the image; singularity exec allows you to pass a script to override the default.

  • You'll need to have created the output folder for the run (in this case mkdir /atlantisdisk/First.Last/out1)

  • To create multiple folders at once, mkdir out{1..15} will create the directories out1 through out15

  • Unless you save output in the /atlantisdisk folder it will not persist after the cluster is shut down
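
To confirm which Atlantis version a given .sif was built from, the metadata baked in via the %help and %labels sections above can be read back (a sketch):

    # print the labels and help text stored in the image
    singularity inspect /model/atlantisCode/atlantis6536.sif
    # show the default runscript the image will execute under "singularity run"
    singularity inspect --runscript /model/atlantisCode/atlantis6536.sif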

v6554 - v6668

    Bootstrap: docker
    From: ubuntu:18.04

    %help
      Atlantis v???? model

    %labels
      Author ????
  
    %files
      /model/atlantisCode/v6665/atlantis /app/atlantis
      /model/atlantisCode/v6665/svn /app/.svn

    %post
      apt-get update && apt-get install -yq build-essential autoconf libnetcdf-dev libxml2-dev libproj-dev subversion valgrind dos2unix nano
      cd /app/atlantis
      aclocal && autoheader && autoconf && automake -a && ./configure && make && make install
      mkdir /app/model
  
    %runscript
      cd /app/model 
      ./RunAtlantis.sh
  
    %startscript
     cd /app/model
     ./RunAtlantis.sh

v6669

    Bootstrap: docker
    From: ubuntu:18.04

    %help
      Atlantis v???? model

    %labels
      Author ????
      
    %environment
      TZ=UTC
      DEBIAN_FRONTEND=noninteractive
  
    %files
      /model/atlantisCode/v6681/atlantis /app/atlantis
      /model/atlantisCode/v6681/svn /app/.svn

    %post
      export TZ=UTC
      export DEBIAN_FRONTEND=noninteractive
      apt-get update && apt-get install -yq build-essential autoconf libnetcdf-dev libxml2-dev libproj-dev subversion valgrind dos2unix nano r-base-core
      cd /app/atlantis
      aclocal && autoheader && autoconf && automake -a && ./configure --enable-rassesslink && make && make install
      mkdir /app/model
  
    %runscript
      cd /app/model 
      ./RunAtlantis.sh
  
    %startscript
     cd /app/model
     ./RunAtlantis.sh

Using SLURM

To send a job to the "compute" partition and boot up additional nodes (other than the main controller node) you can use the SLURM resource manager (Simple Linux Utility for Resource Management)

  • Configure your resource to have an additional partition. Name it "compute" and select a resource. Both the 44 core machine Standard_HC44rs and the 120 core machine Standard_HB120rs_v2 have been tested and are recommended. Then select the number of nodes (how many of this type of resource you want to make available). For example, if you want to submit 500 atlantis runs to SLURM then you'll need at least 500 cores; using the Standard_HB120rs_v2 machine you will need to select 5 of them to accommodate all of the runs (a quick way to compute the node count is sketched below).
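
The node count is just the ceiling of runs divided by cores per node; a minimal shell-arithmetic sketch:

    runs=500
    cores_per_node=120          # Standard_HB120rs_v2
    echo $(( (runs + cores_per_node - 1) / cores_per_node ))   # prints 5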

A single run

  • Create a file with the .sh extension (for this example, job.sh) and copy the following

    For Podman:

     #!/bin/bash
     sudo podman run --rm --name scenario1 --mount "type=bind,src=/model/neus-atlantis/currentVersion/,dst=/app/model" --mount "type=bind,src=/atlantisdisk/$USER/out1,dst=/app/model/output/" atlantis:6536

    For Singularity:

     #!/bin/bash
     cd /app/model
     sudo singularity exec --bind /model/$USER/READ-EDAB-neusAtlantis/currentVersion:/app/model,/atlantisdisk/out1:/app/model/output /model/atlantisCode/atlantis6536.sif

  • On the command line type

     sbatch -N 1 job.sh

This will request a single node from the partition and run the commands found in job.sh. If this is run again it will run an additional job on the next free core on the node. If all cores are used, a new node will be launched. Use the commands sinfo, scancel, and squeue to inspect resource usage and cancel jobs.

However using this method to submit many jobs (where an additional node is required) is not recommended.
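
The inspection commands mentioned above look like this in practice (a sketch; replace <jobid> with a real job id from squeue):

    squeue -u $USER        # your queued and running jobs
    sinfo -p compute       # state of the compute partition nodes
    scancel <jobid>        # cancel a specific job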

Multiple runs

In practice we want to submit multiple jobs, each with a different set of input files and each outputting model results to its own folder. For example, suppose we want 500 model runs, each one with different initial starting value scalars (init_scalar in the at_run.prm file).

  • Create the 500 at_run.prm files and save them in the currentVersion folder (eg, at_run1.prm, ..., at_run500.prm)

  • The Singularity image located in /model/atlantisCode/atlantis6536.sif can take a shell script as an argument. This script will be executed instead of the default script (RunAtlantis.sh) embedded in the container.

  • Create a RunAtlantis.sh file for each run (eg RunAtlantis1.sh, ..., RunAtlantis500.sh). Each of these shell scripts will have the form:

     #!/bin/bash
     cd /app/model
    
     # find /app/model -type f | xargs dos2unix   (this line is no longer needed in the shell script)
    
     atlantisMerged -i neus_init.nc 0 -o neus_output.nc -r at_runxxx.prm -f at_force_LINUX.prm -p at_physics.prm -b at_biology.prm -h at_harvest.prm -e at_economics.prm -s neus_groups.csv -q neus_fisheries.csv -t . -d output

Note the addition of the line cd /app/model. This is required for Singularity because, unlike Podman/Docker, you cannot specify a working directory inside the container.

  • Create a runjob.sh specifying an array of 500. The runjob.sh will have the form
     #!/bin/bash
     #SBATCH --mail-type=ALL
     #SBATCH --mail-user=me@work.com
     #SBATCH --nodes=1
     #SBATCH --partition=compute
     #SBATCH --array=1-500
    
     sudo mkdir -p /atlantisdisk/slurm_array/out$SLURM_ARRAY_TASK_ID
    
     sudo singularity exec --bind /model/$USER/READ-EDAB-neusAtlantis/currentVersion:/app/model,/atlantisdisk/slurm_array/out$SLURM_ARRAY_TASK_ID:/app/model/output /model/atlantisCode/atlantis6536.sif /app/model/RunAtlantis$SLURM_ARRAY_TASK_ID.sh

Note: the full path to the sif file needs to be included. The path to the location of the RunAtlantis.sh INSIDE the container needs to be specified.

  • Submit the batch (runjob.sh) to be managed by SLURM
    sbatch runjob.sh
  • Monitor batch progress using the squeue and sinfo command line functions. To view run-specific progress you can access the slurm-xx.out logs from the Parallel Works IDE. These can be found in the directory that contains your output folder. Select the slurm job out file. This will contain all of the information that atlantis would print to standard out.

  • The --mail-type and --mail-user flags are optional. You will get an email notification when the job begins and ends

In the runjob.sh file above, the --array flag specifies the number of runs to execute and --partition specifies the name of the partition on the cluster. The sudo mkdir line will create all 500 output directories with the names out1, ..., out500. The singularity line will run each of the 500 jobs using a separate RunAtlantisxxx.sh script and output results into the corresponding outxxx directory. The variable $SLURM_ARRAY_TASK_ID takes the values 1-500; the loop is implicit.
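
To follow one array task's log while it runs (a sketch; by default SLURM names array logs slurm-<arrayJobID>_<taskID>.out, and the job and task ids below are just placeholders):

    # e.g. follow task 17 of array job 123
    tail -f slurm-123_17.out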

Writing to bucket for export

The managed disks are intended to be turned on and off as needed. When a task is complete and the files are intended to be exported off of PW, they need to be copied to the atlantisarchive bucket. For example

cp -R /atlantisdisk/folder /atlantisarchive/export/

Example R script to generate the RunAtlantisxxx.sh files

    nfolders <- 500
    for (ifolder in 1:nfolders) {
    
      # create Atlantis sh script
      filenm <- paste0("/model/READ-EDAB_neus-atlantis/currentVersion/RunAtlantis",ifolder,".sh")
      fileConn<-file(filenm,open="w")
      cat("#!/bin/bash\n",file=fileConn,append=T)
      cat("cd /app/model\n",file=fileConn,append=T)
      cat(paste0("atlantisMerged -i neus_init.nc 0 -o neus_output.nc -r at_run",ifolder,".prm -f at_force_LINUX.prm -p at_physics.prm -b at_biology.prm -h at_harvest.prm -e at_economics.prm -s neus_groups.csv -q neus_fisheries.csv -t . -d output\n"),file = fileConn,append=T)
      close(fileConn)
      system(paste0("chmod 775 ",filenm))
    }
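
If that script is saved as, say, make_run_scripts.R (a hypothetical filename and location), it can be run once from the IDE terminal to generate all 500 shell scripts:

    sudo Rscript /model/make_run_scripts.R
    # count the generated scripts (the original RunAtlantis.sh, if present, adds one more)
    ls /model/READ-EDAB_neus-atlantis/currentVersion/RunAtlantis*.sh | wc -l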


Processing the output

We can use an additional partition on the cluster (named process) to do all of the processing of the Atlantis output. Like the compute partition, we can select several of the 120-core machines. The partition is required to have R installed along with all of the necessary dependencies.

Continuing from the example above, all of the output is stored in the folders out1, ..., out500 under /atlantisdisk/$USER/slurm_array. To process the data we create the shell script processjob.sh

     #!/bin/bash
     #SBATCH --nodes=1
     #SBATCH --partition=process
     #SBATCH --array=1-500
    
     Rscript --no-restore --no-save /atlantisdisk/test_processing.r out$SLURM_ARRAY_TASK_ID

Note: The work will be sent to the partition called process, and each job in the array of 500 will be run on a separate node. The R script test_processing.r will be run with an argument: the name of the folder to be processed.

  • test_processing.r will look something like this
    arg <- commandArgs(trailingOnly=T)
    runname<- arg[1]
    print(runname)

    run.prefix <- "neus_output"
    atl.dir <- file.path("/atlantisdisk/First.Last/slurm_array",runname)
    param.dir <- "path to parameterfiles for this run"
    fig.dir <- file.path(atl.dir,"Post_Processed")
    out.dir <- file.path(fig.dir,"Data")

    param.ls <- atlantisprocessing::get_atl_paramfiles(...)

    atlantisprocessing::get_atl_paramfiles(...)

    atlantisprocessing::make_atlantis_diagnostic_figures(...)
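
Submit the processing batch the same way as the model runs (a sketch, assuming processjob.sh is saved somewhere readable from the controller, e.g. /atlantisdisk):

    sbatch /atlantisdisk/processjob.sh
    squeue -p process      # watch the processing array drain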