Cloud setup
NOAA RDHPCS Cloud access is via a gateway developed by Parallel Works. Currently the underlying operating system on Azure is RHEL7/CentOS 7, which reaches end of life in June 2024. "There is no plan for OS upgrade until 09/30/2023. We have suggested an OS version upgrade to RL8, which is being reviewed by NOAA RDHPCS." Once the OS is updated, some of the setup below will break (since it depends on CentOS 7).
- Log into Parallel Works
- Create a snapshot
- Create a Resource
- Add code in the User Bootstrap section of the resource
- Create a compute partition (to run the model)
- Boot the resource
- Add/Run the RStudio workflow from the marketplace
- Using the Parallel Works IDE
- Clone the neus-atlantis repo in /model
- Run Atlantis
Parallel Works requires a CAC card and an account on the cloud. To request an account you'll need to log in to AIM, the Account Information Management System.
A snapshot results in an image that is used to spin up your resource. By including installation code in the snapshot, you avoid having to install software on the resource once it is up and running.
- Click on yourName -> account -> cloud snapshots
- Create a new snapshot
  - Type: Microsoft Azure
  - Snapshot Region: eastus
  - Base Image: Rocky 8
  - Name: Give the snapshot a name
- Add the following code in the Build script field
sudo dnf -y install netcdf-devel
sudo dnf -y install nco
# install podman and pull images
sudo dnf -y install runc
sudo dnf -y install podman --allowerasing;
sudo podman login --username=andybeet --password=secretKey docker.io;
sudo podman pull docker.io/andybeet/atlantis:6536;
sudo podman pull docker.io/andybeet/atlantis:6665;
sudo podman pull docker.io/andybeet/atlantis:6681;
# required for some other R packages
sudo dnf -y install fontconfig-devel
sudo dnf -y install proj-devel
sudo dnf -y install gdal-devel
sudo dnf -y install geos-devel
sudo dnf -y install harfbuzz-devel
sudo dnf -y install fribidi-devel
sudo dnf -y install libjpeg-devel
# install packages ahead of time
sudo /usr/bin/Rscript -e "install.packages('remotes', repos='http://cran.rstudio.com/')";
sudo /usr/bin/Rscript -e "remotes::install_github('NOAA-EDAB/atlantisprocessing')";
sudo /usr/bin/Rscript -e "remotes::install_github('NOAA-EDAB/atlantisdiagnostics')";
This script installs netcdf, nco, podman, and several development libraries needed by R spatial packages. It then logs in to Docker Hub and pulls the atlantis images for models 6536, 6665, and 6681. R (version 4.2.2) and RStudio are also installed, and the remotes, atlantisprocessing, and atlantisdiagnostics R packages are pre-installed. Note: you'll need to substitute your own secretKey.
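Once a resource built from this snapshot is up, you can quickly confirm the images were baked in (a minimal check, assuming the snapshot built successfully):
sudo podman images    # should list docker.io/andybeet/atlantis with tags 6536, 6665 and 6681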
- Click "Save snapshot config"
- Click "Provision snapshot" (this will create the image)
- Confirm it built successfully. This can take up to an hour
- Click on the Resource tab
- Click "+Add Resource"
- Give your Resource a name and select "Azure Slurm V2"
- Alternatively, you can duplicate a resource already listed and then change the configuration (recommended)

We recommend using a low performance machine as your Controller resource:
- Standard_D4_v3 (4 CPUs, 16 GB Memory, amd64)
- Click "Disks" on the left sidebar
- Click "+Add Storage"
- Name the disk, select "Azure Disk" from the icons on the right, and click "Add Storage"
- Select "1" for Zone, "Standard LRS" for Type, set the appropriate disk size for your task, and save changes
- To save costs, the disk will be turned on/off only as needed
The User Bootstrap text field (found in the resource properties) is a location to add Linux command line operations that will be executed after the cluster is booted.
#echo "Install R packages .... Moved to snapshot";
# Rprofile contains Rstudio start up instructions to add path to spatial libraries
cp /model/.Rprofile /home/Andrew.Beet/.Rprofile
This will copy a startup .Rprofile script (which allows R to use the underlying spatial Linux libraries proj, geos, and gdal from RStudio) from /model (a permanent storage location) to your user profile. You can also copy any other content you like. You will need to create the .Rprofile file and save it into your /model folder if you are not part of the neus-atlantis group.
# For gdal, geos and proj to work we prefix PKG_CONFIG_PATH and PATH
# https://support.posit.co/hc/en-us/articles/10784536440855
temp_pkg_path <- Sys.getenv("PKG_CONFIG_PATH")
new_pkg_path <- "/usr/gdal34/lib/pkgconfig:/usr/geos311/lib64/pkgconfig:/usr/proj81/lib/pkgconfig"
if (!is.na(temp_pkg_path) && temp_pkg_path != '') {
  Sys.setenv(PKG_CONFIG_PATH=paste0(new_pkg_path,":",temp_pkg_path))
} else {
  Sys.setenv(PKG_CONFIG_PATH=new_pkg_path)
}
temp_path <- Sys.getenv("PATH")
new_path <- "/usr/gdal34/bin:/usr/geos311/bin:/usr/proj81/bin"
if (!is.na(temp_path) && temp_path != '') {
  Sys.setenv(PATH=paste0(new_path,":",temp_path))
} else {
  Sys.setenv(PATH=new_path)
}
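To confirm the paths took effect, you can run a quick check from a terminal in the RStudio session (a minimal sketch; it assumes the spatial libraries are installed under /usr/gdal34 and /usr/geos311 as above):
gdalinfo --version       # should report GDAL 3.4.x if /usr/gdal34/bin is on the PATH
geos-config --version    # should report GEOS 3.11.x if /usr/geos311/bin is on the PATH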
We separate the main "controller" machine from the cluster of machines that will do all of the computation. This cluster of machines is configured as a separate partition, which we will name the compute partition.
In the resource properties:
- Click "+ partition"
- Select Instance Type (this is the type of machine that will be booted up)
- Set Max Nodes (this is the number of these machines that you want available)
- Select "Local Disk NFS Export" under controller settings
Note: You do not get charged for any of these machines unless you send a job to the SLURM resource manager. At that point a node will be booted up and the job run. After the job has finished, the node will remain idle for a short time before being shut down.
Depending on the size of the job submitted to SLURM we recommend any of the following types:
- Standard_D4_v3 (4 CPUs, 16 GB Memory, amd64)
- Standard_HC44rs (44 vCPUs, 352 GB Memory, amd64)
- Standard_HB120rs_v2 (120 vCPUs, 456 GB Memory, amd64)
All filesystems should be attached before booting the cluster to ensure that they're present, though changes to buckets/NFS can be made afterwards.
- Click on the resource definition
- Click "Add filesystem", then select "Model" with the mount "/model"
- Click "Add filesystem", then select "atlantisarchive" with the mount "/atlantisarchive"
- Under "Attached Disks", click "Add Attached Disks", then select your managed disk with the mount "/atlantisdisk"
On the compute tab:
- Click the big (power) button. The resource will boot up. This can take 5-15 minutes
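Once the cluster is up, you can confirm that the filesystems and managed disk attached correctly from a terminal on the controller (a minimal check, assuming the mount points above):
df -h /model /atlantisarchive /atlantisdisk    # each mount should appear with its size and usage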
You will need to first get the workflow from the "marketplace", then apply it to the active resource.
- Click your username -> marketplace
- Click the "Add Parallel workflow" button on the RStudio workflow
- Click the Workflows tab and click the "heart" icon to favorite this workflow. It will now appear on your compute tab
- Click the compute tab and click the workflow
- You now need to configure this workflow to run on your active resource
- Click the small cloud icon (bottom right)
- From the Default resource dropdown, select the resource name you just booted, then hit X to close the window
- Click the "Execute" button
- From the compute tab you should see the RSTUDIO_CLOUD workflow in a running status. After a minute or so, click the blue eye icon to launch RStudio
NOTE: RStudio is launched inside a Gnome desktop environment, which includes an application called tracker. This indexes all files to enable fast searching. For some reason the tracker database gets very large (100s of GB) when running many SLURM jobs. Because RStudio is launched via a workflow after the cluster has booted, we cannot add a command during boot up to disable the tracker, so each user must kill the tracker once RStudio is running.
- On the IDE command line (not in RStudio) type: tracker daemon --kill
This should stop all indexing while you have the cluster up. This will need to be repeated EVERY time you boot a cluster and initiate the RStudio workflow.
Once the resource has fully booted, the "power" button next to the resource will turn bright green and a bright green line will appear under the resource. You will also see an IP address.
- Click the IP address (this will copy it)
- Click the "terminal" icon next to your username to open the IDE
- On the command line type ssh, paste the IP address, and hit return

You are now logged into the resource and can use any Linux command to explore the environment.
The neus-atlantis dev branch is stored in the /model directory, which is permanent storage and is accessible to anyone in the atlantis cloud team. It can then be copied to your personal environment in the User Bootstrap script if desired. If the dev branch is updated on GitHub, it will need to be pulled again.
cd /model
sudo git clone -b dev_branch --single-branch https://github.com/NOAA-EDAB/neus-atlantis.git;
This will clone a single branch of the neus-atlantis repo (in this example, dev_branch) into the /model/neus-atlantis folder.
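To refresh an existing clone after the branch has been updated on GitHub, a pull is sufficient (a minimal sketch, assuming the clone lives at /model/neus-atlantis):
cd /model/neus-atlantis
sudo git pull    # fetches and merges the latest dev_branch commits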
Podman cannot be used on CentOS 7 in combination with SLURM. The version bundled with CentOS 7 (v1.6.4) cannot be successfully updated to a more recent version. Because of this there are issues with queueing jobs that depend on podman: when a resource becomes available, the podman jobs all attempt to start at the same time and conflict with each other, and some jobs terminate. This is unlikely to be an issue on Ubuntu, where podman can be updated to a version in which this bug has been resolved. Note: the version of Docker bundled with CentOS 7 is also dated (1.13.1), and updating to a newer version is cumbersome and has not been attempted. Like podman, this should not be an issue under Ubuntu.
- You'll need to use sudo for all podman commands, e.g. sudo podman images
- sudo podman run --rm -d --name scenario1 --mount "type=bind,src=/model/READ-EDAB_neus-atlantis/currentVersion/,dst=/app/model" --mount "type=bind,src=/atlantisdisk/First.Last/out1,dst=/app/model/output/" atlantis:6536
- You'll need to have created the output folder for the run (in this case mkdir /atlantisdisk/First.Last/out1)
- To create multiple folders, mkdir out{1..15} will create multiple directories
- Unless you save output in the /model folder it will not persist after the cluster is shut down
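Because the container above runs detached (-d), it can be useful to check on it while the model runs. A couple of standard podman commands (a sketch; the container name scenario1 matches the run command above):
sudo podman ps                   # list running containers
sudo podman logs -f scenario1    # follow the model's standard output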
Another container option is Singularity. This is the preferred option on Linux CentOS 7 and is also supported on NOAA's HPC and cloud mounted volumes. Whereas Docker/Podman builds an image from a Dockerfile, Singularity builds an image from a recipe file. A Singularity image has an .sif extension and can be stored and shared like any other file.
Below are the contents of the Singularity recipe file. This will need to be built. Note the path to the source code under the %files section: you will need to have the source code (version 6536 in this example) residing in this location if you want to build the image yourself.
Save the contents in a file called Singularity
Bootstrap: docker
From: debian:buster
%help
Atlantis v6536 model
%labels
Author andrew.beet@noaa.gov
%files
/model/atlantisCode/v6536/atlantis /app/atlantis
/model/atlantisCode/v6536/svn /app/.svn
%post
apt-get update && apt-get install -yq build-essential autoconf libnetcdf-dev libxml2-dev libproj-dev subversion valgrind dos2unix nano
cd /app/atlantis
aclocal && autoheader && autoconf && automake -a && ./configure && make && make install
mkdir /app/model
%runscript
cd /app/model
./RunAtlantis.sh
%startscript
cd /app/model
./RunAtlantis.sh
Build the image from the recipe on the command line. The following will create a Singularity image called atlantis6536.sif
sudo singularity build atlantis6536.sif Singularity
You will find the atlantis6536.sif file and the Singularity recipe file in the /model/atlantisCode folder.
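If you want to check what an existing .sif was built with before running it, singularity inspect can display its metadata (a minimal sketch; exact flags may vary with the installed Singularity version):
singularity inspect /model/atlantisCode/atlantis6536.sif               # labels (author, etc.)
singularity inspect --runscript /model/atlantisCode/atlantis6536.sif   # the default run script baked into the image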
- sudo singularity run --bind /model/READ-EDAB_neus-atlantis/currentVersion:/app/model,/atlantisdisk/First.Last/out1:/app/model/output /model/atlantisCode/atlantis6536.sif (by default this will run RunAtlantis.sh found in currentVersion)
- sudo singularity exec --bind /model/READ-EDAB_neus-atlantis/currentVersion:/app/model,/atlantisdisk/First.Last/out1:/app/model/output /model/atlantisCode/atlantis6536.sif /app/model/RunAtlantis2.sh (will run an alternative shell script mounted into the container from currentVersion)
NOTE: singularity run vs singularity exec: singularity run runs the default script in the image, while singularity exec allows you to pass a script to override the default.
- You'll need to have created the output folder for the run (in this case mkdir /atlantisdisk/First.Last/out1)
- To create multiple folders, mkdir out{1..15} will create multiple directories
- Unless you save output in the /atlantisdisk folder it will not persist after the cluster is shut down
The recipe for a later Atlantis version (v6665, based on the %files paths below) follows the same pattern:
Bootstrap: docker
From: ubuntu:18.04
%help
Atlantis v???? model
%labels
Author ????
%files
/model/atlantisCode/v6665/atlantis /app/atlantis
/model/atlantisCode/v6665/svn /app/.svn
%post
apt-get update && apt-get install -yq build-essential autoconf libnetcdf-dev libxml2-dev libproj-dev subversion valgrind dos2unix nano
cd /app/atlantis
aclocal && autoheader && autoconf && automake -a && ./configure && make && make install
mkdir /app/model
%runscript
cd /app/model
./RunAtlantis.sh
%startscript
cd /app/model
./RunAtlantis.sh
A third recipe (v6681) additionally installs r-base-core and builds with --enable-rassesslink; it also sets TZ and DEBIAN_FRONTEND so the apt install can run non-interactively:
Bootstrap: docker
From: ubuntu:18.04
%help
Atlantis v???? model
%labels
Author ????
%environment
TZ=UTC
DEBIAN_FRONTEND=noninteractive
%files
/model/atlantisCode/v6681/atlantis /app/atlantis
/model/atlantisCode/v6681/svn /app/.svn
%post
export TZ=UTC
export DEBIAN_FRONTEND=noninteractive
apt-get update && apt-get install -yq build-essential autoconf libnetcdf-dev libxml2-dev libproj-dev subversion valgrind dos2unix nano r-base-core
cd /app/atlantis
aclocal && autoheader && autoconf && automake -a && ./configure --enable-rassesslink && make && make install
mkdir /app/model
%runscript
cd /app/model
./RunAtlantis.sh
%startscript
cd /app/model
./RunAtlantis.sh
To send a job to the "compute" partition and boot up additional nodes (other than the main controller node) you can use the SLURM resource manager (Simple Linux Utility for Resource Management)
- Configure your resource to have an additional partition. Name it "compute" and select a resource. Both the 44 core machine Standard_HC44rs and the 120 core machine Standard_HB120rs_v2 have been tested and are recommended. Then select the number of nodes (how many of this type of resource you want to make available). For example, if you want to submit 500 atlantis runs to SLURM then you'll need at least 500 cores; using the Standard_HB120rs_v2 machine you will need to select 5 of them to accommodate all of the runs (500/120 ≈ 4.2, rounded up to 5).
- Create a file with the .sh extension (for this example, job.sh) and copy the following
For Podman:
#!/bin/bash
sudo podman run --rm --name scenario1 --mount "type=bind,src=/model/neus-atlantis/currentVersion/,dst=/app/model" --mount "type=bind,src=/atlantisdisk/$USER/out1,dst=/app/model/output/" atlantis:6536
For Singularity:
#!/bin/bash
cd /app/model
sudo singularity exec --bind /model/$USER/READ-EDAB-neusAtlantis/currentVersion:/app/model,/atlantisdisk/out1:/app/model/output /model/atlantisCode/atlantis6536.sif
- On the command line type sbatch -N 1 job.sh
This will request a single node from the partition and run the commands found in job.sh. If this is run again it will run an additional job on the next free core on the node; if all cores are used, a new node will be launched. Use the commands sinfo, scancel, and squeue to inspect resource usage and cancel jobs (see the examples below). However, using this method to submit many jobs (where an additional node is required) is not recommended.
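A few typical invocations (a sketch; the job id passed to scancel is whatever sbatch/squeue reports):
sinfo              # state of the compute partition and its nodes
squeue -u $USER    # your queued and running jobs
scancel 12345      # cancel a job by its job id (hypothetical id)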
In practice we want to submit multiple jobs, each job with a different set of input files, and each job outputting model results to its own folder. For example, suppose we want 500 model runs, each one with different initial starting value scalars (init_scalar in the at_run.prm file).
- Create the 500 at_run.prm files and save them in the currentVersion folder (eg, at_run1.prm, ..., at_run500.prm)
- The Singularity image located in /model/atlantisCode/atlantis6536.sif can take a shell script as an argument. This script will be executed instead of the default script (RunAtlantis.sh) embedded in the container.
- Create a RunAtlantis.sh file for each run (eg RunAtlantis1.sh, ..., RunAtlantis500.sh). Each of these shell scripts will have the form:
#!/bin/bash
cd /app/model
# find /app/model -type f | xargs dos2unix   (this line is no longer needed in the shell script)
atlantisMerged -i neus_init.nc 0 -o neus_output.nc -r at_runxxx.prm -f at_force_LINUX.prm -p at_physics.prm -b at_biology.prm -h at_harvest.prm -e at_economics.prm -s neus_groups.csv -q neus_fisheries.csv -t . -d output
Note the addition of the line cd /app/model. This is required for Singularity because, unlike Podman/Docker, you cannot specify a working directory inside the container.
- Create a runjob.sh specifying an array of 500. The runjob.sh will have the form:
#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@work.com
#SBATCH --nodes=1
#SBATCH --partition=compute
#SBATCH --array=1-500
sudo mkdir -p /atlantisdisk/slurm_array/out$SLURM_ARRAY_TASK_ID
sudo singularity exec --bind /model/$USER/READ-EDAB-neusAtlantis/currentVersion:/app/model,/atlantisdisk/slurm_array/out$SLURM_ARRAY_TASK_ID:/app/model/output /model/atlantisCode/atlantis6536.sif /app/model/RunAtlantis$SLURM_ARRAY_TASK_ID.sh
Note: the full path to the sif file needs to be included, and the path to the RunAtlantis.sh script INSIDE the container needs to be specified.
- Submit the batch (runjob.sh) to be managed by SLURM:
sbatch runjob.sh
- Monitor batch progress using the squeue and sinfo command line functions. To view run-specific progress you can access the slurm-xx.out logs from the Parallel Works IDE. These can be found in the directory that contains your output folder; select the slurm job out file. This will contain all of the information that atlantis would print to standard out (see the example after this list).
- The --mail-type and --mail-user options are optional. You will get email notification when the job begins and ends.
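To follow an individual run in real time you can tail its SLURM log (a sketch; with a job array the default log name is slurm-<jobid>_<arraytaskid>.out, so the ids below are illustrative):
tail -f slurm-12345_7.out    # live standard output of array task 7 of job 12345 (hypothetical ids)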
In the runjob.sh file above, the --array flag specifies the number of runs to execute and --partition specifies the name of the partition on the cluster. The sudo mkdir line will create all 500 output directories with the names out1, ..., out500. The singularity line will run each of the 500 jobs using separate RunAtlantisxxx.sh scripts, outputting results into the corresponding outxxx directories. The variable $SLURM_ARRAY_TASK_ID takes the values 1-500; the loop is implicit.
The managed disks are intended to be turned on and off as needed. When a task is complete and the files are intended to be exported off of PW, they need to be copied to the atlantisarchive bucket. For example:
cp -R /atlantisdisk/folder /atlantisarchive/export/
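A quick way to sanity check the copy is to compare sizes before and after (a sketch, using the example folder above):
du -sh /atlantisdisk/folder /atlantisarchive/export/folder    # the two totals should match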
The 500 RunAtlantisxxx.sh scripts referenced above can be generated with a short R loop, for example:
nfolders <- 500
for (ifolder in 1:nfolders) {
# create Atlantis sh script
filenm <- paste0("/model/READ-EDAB_neus-atlantis/currentVersion/RunAtlantis",ifolder,".sh")
fileConn<-file(filenm,open="w")
cat("#!/bin/bash\n",file=fileConn,append=T)
cat("cd /app/model\n",file=fileConn,append=T)
cat(paste0("atlantisMerged -i neus_init.nc 0 -o neus_output.nc -r at_run",ifolder,".prm -f at_force_LINUX.prm -p at_physics.prm -b at_biology.prm -h at_harvest.prm -e at_economics.prm -s neus_groups.csv -q neus_fisheries.csv -t . -d output\n"),file = fileConn,append=T)
close(fileConn)
system(paste0("chmod 775 ",filenm))
}
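Saved as, say, make_run_scripts.R (a hypothetical file name), the loop can be run once from the IDE command line:
sudo Rscript --no-restore --no-save make_run_scripts.R    # writes RunAtlantis1.sh ... RunAtlantis500.sh into currentVersion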
We can use an additional partition on the cluster (named process) to do all of the processing of the Atlantis output. Like the compute partition, we can select several 120-core machines. The partition is required to have R installed along with all of the necessary dependencies.
Continuing from the example above, all of the output is stored in the folders out1, ..., out500 under /atlantisdisk/$USER/slurm_array. To process the data we create the shell script processjob.sh:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=process
#SBATCH --array=1-500
Rscript --no-restore --no-save /atlantisdisk/test_processing.r out$SLURM_ARRAY_TASK_ID
Note: the work will be sent to the partition called process, and each job in the array of 500 will be run on a separate node. The R script test_processing.r will be run with an argument: the name of the folder to be processed.
- test_processing.r will look something like this:
arg <- commandArgs(trailingOnly=T)
runname<- arg[1]
print(runname)
run.prefix <- "neus_output"
atl.dir <- file.path("/atlantisdisk/First.Last/slurm_array",runname)
param.dir <- "path to parameterfiles for this run"
fig.dir <- file.path(atl.dir,"Post_Processed")
out.dir <- file.path(fig.dir,"Data")
param.ls <- atlantisprocessing::get_atl_paramfiles(...)
atlantisprocessing::get_atl_paramfiles(...)
atlantisprocessing::make_atlantis_diagnostic_figures(...)
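As with the model runs, the processing array is submitted to SLURM with sbatch and can be monitored with squeue (a sketch, assuming processjob.sh sits in your working directory):
sbatch processjob.sh
squeue -u $USER    # watch the 500 processing tasks work through the queue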