Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ecflow package for wcoss2 GFS transition #555

Merged

Conversation

lgannoaa
Copy link
Contributor

@lgannoaa lgannoaa commented Jan 7, 2022

This PR:

  • No need to review obsproc related files and link under ecflow. NCO will work on new ops version.
  • Updates the 4 suite definition files prodXX.def; XX=00,06,12,18
  • Includes updates to facilitate developer testing by specifying developer values for variables such as QUEUE,
    QUEUE_TRANSFER, QUEUE_SHARED and COM
  • Modify module position in each ecflow job using the same position from
    $HOMEgfs/modulefiles/module_base.wcoss2.lua
  • Adding envir-p1.h for wcoss2 environment
  • Include head.h for reference developer setting
  • Change in ecflow script:
    • Updates the ecflow scripts to consistent with GFS module position
    • Using wcoss2 environment
    • Using PBS for WCOSS2
    • Using updated module
    • Move enkf out of gdas and rename it to enkfgdas.
      Include all ecflow definition files job name
      Include all ecflow scripts name and job/log name
    • Move "model=gfs" to the top on each job except all jobs under obsproc.
      obsproc will no longer be part of GFS. Therefore leave it without change for testing purpose.
    • Remove the source of model_ver from each ecflow script except all jobs under obsproc.
      obsproc will no longer be part of GFS. Therefore leave it without change for testing purpose.
      This work is part of WCOSS2 Migration and Porting #398

Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed upto the file:
ecf/scripts/gfs/atmos/post_processing/bufr_sounding/jgfs_atmos_postsnd.ecf
But not head.h as it is very complicated.

Also PR #553 has been merged, so please rebase this PR based on current feature/ops-wcoss2

Please see comments.

Comment on lines 10 to 11
#module load prod_envir/2.0.4 prod_util/2.0.9 envvar/1.0
echo debug
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented and unnecessary echo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ecf/scripts/gdas/atmos/gempak/jgdas_atmos_gempak.ecf Outdated Show resolved Hide resolved
@@ -43,6 +37,7 @@ module list
#############################################################
export cyc=%CYC%
export cycle=t%CYC%z
export CDATE=$PDY$cyc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is set in envir-p1.h. It does not need to be set here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

module load gempak/${gempak_ver}
module load libjpeg/${libjpeg_ver}
module load grib_util/${grib_util_ver}
module load imagemagick/${imagemagick_ver}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is ${imagemagic_ver} coming from? I don't see it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This job need module imagemagic

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I understand that. But what is the value of imagemagik_ver and where is it set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It need to be set in run.ver

#PBS -q %QUEUE%
#PBS -A %PROJ%-%PROJENVIR%
#PBS -l walltime=00:20:00
#PBS -l place=vscatter,select=10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WenMeng-NOAA On WCOSS_Dell, this job (and the other jgfs_atmos_post_fXXX jobs), they take 4 Nodes totaling to 112 tasks. See lines 17,18 on the left.
Am I reading this right that this job (and other jgfs_atmos_post_fXXX jobs) are requesting 10 nodes totaling to 1280 tasks for a single forecast hour.

#PBS -q %QUEUE%
#PBS -A %PROJ%-%PROJENVIR%
#PBS -l walltime=04:00:00
#PBS -l place=vscatter,select=1:ncpus=128
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely this job only needs 1 task as it only sleeps

Copy link
Contributor

@WalterKolczynski-NOAA WalterKolczynski-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed from jgfs_atmos_post_anl.ecf to the end.

export dictionaries_ver="v3.4.0"
export omb_ver="v4.4"

## main "model" software
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these commented out lines should be removed.

export radarl2_ver=v1.2

#### NCO requested testing location is v16.2
#### export gfs_ver="v16.2.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed unused line

@aerorahul
Copy link
Contributor

This PR has been partially reviewed.
The changes are too many to be made at this time for a thorough cleanup and review and retest.
Please merge as it only affects ecflow and only a single developer can run this setup in its current form due to hardwired paths, username, and ECF_PORT number in the ecflow files such as .def, head.h, model_ver.h, envir_p1.h.

Comment on lines 19 to 24
#### Developer overwrite
export SIPHONROOT=${UTILROOT}/fakedbn
####
else
export SIPHONROOT=${UTILROOT}/fakedbn
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is developer only. Not used by NCO and thus will not be addressed by NCO.
Please resolve the if-else logic as it is executed regardless of the condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed


export DBNROOT=$SIPHONROOT

####if [[ ! " prod para test " =~ " ${envir} " && " ops.prod ops.para " =~ " $(whoami) " ]]; then err_exit "ENVIR must be prod, para, or test [envir-p1.h]"; fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented out? If it is unused, what purpose does it serve?
Either put a comment on this explaining the purpose or remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored

ecf/include/head.h Show resolved Hide resolved
ecf/include/envir-p1.h Show resolved Hide resolved
export DBNROOT=$SIPHONROOT

####if [[ ! " prod para test " =~ " ${envir} " && " ops.prod ops.para " =~ " $(whoami) " ]]; then err_exit "ENVIR must be prod, para, or test [envir-p1.h]"; fi
export ECF_PORT=34326
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this line. This variable is set when ecflow module is loaded. ecflow module is loaded in head.h. This variable is unique to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job tested without this line sometime failed silently. There is a good reason to include this in here.
As discussed, this file will be used only for reference. I prefer to keep it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes failure is a symptom of the system instability. I would rather encounter this failure sometimes, than hard-wire some users personal ECF_PORT information. I suggest removal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed


module list

#############################################################
# WCOSS environment settings
#############################################################
export KMP_AFFINITY=scatter
export OMP_NUM_THREADS=$threads
export FORT_BUFFERED=true
export OMP_NUM_THREADS_CY=28
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is OMP_NUM_THREADS_CY=28 appropriate for WCOSS2? config.resources.nco.static is setting this to npe_node_max. 28 is the value for WCOSS_Dell_P3 and was hard-wired here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same is the case in jgfs_atmos_analysis.ecf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set to 128

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set to 128

Are you sure, the value from this variable set here OMP_NUM_THREADS_CY is being used?
I see https://github.com/NOAA-EMC/GSI/blob/a9561f4219e5a69ac1af250081ae027084c40c3b/scripts/exglobal_atmos_analysis.sh#L1017 setting the value based on NTHREADS_CYCLE right before entering CYCLESH


module list

#############################################################
# WCOSS environment settings
#############################################################
export KMP_AFFINITY=scatter
export OMP_NUM_THREADS=$threads
Copy link
Contributor

@aerorahul aerorahul Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the value for $threads coming from? It was previously set on line 20 above and has been removed in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

 - Move enkf out of gdas and rename it to enkfgdas.
     Include all ecflow definition files job name
     Include all ecflow scripts name and job/log name
 - Move "model=gfs" to the top on each job except all jobs under obsproc.
     obsproc will no longer be part of GFS. Therefore leave it without change for testing purpose.
 - Remove the source of model_ver from each ecflow script except all jobs under obsproc.
     obsproc will no longer be part of GFS. Therefore leave it without change for testing purpose.
Update analysis ecflow script to use 128 for wcoss2
Remove extra CDATE
Comment on lines +10 to +20
if [[ "$envir" == prod && "$SENDDBN" == YES ]]; then
export eval=%EVAL:NO%
if [ $eval == YES ]; then
export SIPHONROOT=${UTILROOT}/para_dbn
else
export SIPHONROOT=/lfs/h1/ops/prod/dbnet_siphon
fi
export SIPHONROOT=${UTILROOT}/fakedbn
else
export SIPHONROOT=${UTILROOT}/fakedbn
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the logic here, regardless of the conditions, the result is:
export SIPHONROOT=${UTILROOT}/fakedbn
Also, a variable eval is being exported for use somewhere down when envir == prod and SENDDBN is YES.
Is that the desired outcome?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The developer will always use fakedbn to test. eval is ok here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does eval do? Is it used anywhere other than the if-test?
If the developer will always use fakeddbn, then the first if test "$envir" == prod is not necessary (and this whole block just boiled down to this):

export SIPHONROOT=${UTILROOT}/fakedbn

Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see more comments.

@@ -1,6 +1,6 @@
extern /prod18/gdas/jgdas_forecast
extern /prod18/gdas/atmos/post
extern /prod18/gdas/enkf/post
extern /prod18/enkfgdas/post
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are new changes added in these .def files that were agreed to be added in a follow-up PR in the meeting this AM. This PR is already big to add any additional changes.

ecf/include/envir-p1.h Show resolved Hide resolved
module load ecflow/$ecflow_ver
echo "ecflow module location: $(module display ecflow |& head -2 | tail -1 | sed 's/:$//')"
export ECF_ROOT=/apps/ops/prod/nco/core/ecflow.v5.6.0.7
export ECF_PORT=34326
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove user specific PORT information. This value is set in the ecflow module for the user and is unique to the user.

echo "Running module load ecflow/$ecflow_ver"
module load ecflow/$ecflow_ver
echo "ecflow module location: $(module display ecflow |& head -2 | tail -1 | sed 's/:$//')"
export ECF_ROOT=/apps/ops/prod/nco/core/ecflow.v5.6.0.7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are dual uses of the variable ecflow_ver and 5.6.0.7 for ecflow version. Which one is it? Please be consistent in the use of the version.

echo "Running module load ecflow/$ecflow_ver"
module load ecflow/$ecflow_ver
echo "ecflow module location: $(module display ecflow |& head -2 | tail -1 | sed 's/:$//')"
export ECF_ROOT=/apps/ops/prod/nco/core/ecflow.v5.6.0.7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is set by the ecflow module. It should not be overwritten. Load the appropriate module instead.
Please remove.

Comment on lines +30 to +31
export ECF_ROOT=/apps/ops/prod/nco/core/ecflow.v5.6.0.7
. ${ECF_ROOT}/versions/run.ver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am failing to understand what this is doing. Why export? and why loading prod_util in head.h

export ECF_ROOT=/apps/ops/prod/nco/core/ecflow.v5.6.0.7
export ECF_PORT=34326
export ECF_HOST=ddecflow02
export ECF_INCLUDE=/lfs/h2/emc/global/noscrub/$USER/git/feature-ops-wcoss2/ecf/include
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hardwired to a specific branch which will not exist as soon as it is merged and deleted.
I am failing to understand the need for this entire section from 40-47. It seems to be tailored for a specific branch and specific location.

@@ -0,0 +1,51 @@
#PBS -S /bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These EnKF jobs are new and we agreed to add them in a followup PR to avoid further bloating this current PR.

@KateFriedman-NOAA
Copy link
Member

Forcibly submitting this PR due to need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants