gdasgldas task fails with Restart Tile Space Mismatch #622
Comments
See the email from Dave:

Hi Helin and Jun,

I am finding with a recent upgrade of the global workflow that there is a mismatch between the land-sea mask algorithms of the UFS and the GLDAS. This results in a failure of the gdasgldas job, specifically when the Land Information System (LIS) executable is run. The error reported is 'Restart Tile Space Mismatch, Halting..', which can be found in sorc/gldas_model.fd/lsms/noah.igbp/noahrst.F:104. I have only verified this error for a C192/C96 run on 2020080500, where NCH = 97296 and LIS%D%NCH = 99582.

With the recent upgrade, the UFS is modifying the input land-sea mask during the first gdasfcst and outputting this modified mask to the tiled surface restart files. Thus, all future forecasts, analyses, etc. use this modified mask until the GLDAS reads its own from $FIX/fix_gldas/FIX_T382/lmask_gfs_T382.bfsa.

So I have a few questions. First, the UFS's modification of the land-sea mask is expected, correct? Secondly, should a new fix file be created for the GLDAS with the modified land-sea mask, or is the UFS-modified land-sea mask time dependent and thus not fixed? Lastly, should I expect this to be an issue at all resolutions?

Since GLDAS will not be included in the next operational implementation, we need someone to decide if we should spend more time on this task. |
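The mismatch reported here can be pictured with a minimal sketch. It assumes (this is not taken from noahrst.F) that the model's expected tile count equals the number of land points in its own mask, while the restart file carries a separate count. The function names and the nested-list mask layout are hypothetical.

```python
def count_land_tiles(mask):
    """Count land points (value 1) in a 2-D land-sea mask given as nested lists."""
    return sum(1 for row in mask for value in row if value == 1)


def check_tile_space(restart_nch, mask):
    """Raise if the restart tile count differs from the mask's land-point count.

    Assumption: one tile per land point. The real LIS tiling may differ.
    """
    expected = count_land_tiles(mask)
    if restart_nch != expected:
        raise ValueError(
            f"Restart Tile Space Mismatch: restart NCH={restart_nch}, "
            f"mask NCH={expected}"
        )
    return expected
```

Under that assumption, a restart written against one mask (NCH = 97296 in the run above) cannot be read by a model whose mask yields a different count (99582), which matches the symptom described.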
Is the cycling working without gdasgldas ? Have the user tried at the operational resolutions (C768/C384) ? |
The cycling appears to fail when gdasgldas fails. I have not run at operational resolution; the current run is an initial test run I'm doing as a new global-workflow user on Orion, and the initial conditions were provided to me at C384/C192.
Brett
|
@yangfanglin The experiment had no errors when running in the first few days before GLDAS turns on, so I assume that it would run if GLDAS was turned off altogether. If as @HelinWei-NOAA says that GLDAS will not be used going forward, should we turn it off now in our experiments? The DA team is still running atmosphere-only cases since we have non-coupling data upgrades to worry about. |
@CatherineThomas-NOAA If the cycling at C384/C192 resolutions with gdasgldas turned on is working on WCOSS, there are likely issues related to the setup on Orion. I had some discussion with Daryl about the use of gdasgldas in future systems. We can discuss this issue further with Daryl offline. |
@AndrewEichmann-NOAA Has your WCOSS experiment progressed to the point where the gldas step is run? If so, did it fail or run without issue? |
@CatherineThomas-NOAA No - I ran into rstprod access issues and am waiting for a response from helpdesk |
Thanks @AndrewEichmann-NOAA. I can run a short warm start test on WCOSS. |
The warm start test on WCOSS ran the gldas step without failure. Now that Hera is back, I can try a quick test there as well. |
Global-workflow develop does not build on Hera (#561), so I can't run this test at this time. |
I ran into this issue with the S4 and Jet ports, which I reported to Helin. I have since turned off GLDAS altogether and everything has run OK out to 10 days. Below is more of the thread between Helin, @junwang-noaa, and myself:

David,
GLDAS should use the same land-sea mask as UFS. If the land-sea mask can be changed during the forecast, that certainly will bring up some issues for GLDAS.
Helin

Dave,
The model does check/modify the land-sea mask according to the input land/lake fraction and soil type values; this is applied to both non-fractional grid and fractional grid. The changes are required to make sure the model has a consistent land-sea mask and soil types. The new land-sea mask is output in the model sfc file. I expect you do not change the oro data or the soil type data during your run, so this land-sea mask won't change. In other words, I think you need to create lmask_gfs_T382.bfsa once with this land-sea mask in the history file.
Jun |
@HelinWei-NOAA @barlage and @yangfanglin should Jun's suggestion be followed and the mask be generated for GLDAS from the history file? It would be good to know if the same issue is seen on other platforms. |
@arunchawla-NOAA Cathy reported that "the warm start test on WCOSS ran the gldas step without failure". So the failures on other platforms might be a porting issue. @CatherineThomas-NOAA Cathy, can you confirm ? What is the resolution you tested on WCOSS ? I assume you were using the GFS.v16* tag instead of the UFS, right ? |
@BrettHoover-NOAA I'm unable to access either the code or experiment directories. I ran a test on Orion a few days ago and didn't have any issue, so this probably isn't a port issue. I'm setting up another test just to be sure. |
My test on WCOSS was C384/C192 with warm start initial conditions from the PR #500 test. This was using the head of develop global-workflow at the time (d3028b9), compiling and running with atmosphere only. Since then, I've run new tests on Hera and Orion with the recent update to develop (97ebc4d) and ran into no issues with gldas. @BrettHoover-NOAA It's possible this issue got fixed inadvertently with the recent develop update. It could also be related to the set of ICs that you started from. How about you try to replicate the test I ran first? I'll point you to my initial conditions offline. |
I was able to run a 6½-cycle warm-start from a fresh clone overnight without issue. I'm also using the C384 ICs Cathy produced for PR #500. |
@CatherineThomas-NOAA I was able to complete your test with the new develop (97ebc4d) and warm-start ICs on Orion, and I ran into no problems, gdasgldas appears to finish successfully. |
@BrettHoover-NOAA Great to hear. It looks like your Orion environment is working properly. Maybe to round out this set of tests you could try the warm-start 2020083000 ICs but with the original workflow that you cloned, assuming you still have it. |
@CatherineThomas-NOAA I have that test running right now, I'll report back ASAP |
@CatherineThomas-NOAA The warm-start test with the original workflow also finished successfully. |
@BrettHoover-NOAA Great! There may have been an incompatibility with the other ICs then. @HelinWei-NOAA @DavidHuber-NOAA Is the land-sea mask problem that you mentioned earlier documented elsewhere? Can this issue be closed? |
@CatherineThomas-NOAA No. It hasn't been documented elsewhere. But I have let Fanglin and Mike know this issue. IMO this issue can be closed now.
|
@CatherineThomas-NOAA I'm also OK with this issue being closed. |
I gave this a fresh cold start test (C192/C96) on Orion over the weekend and received the same error. Initial conditions were generated on Hera (/scratch1/NESDIS/nesdis-rdo2/David.Huber/ufs_utils/util/gdas_init), outputting them here: /scratch1/NESDIS/nesdis-rdo2/David.Huber/output/192. UFS_Utils was checked out using the same hash as the global workflow checkout script (04ad17e2). These were then transferred to Orion where a test ran from 2020073118 through 2020080500, where gdasgldas failed with the same message ("Restart Tile Space Mismatch, Halting.."). The global workflow hash used was 64b1c1e and can be found here: /work/noaa/nesdis-rdo2/dhuber/gw_dev. Logs from the run can be found here: /work/noaa/nesdis-rdo2/dhuber/para/com/test_gldas/logs.

A comparison of the land surface mask ( I also created initial conditions for C384/C192 and compared the land surface mask against Cathy's tile 1 restart surface file, which shows no difference. Lastly, I copied the C384/C192 ICs over to Orion and executed just the gdasfcst job, then compared the It is this modification by the UFS that triggers problems for GLDAS and I think any further tests could be limited to just a single half-cycle run of gdasfcst.

I tracked this modification to the addLsmask2grid subroutine, which is tied to ocean fraction, which in turn is set in the orographic fix files, which are identical on Orion and Hera. So I am at a loss as to why these differ between warm and cold starts. Is this expected behavior, and if so should GLDAS be turned off for cold starts? |
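A mask comparison of the kind described in this comment could be sketched as follows. Reading the masks out of the tiled surface files is left out, since that depends on the file format and tooling available. The function name and the nested-list representation are assumptions for illustration only.

```python
def mask_differences(mask_a, mask_b):
    """Return (row, col) indices where two equally shaped masks disagree."""
    same_shape = len(mask_a) == len(mask_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(mask_a, mask_b)
    )
    if not same_shape:
        raise ValueError("masks must have the same shape")
    return [
        (i, j)
        for i, (row_a, row_b) in enumerate(zip(mask_a, mask_b))
        for j, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]
```

An empty result for the input mask versus the mask written after the first gdasfcst would indicate that the forecast left the mask untouched; any returned indices mark points the forecast changed.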
All, is there guidance on whether users should turn off GLDAS when running cold-started experiments for now? @jkhender just hit the same error on Orion with a cold-started experiment using global-workflow |
correction - my experiment is running on Hera |
This is what I would suggest. I don't think that a fix file update will work for everyone since warm starts seem to be using the current fix files without a problem, implying that an update would result in a mismatch for those users (testing could confirm this). Alternatively, two sets of fix files could be created, one for warm starts and one for cold, but that would require some scripting to know which to link to. |
@KateFriedman-NOAA @HelinWei-NOAA The new fix files should only be used for cold starts. Warm starts should continue to use the old fix files. The reason for this is that warm starts continue to use the same version of the GFS, which uses an older algorithm for the land mask and thus does not change the land mask to match what is in the new fix files. My suggestion was to create two sets of fix files -- one for cold starts and one for warm starts. This will require changes to the gldas scripts to point to the correct fix dataset. |
@DavidHuber-NOAA Ok I see. So I should use the following in these situations then (please check my understanding):
What about frac_grid off in warm-start? Does a warm start use the same set regardless of frac_grid? What still bugs me about this is that what happens in cold vs warm is carried through the tests even though they are both warm after the initial cycle. It seems to me that the cold vs warm difference should only exist in the first cycle and the same fix files should be used in either kind of run after that cycle. I'm not a forecast model expert though, so this is just my view from the workflow end of things. :) |
@KateFriedman-NOAA I believe the current fix files should work for both nofrac_grid and frac_grid warm starts since production does not modify the land-sea mask. |
I agree with your point that the new fix files should be used after the first cycle. Why the land sea mask does not change in future warm-start cycles does not make sense to me, either. |
Ok, thanks @DavidHuber-NOAA for confirming. Will likely need to run nofrac_grid tests at some point.
@HelinWei-NOAA @junwang-noaa @yangfanglin Is there a particular reason that a different land-sea mask file is used depending on how the run was started? If a cycle is using warm starts (after the initial cold-start cycle), why is it different from a warm-started run that is also using warm starts in its cycles? Is it possible to only use the special cold-start GLDAS fix file set in the first gldas job of a cold-started run and then use the existing fix set for later cycles (and in all warm-started run cycles)? Hope my questions make sense. Thanks! |
@KateFriedman-NOAA @DavidHuber-NOAA The same set of fix files should be used for all cases discussed here, whether it is a warm or cold start, the first cycle or later cycles, fractional or non-fractional grid. If you are using the current operational GFS then you do need the "old" fix files. Any UFS-based application should use the new fix files. If the new files are not working then there are probably still issues. |
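The rule stated in this comment amounts to a lookup keyed on the forecast model rather than on start type, cycle, or grid option. A minimal sketch follows. The model labels ("gfsv16", "ufs") and the returned set names ("old", "new") are hypothetical and do not correspond to actual workflow variables or directory names.

```python
def select_gldas_fix_set(model):
    """Pick a GLDAS fix set label from the forecast model in use.

    'gfsv16' stands for the current operational GFS and 'ufs' for any
    UFS-based application. Labels and return values are hypothetical.
    """
    choices = {"gfsv16": "old", "ufs": "new"}
    try:
        return choices[model.lower()]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None
```

Note that the function deliberately takes no warm/cold or frac_grid argument, reflecting the statement that those distinctions should not change which set is used.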
@yangfanglin Noted. Ok, trying to wrap my head around the different IC scenarios we have here then, which need to be supported in the workflow since we will have GFSv16 warm-starts for a while still. Clarification questions below. Here are more details on my ICs and runs:
Both sets of ICs were originally from GFSv16 ops but differed based on cold (run through chgres_cube) or warm. I ran with the Prototype-P8 tag of ufs-weather-model in both tests.
Question 1) All warm-started runs of UFS that use warm-start ICs from GFSv16 need to use the old GLDAS fix files, correct? Regardless of the checksum run on them?
Question 2) Any run of the UFS that uses either cold-start ICs (regardless of source) or warm-start ICs from a different GFSv17 run uses the new GLDAS fix files. Is that correct?
Question 3) Is there an aspect of the ICs that I can check at the start of a run to help decide which set of fix files the workflow will use?
Question 4) Is there a way to process GFSv16 warm-starts for use with the new GLDAS fix files?
Thanks! |
@KateFriedman-NOAA We cannot run Prototype-P8 tag of ufs-weather-model using GFS.v16 warm start files because of changes in a few physics schemes including land and microphysics etc. GFS.v16 warm start ICs need to be converted to cold start ICs using CHGRES even if you are running at the C768L127 resolution . We have been doing this for a while. You can find a set of cold start ICs at the C768L127 resolution converted from GFS.v16 warm start ICs archived on HPSS at /NCEPDEV/emc-global/5year/Fanglin.Yang/ICs/C768L127 |
Got it, thanks for explaining!
Ok, I was wondering if that would be the case. Thanks!
Good to know, thanks! |
Absorb GLDAS scripts into global-workflow and fix GLDAS job by updating scripts to use new GLDAS fix file set.
* Remove GLDAS scripts from .gitignore
* Remove GLDAS script symlinks from link_workflow.sh
* Add GLDAS scripts to global-workflow
* Updates to GLDAS scripts, includes converting GLDAS script to replace machine checks with CFP variables
* Address linter warnings and remove obsolete platforms
Refs #622 #1014
@BrettHoover-NOAA @DavidHuber-NOAA @HelinWei-NOAA @yangfanglin PR #1018 mostly fixed this issue, however PR #1009 will bring in changes to have the GLDAS job use the updated fix set. We will announce that the GLDAS job is working again after #1009 goes into global-workflow Thanks for your help on resolving this issue! |
GLDAS scripts were recently moved into the workflow repo and need to be updated for the new fix structure like other components. Refs: NOAA-EMC#622, NOAA-EMC#966
Restart Tile Space Mismatch for C96 parallel Running a C96L127 parallel on Orion. The 2021122500 gdasgldas job is the first job in the parallel with sufficient
Here are details of the parallel
Tagging @KateFriedman-NOAA , @WalterKolczynski-NOAA , @HelinWei-NOAA Have we successfully run gldas on Orion for C96 after #1009 was merged into |
@HelinWei-NOAA @KateFriedman-NOAA @WalterKolczynski-NOAA Looking through the history, it seems we never generated the |
@DavidHuber-NOAA You are right. We haven't updated the gldas fixed fields for the resolution C96. Now we have one for fraction grid (/work/noaa/stmp/rtreadon/RUNDIRS/prgsida4/2021122500/gdas/gldas.63881/sfc.gaussian.nemsio.20211222). Would you mind running another C96 case but turning off fractional grid? Thanks. |
@HelinWei-NOAA Will do. |
@HelinWei-NOAA OK, the C96, non-fractional grid sfc file is here: /work/noaa/nesdis-rdo2/dhuber/for_helin/96_nofrac/sfc.gaussian.nemsio.20220401. |
@DavidHuber-NOAA This nemsio file is for C192 resolution even though you ran the model at C96. My program is not robust enough to fail when given an input file at the wrong resolution. I checked the previous nemsio tarball you created. The data is always one level higher in resolution than it should be. @KateFriedman-NOAA I need to recreate some GLDAS fixed fields because of this mismatch. |
@HelinWei-NOAA Okie dokie. Pass me the updated/new files when ready and I'll copy them into the fix set. Thanks! |
@DavidHuber-NOAA When you ran gdas at C96, gfs was always run at one level higher resolution (C192 in this case). Was the nemsio file created by gfs? |
@HelinWei-NOAA The resolution is correct for C96. The GLDAS runs at T190 for a C96 case and that is the resolution of the nemsio file I provided. The GLDAS resolution is determined by the global workflow script scripts/exgdas_atmos_gldas.sh at line 58: This is confirmed on lines 90 through 93 where the linking to the correct fix files is written as
So I believe you have correctly derived the fix files for C192 - C768 and that between Russ' and my nemsio files, you will have what you need for C96. If T190 is supposed to correspond to C192, then the gldas scripts need to be rewritten. |
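The pairing of cubed-sphere case and GLDAS resolution can be sketched as below. The relation 2*res - 2 is an assumption inferred from the two pairs that appear in this thread (T190 for C96 here, and FIX_T382 for the C192 run in the first comment). It is not quoted from scripts/exgdas_atmos_gldas.sh, and the function name is hypothetical.

```python
def gldas_truncation(case):
    """Map a cubed-sphere case such as 'C96' to a spectral truncation label.

    Uses T = 2 * res - 2, an assumed relation that reproduces the
    C96 -> T190 and C192 -> T382 pairs mentioned in the thread.
    """
    if not case.upper().startswith("C") or not case[1:].isdigit():
        raise ValueError(f"unexpected case name: {case!r}")
    res = int(case[1:])
    return f"T{2 * res - 2}"
```

A helper like this, run against the resolution recorded in an input file, is one way a fix-file generator could refuse a file whose grid does not match the case it is being asked to process.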
@DavidHuber-NOAA You are absolutely right. Thank you for the clarification. However, my program interpreted them wrongly, so I need to recreate the whole set of gldas fixed field data. Do you have the nemsio files for C192 and C384 without fractional grid? Thanks. |
@DavidHuber-NOAA I found those two nemsio files. Thanks. |
@KateFriedman-NOAA @DavidHuber-NOAA It turns out only C96 (T190) has a problem. Please copy the updated data for this resolution FIX_T190 from /scratch1/NCEPDEV/global/Helin.Wei/save/fix_gldas for both nonfrac_grid and frac_grid. Thanks. |
@HelinWei-NOAA @DavidHuber-NOAA @RussTreadon-NOAA I have rsync'd the updated FIX_190 nonfrac_grid and frac_grid files into the fix/gldas/20220920 subfolder on all supported platforms. Please retest the GLDAS job that failed and let me know if further updates are needed. Thanks! |
…#622) * fix initialization for moving nest grid
Expected behavior
gdasgldas task should complete successfully and global-workflow should continue to cycle fv3gdas
Current behavior
gdasgldas task is failing on the first cycle in which the task is not skipped (first 00z analysis period when enough data has been produced to trigger task).
Machines affected
This error is being expressed on Orion.
To Reproduce
I am seeing this bug in a test of global-workflow being conducted on Orion in the following directories:
expid: /work/noaa/da/bhoover/para/bth_test
code: /work/noaa/da/bhoover/global-workflow
ROTDIR: /work/noaa/stmp/bhoover/ROTDIRS/bth_test
RUNDIR: /work/noaa/stmp/bhoover/RUNDIRS/bth_test
This run is initialized on 2020082200, and designed to terminate 2 weeks later on 2020090500.
Experiment setup:
/work/noaa/da/bhoover/global-workflow/ush/rocoto/setup_expt.py --pslot bth_test --configdir /work/noaa/da/bhoover/global-workflow/parm/config --idate 2020082200 --edate 2020090500 --comrot /work/noaa/stmp/bhoover/ROTDIRS --expdir /work/noaa/da/bhoover/para --resdet 384 --resens 192 --nens 80 --gfs_cyc 1
Workflow setup:
/work/noaa/da/bhoover/global-workflow/ush/rocoto/setup_workflow.py --expdir /work/noaa/da/bhoover/para/bth_test
Initial conditions:
/work/noaa/da/cthomas/ICS/2020082200/
The error is found in the gdasgldas task on 2020082600.
Log file:
/work/noaa/stmp/bhoover/ROTDIRS/bth_test/logs/2020082600/gdasgldas.log
Context
This run is being used by a new Orion user and member of the satellite DA group, only to familiarize myself with the process of carrying out an experiment. There have been no code-changes made for this run. I followed directions for cloning and building the global-workflow, and setting up a cycled experiment, from the available wiki:
https://github.com/NOAA-EMC/global-workflow/wiki/
I did not create the initial condition files, they were instead produced for me. The global-workflow repository was cloned on January 25 2022 (d3028b9)
The task fails with the following error in the log-file:
The dimension size of 389408 is suspicious, since earlier in the log a different dimension size is referenced, e.g.:
When I search for "389408" in the log-file, it only appears in two places, one is in the Restart Tile Space Mismatch error, and the other is while running exec/gldas_rst, when reporting the results of a FAST_BYTESWAP:
I believe that the error is related to the difference in tile-size between these two values.
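The log search described above can be reproduced with a few lines of Python. This is a generic sketch; the function name is hypothetical and nothing here depends on the layout of gdasgldas.log.

```python
def find_lines(log_text, token):
    """Return (line_number, line) pairs for lines of log_text containing token."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if token in line
    ]
```

Reading the log file into a string and calling find_lines(text, "389408") would list every place the value appears along with its line number, which makes it easier to see which step first introduces the unexpected dimension.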
Detailed Description
I have proposed no change or addition to the code for this run.
Additional Information
Prior gdasgldas tasks in the run from initialization to 2020082600 were successful, but they were all skipped either because the analysis was for a non-00z period or because the requisite number of cycles had not been completed to allow the task to trigger. There are no successful gdasgldas tasks in this run that I can use to compare to the one that has failed. I have conferred with more experienced EMC users of fv3gdas and the cause of the problem is not obvious.
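The skip behaviour described above (the task only does real work on a 00z cycle once enough of the run has elapsed) can be sketched as follows. The min_hours threshold is a hypothetical stand-in for the workflow's actual criterion, which is not reproduced here.

```python
from datetime import datetime


def gldas_would_run(start, cycle, min_hours):
    """Return True if cycle is a 00z cycle at least min_hours after start.

    start and cycle are YYYYMMDDHH strings. min_hours is a hypothetical
    threshold standing in for "enough cycles completed".
    """
    fmt = "%Y%m%d%H"
    start_time = datetime.strptime(start, fmt)
    cycle_time = datetime.strptime(cycle, fmt)
    elapsed_hours = (cycle_time - start_time).total_seconds() / 3600
    return cycle_time.hour == 0 and elapsed_hours >= min_hours
```

For the run in this report, which starts at 2020082200 and first executes the task at 2020082600, a threshold of 96 hours would reproduce the observed behaviour. That value is chosen only to match those dates.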
Possible Implementation
I have no implementation plan to offer.