
1/2 degree coupled model crashes when backscatter is on and a mask_table is used #262

Closed
nikizadehgfdl opened this issue Mar 28, 2016 · 10 comments


@nikizadehgfdl
Contributor

The base 1/2 degree coupled model (CM4_c96L32_am4g9_2000_OMp5) crashes for some specific layouts when a mask_table is used (on both c1 and c3), with:

FATAL from PE  1040: NaN in input field of reproducing_sum(_2d).
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
fms_cm4_sis2_comp  0000000002F73766  mpp_mod_mp_mpp_er          51  mpp_util_mpi.inc
fms_cm4_sis2_comp  00000000007571C8  mom_coms_mp_repro         174  MOM_coms.F90
fms_cm4_sis2_comp  00000000007BB894  mom_spatial_means          56  MOM_spatial_means.F90
fms_cm4_sis2_comp  000000000070AE72  mom_mp_step_mom_         1188  MOM.F90
fms_cm4_sis2_comp  00000000006548A5  ocean_model_mod_m         414  ocean_model_MOM.F90
fms_cm4_sis2_comp  0000000000401DBB  MAIN__                    912  coupler_main.F90

It runs fine for the same layout without the mask_table.

It also runs fine with some other mask_tables (for other layouts).

Here are the work dirs for the same layout, with and without the mask_table:

using mask_table (crashes):
/lustre/f1/Niki.Zadeh/work/ulm_201510_awg_v20160212_mom6_2016.03.22/CM4_c96L32_am4g9_2000_OMp5.2016.03.22_1x0m1d_432x2a_1057x1o.o5043100

no mask_table (runs):
/lustre/f1/Niki.Zadeh/work/ulm_201510_awg_v20160212_mom6_2016.03.22/CM4_c96L32_am4g9_2000_OMp5.2016.03.22_1x0m1d_432x2a_1296x1o.o5043099

@Zhi-Liang is there a way to test if a mask_table is "bad"?

I made all mask_tables using the same check_mask tool.
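
Since all of the tables came from the same check_mask tool, one quick check is purely structural. Below is a minimal sketch, assuming the usual FMS mask_table layout (masked-region count on line 1, the nx,ny processor layout on line 2, then one i,j pair per masked rank); it can only catch a malformed file, not a mask that disagrees with the topography, so treat it as a sanity filter rather than a real validator.

program check_mask_table
  ! Sketch only: assumes the FMS mask_table format described above.
  implicit none
  integer :: n_mask, nx, ny, i, j, k, ios
  open(10, file='mask_table.239.36x36', status='old', action='read')
  read(10,*) n_mask          ! number of masked-out ranks
  read(10,*) nx, ny          ! processor layout the table was built for
  do k = 1, n_mask
    read(10,*,iostat=ios) i, j
    if (ios /= 0) then
      print *, 'fewer entries than the header claims, at entry', k ; stop 1
    end if
    if (i < 1 .or. i > nx .or. j < 1 .or. j > ny) then
      print *, 'entry', k, 'is outside the layout:', i, j ; stop 1
    end if
  end do
  close(10)
  print *, 'structurally consistent:', n_mask, 'masked ranks in a', nx, 'x', ny, 'layout'
end program check_mask_table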

Moreover, if I change the model (MOM6) parameters it runs fine with the same mask_table that crashed for the base experiment above, e.g.,

using mask_table (runs):
/lustre/f1/Niki.Zadeh/work/ulm_201510_awg_v20160212_mom6_2016.03.22/CM4_c96L32_am4g9_2000_OMp5_lmix_H5_nmle_ndiff_meke.2016.03.22_1x1m0d_432x2a_1057x1o.o76029

This tells me there might be something wrong with some of the mask_tables that I have generated for the 1/2 degree ocean grid:

the ones that worked
/lustre/f1/unswept/Niki.Zadeh/archive/input/verona/John.Dunne_OM4_05_20151028/mosaic_ocean.720x576/mask_table.320.36x45
/lustre/f1/unswept/Niki.Zadeh/archive/input/verona/John.Dunne_OM4_05_20151028/mosaic_ocean.720x576/mask_table.417.45x45

the ones that crashed
/lustre/f1/unswept/Niki.Zadeh/archive/input/verona/John.Dunne_OM4_05_20151028/mosaic_ocean.720x576/mask_table.239.36x36
/lustre/f1/unswept/Niki.Zadeh/archive/input/verona/John.Dunne_OM4_05_20151028/mosaic_ocean.720x576/mask_table.201.32x36
/lustre/f1/unswept/Niki.Zadeh/archive/input/verona/John.Dunne_OM4_05_20151028/mosaic_ocean.720x576/mask_table.359.36x50
/lustre/f1/unswept/Niki.Zadeh/archive/input/verona/John.Dunne_OM4_05_20151028/mosaic_ocean.720x576/mask_table.548.36x72

@Zhi-Liang
Contributor

Hi Niki,

I do not know of a method to check whether a mask table is bad. Did these crashed runs work with previous code, or are these new tests (mask tables created recently)?

Zhi

@nikizadehgfdl
Contributor Author

@Zhi-Liang you have a good point.
I had not seen these crashes before, and indeed I think this test ran with MOM6 tag 2016.03.02 using mask_table.239.36x36, which is now crashing.

I will run the test with DEBUG = True to see where things start to diverge.

@nikizadehgfdl
Contributor Author

When I set DEBUG=True the issue went away!!!

/lustre/f1/Niki.Zadeh/ulm_201510_awg_v20160212_mom6_2016.03.22/CM4_c96L32_am4g9_2000_OMp5.2016.03.22/ncrc2.intel16-prod-openmp/stdout/run/CM4_c96L32_am4g9_2000_OMp5.2016.03.22_1x0m1d_432x2a_1057x1o.o5043100 

/lustre/f1/Niki.Zadeh/ulm_201510_awg_v20160212_mom6_2016.03.22/CM4_c96L32_am4g9_2000_OMp5.2016.03.22/ncrc2.intel16-prod-openmp/stdout/run/CM4_c96L32_am4g9_2000_OMp5.2016.03.22_1x0m1d_432x2a_1057x1o1.o5043101

@Zhi-Liang
Contributor

Hi Niki,

From my previous experience, a possible reason is that some variable is not initialized. In debug mode it is given the value 0 and the run is OK; in production mode it causes a crash.

Zhi
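
A toy illustration of the point above (the behaviour is entirely compiler- and flag-dependent, so this is a sketch, not a statement about any particular build): x is never assigned, and what gets printed depends on whether the compiler happens to zero-initialize it.

program uninit_demo
  implicit none
  real :: x                      ! deliberately never assigned
  ! Some debug builds happen to hand out zero-filled memory, so this prints 1.0
  ! and the run looks fine; an optimized build can leave garbage (or a signaling
  ! NaN, if trapping flags are used) in x and the same code blows up downstream.
  print *, '1.0 + uninitialized x =', 1.0 + x
end program uninit_demo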

@nikizadehgfdl
Contributor Author

Yes, but I am not talking about the "debug" mode. I used the same "prod" executable. DEBUG=True is a MOM6 parameter that prints out lots of checksums.

To check my sanity I repeated the experiment a few times and it crashed every time. Then I added
#override DEBUG = True
and again it ran fine!

@adcroft could #override DEBUG = True have any effect on uninitialized variables?

@nikizadehgfdl
Contributor Author

@adcroft modified MOM_spatial_means.F90 a little for debugging:

54   tmpForSumming(:,:) = 0.
55   tmpAreaForSumming(:,:) = 0.
56   do j=js,je ; do i=is,ie
57     tmpForSumming(i,j) = ( var(i,j) * (G%areaT(i,j) * G%mask2dT(i,j)) )
58     tmpAreaForSumming(i,j) = G%areaT(i,j) * G%mask2dT(i,j)
59   enddo ; enddo
60   global_area = reproducing_sum( tmpAreaForSumming )
61   global_area_mean = reproducing_sum( tmpForSumming ) / global_area
62   !global_area_mean = reproducing_sum( tmpForSumming ) * G%IareaT_global

This resulted in a crash, as before, at line 61 above, confirming that something goes wrong with the input array var.

I'll rerun in "debug" mode for further clues.
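
One way to get more out of the FATAL above is to scan var for NaNs and report the offending indices just before the reproducing_sum call. A standalone toy sketch (not MOM6 code) using the standard ieee_arithmetic module; in the model the same loop-and-test idea would run over the compute domain of var:

program find_nan
  use ieee_arithmetic, only : ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none
  real    :: var(4,4)
  integer :: i, j
  var = 1.0
  var(3,2) = ieee_value(var(3,2), ieee_quiet_nan)   ! plant a NaN for the demo
  do j = 1, size(var,2) ; do i = 1, size(var,1)
    if (ieee_is_nan(var(i,j))) write(*,'(a,2i4)') ' NaN in var at i,j =', i, j
  enddo ; enddo
end program find_nan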

@adcroft
Collaborator

adcroft commented Mar 31, 2016

That tells us the bad data is in var, so yes we need to look at it in a debugger.

@nikizadehgfdl
Contributor Author

The crash does not happen in "debug" mode! It happens in both "prod" and "repro" modes.

@nikizadehgfdl
Contributor Author

This issue is related to the combination of MOM6 backscatter AND using a mask_table.
Turning backscatter off by removing the following line from MOM_override makes the NaN issue go away.

#override MEKE_VISCOSITY_COEFF = -0.4

So the backscatter code is not compatible with using a mask_table. I have edited the title to reflect that.
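
For reference, the relevant MOM_override entries from this thread look like the following (MOM6 parameter files take trailing '!' comments); the first line is what turns backscatter on, the second only adds the checksum output that happened to hide the crash:

#override MEKE_VISCOSITY_COEFF = -0.4  ! backscatter on; remove this line to turn it off
#override DEBUG = True                 ! extra checksum output; the crash went away with this set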

@nikizadehgfdl nikizadehgfdl changed the title some 1/2 degree coupled model crashes with some specific layouts with mask_table 1/2 degree coupled model crashes when backscatter is on and a mask_table is used Apr 8, 2016
nikizadehgfdl added a commit to nikizadehgfdl/MOM6 that referenced this issue Mar 23, 2018
- Closes issue mom-ocean#734 and hopefully issue mom-ocean#262
- The MEKE%Ku array is allocated on the data domain and has an mpp_domain_update
  following the loop, so it needs to be calculated only on the compute domain
- The problem with the larger loop extents is that the Lmixscale array is
  initialized/calculated only on the compute domain and is NaN beyond the compute
  domain extents, allowing NaNs to leak into MEKE%Ku and the model.
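
A self-contained toy of what the commit message describes (not the actual MOM6 diff; the array names mirror the commit message, while the bounds and the 0.4 factor are just illustrative): Lmixscale is only filled on the compute domain, so looping over the full data domain drags NaNs into Ku, whereas a compute-domain loop, with the existing halo update filling the rest, stays clean.

program compute_vs_data_domain
  use ieee_arithmetic, only : ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none
  integer, parameter :: isd=1, ied=6, jsd=1, jed=6   ! data domain (with halos)
  integer, parameter :: is=2,  ie=5,  js=2,  je=5    ! compute domain
  real :: Lmixscale(isd:ied,jsd:jed), Ku(isd:ied,jsd:jed)
  integer :: i, j
  Lmixscale = ieee_value(0.0, ieee_quiet_nan)        ! halo points never initialized
  Lmixscale(is:ie,js:je) = 1.0e3                     ! valid only on the compute domain
  Ku = 0.0
  do j = js, je ; do i = is, ie                      ! compute-domain loop (the fix)
    Ku(i,j) = 0.4 * Lmixscale(i,j)
  enddo ; enddo
  print *, 'NaN in Ku after compute-domain loop?', any(ieee_is_nan(Ku))
  do j = jsd, jed ; do i = isd, ied                  ! data-domain loop (the bug)
    Ku(i,j) = 0.4 * Lmixscale(i,j)
  enddo ; enddo
  print *, 'NaN in Ku after data-domain loop?  ', any(ieee_is_nan(Ku))
end program compute_vs_data_domain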
@Hallberg-NOAA
Collaborator

We do not think that this is an issue with any actively used models. Further work that completes the backscatter capability would have to address this, if it is still an issue there.
