FESOM2 writing restart files slowly on some machines #103

Closed

JanStreffing opened this issue Apr 19, 2021 · 16 comments

@JanStreffing
Collaborator

JanStreffing commented Apr 19, 2021

After running AWI-CM3 on Aleph for the first time, I found that we have an issue with restart files being written extremely slowly there. We are looking at a write speed of less than 3 MB/s. Some machines (e.g. Ollie and Mistral) apparently have their NetCDF IO layer configured differently and handle this for us, but on Juwels and Aleph we need to order the dimensions correctly ourselves.

I had previously found and fixed this for Juwels with these two commits:
9bce28a
d710062

Shortly after I made these changes, @hegish merged his substantial changes and improvements to the regular IO into the master branch. Merging what I did for Juwels became quite difficult after that, and we never attempted it. Now I think we have to.

To recap from the old gitlab issue: https://gitlab.dkrz.de/FESOM/fesom2/-/issues/19

We are writing restart files like this (now outdated):

        do lev=1, size1
           laux=id%var(i)%pt2(lev,:)
           t0=MPI_Wtime()
           if (size1==nod2D  .or. size2==nod2D)  call gather_nod (laux, aux)
           if (size1==elem2D .or. size2==elem2D) call gather_elem(laux, aux)
           t1=MPI_Wtime()
           if (mype==0) then
              id%error_status(c)=nf_put_vara_double(id%ncid, id%var(i)%code, (/lev, 1, id%rec_count/), (/1, size2, 1/), aux, 1); c=c+1
           end if
           t2=MPI_Wtime()
           if (mype==0 .and. size2==nod2D) write(*,*) 'nvar: ', i, 'lev: ', lev, 'gather_nod: ', t1-t0
           if (mype==0 .and. size2==nod2D) write(*,*) 'nvar: ', i, 'lev: ', lev, 'nf_put_var: ', t2-t1
        end do

We loop over the first dimension, lev. Since we are writing from Fortran, NetCDF reverses the dimension order in the output file compared to what we pass in here. The order as shown by ncdump is:
double u(time, elem, nz_1) ;
During writing this means we hold lev fixed, the dimension that changes fastest in the original data in memory, and NetCDF has to start a new write access for every nod2D block. In 2D terms: imagine you want to write one row of 100000 values, but instead you write 100000 columns of length one. This is confirmed with the darshan IO logging tool, which shows the number of seek and write accesses for writing a single restart file, as well as the size of those write accesses. You can find the full logfile attached.
Before: streffin_fesom.x_id2406075_6-29-66061-10726804095939863824_1.darshan_1_.pdf
[Screenshot from 2021-04-19: darshan access statistics before the fix]

As you can see, we are generating 84 million accesses with just a CORE2 mesh restart. On Juwels and Aleph this is painfully slow (~25 min for a CORE2 restart).
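
For illustration only (not from the FESOM2 code; the buffer and size names aux3 and nl are placeholders): the per-level loop above could in principle be replaced by a single gather of the full field and one write call per variable, e.g.:

        ! Sketch only: gather the whole field once, then issue one nf_put_vara_double
        ! call covering all levels instead of nl separate slab writes.
        if (mype==0) then
           id%error_status(c)=nf_put_vara_double(id%ncid, id%var(i)%code, &
                (/1, 1, id%rec_count/), &  ! start: first level, first node/elem, current record
                (/nl, size2, 1/),       &  ! count: all levels, all nodes/elems, one record
                aux3); c=c+1               ! aux3(nl,size2): full field gathered on rank 0
        end if

Whether gathering a full 3D field on rank 0 is affordable memory-wise is a separate question, see further down in this thread.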

For the old routines I had found a working solution on Juwels by chunking the NetCDF data in io_restart.F90:

     if (n==1) then
        id%error_status(c)=nf_def_var_chunking(id%ncid, id%var(j)%code, NF_CHUNKED, (/1/)); c=c+1
     elseif (n==2) then
        id%error_status(c)=nf_def_var_chunking(id%ncid, id%var(j)%code, NF_CHUNKED, (/1, id%dim(1)%size/)); c=c+1
     end if

and io_meandata.F90:

  if (entry%ndim==1) then
     entry%error_status(c) = nf_def_var_chunking(entry%ncid, entry%varID, NF_CHUNKED, (/1/)); c=c+1
  elseif (entry%ndim==2) then
     entry%error_status(c) = nf_def_var_chunking(entry%ncid, entry%varID, NF_CHUNKED, (/1,  entry%glsize(1)/)); c=c+1
  endif
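
(Side note, not from the original issue: the chunk layout that actually ends up in a file can be verified afterwards with ncdump's special attributes, e.g.

     ncdump -s -h <restart file>

which prints a _ChunkSizes attribute per variable, listed in the same reversed, C-style dimension order that ncdump uses for the dimensions. This makes it easy to confirm that the chunking requested above actually made it into the file.)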

After: streffin_fesom.x_id2412833_7-3-8403-13876901149393710826_1.darshan.pdf
[Screenshot from 2021-04-19: darshan access statistics after the fix]

This increased the output speed on a CORE2 restart on Juwels from 25 minutes to 40 seconds.

I think we have to revisit this issue and work on an implementation within the new IO scheme. @hegish @dsidoren

@JanStreffing JanStreffing added the bug Something isn't working label Apr 19, 2021
@koldunovn
Member

Is it maybe a good time to combine this with splitting the restarts into individual files?

@JanStreffing
Collaborator Author

Do you think splitting the restarts into individual files also has a timeline of a day or two? That's what I was hoping for with the chunking, since this issue essentially blocks us from doing anything but short scaling tests on Aleph.

@koldunovn
Member

This is a question to @hegish :) I think we agreed on this strategy, so eventually it will be implemented, but no idea on the timeline :)

@hegish
Collaborator

hegish commented Apr 19, 2021

We agreed to do it after the upcoming release.
I will have to finish the async part to make it faster, though. As this is unfinished work anyway, I really would like to finish it ASAP. We have to talk about priorities...

@trackow
Contributor

trackow commented Apr 26, 2021

I wanted to add that this was a serious issue and painfully slow on the ECMWF machine (Cray) as well, for the DYAMOND runs. It was still bearable with the CORE grid, but for the ORCA025 grid, reading/writing one (!) restart was on the order of 45+ minutes. I tried Jan's fixes once, but we could not get them to work at ECMWF when coupled to IFS.

The restart took 6500 seconds:

==========================================
MODEL SETUP took on mype=0 [seconds]
runtime setup total 6520.80273

runtime setup mesh 15.691925
runtime setup ocean 1.98124599
runtime setup forcing 2.488662
runtime setup ice 6.949400902E-2
runtime setup restart 6500.56738
runtime setup other 3.926753998E-3
============================================
MPI has been initialized in the atmospheric model

From an email from Nils Wedi: "I am not sure why you looped over 2D variables, but I recoded it and added openMP on the packing in the halo exchange, possibly I need the same for the write + gather now to do a 3D gather. you can have a look at my versions of

gen_halo_exchange.F90
io_restart.F90

I now get

MODEL SETUP took on mype=0 [seconds]
runtime setup total 122.529007

runtime setup mesh 15.724205
runtime setup ocean 1.51957893
runtime setup forcing 3.497099876E-2
runtime setup ice 8.305215836E-2
runtime setup restart 105.131508
runtime setup other 3.56900692E-2"

The reading of restarts is OK like this (https://gitlab.dkrz.de/FESOM/fesom2/-/commit/6619be5b3e9d4ad128eb9ec6ee192bdf1faa90e4#c62db497f6f45aec1ab97cf4bc52c005e29c68df ).

However, writing (every 3 hours, following the DYAMOND protocol) was still as slow as before. In the end, we wrote the whole 3D output at once as one big chunk (with @hegish: https://gitlab.dkrz.de/FESOM/fesom2/-/commit/0a61861a49c78e1d386951335ac2ffb343ddb799), which works for the 0.25° NEMO grid, but this probably won't work for the Rossby grid because it needs too much memory.
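
(For scale, with purely illustrative numbers that are not from this thread: gathering a single 3D double-precision field of 10 million elements × 50 levels onto the writing rank already takes 10e6 × 50 × 8 B ≈ 4 GB before any NetCDF buffering, so a handful of such fields can exhaust the memory of the writing node on a large mesh.)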

@JanStreffing
Collaborator Author

Hello @hegish,
I was talking to Dima Sein just now, and we installed AWI-CM3 on Aleph for him. We concluded that fixing the restarts is currently more pressing than having a simpler rmp_ generation workflow. Would it be possible to prioritize this issue over the workflow one?
Cheers, Jan

@JanStreffing
Collaborator Author

@JanStreffing Forwarded JSC communication.

@JanStreffing
Collaborator Author

The new transposed parallel NetCDF restart by @hegish has been successfully tested by @tsemmler05, running AWI-CM3 via esm_tools on Juwels. This greatly reduced the time needed for restarts, both from using internally transposed fields and from writing the 20 restart fields into separate files in parallel. A version 0.1 of the memory dump restarts is also functional, but will be improved further to reduce the number of files it produces. That can live in a separate issue though, as the transposed parallel NetCDF restarts already allow restarts to take seconds to minutes rather than hours.
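
For reference, a minimal sketch of the kind of write pattern this enables, assuming a parallel netcdf-fortran build and that each rank owns a contiguous range of points (a simplification; the call names follow the generic netcdf-fortran API, the actual FESOM2 implementation may differ):

     ! Sketch: all ranks create the file together and each writes its own
     ! slice of the variable, so no gather onto rank 0 is needed.
     ierr = nf_create_par('ssh.nc', ior(NF_NETCDF4, NF_MPIIO), MPI_COMM_WORLD, MPI_INFO_NULL, ncid)
     ! ... define dimensions and the variable varid, then nf_enddef ...
     ierr = nf_var_par_access(ncid, varid, NF_COLLECTIVE)
     ierr = nf_put_vara_double(ncid, varid, (/my_start, 1/), (/my_count, 1/), local_field)
     ierr = nf_close(ncid)

Here my_start, my_count and local_field are placeholders for the range of points owned by the rank and its local data.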

@hegish
Collaborator

hegish commented Jul 27, 2021

Has reading the transposed restart files also been tested?

@tsemmler05
Collaborator

Actually, the test that I have done so far was with the raw restart files (restart dumps). The warm start hasn't worked out, probably because the transposed restarts haven't been copied by the esm_tools:

/p/scratch/chhb19/semmler1/runtime/awicm-3.1/D3fsp2/scripts/D3fsp2_compute_19900101-19900101_4077997.log:

48: reading restart PARALLEL for ssh at /p/scratch/chhb19/semmler1/runtime/awicm-3.1//D3fsp2//run_19900101-19900101/work/fesom.1989.oce.restart/ssh.nc
48: error in line 254 io_netcdf_file_module.F90 No such file or directory
48: 1

@JanStreffing
Collaborator Author

Ah, I see. I would assume that for the warm start via parallel netcdf restart you also need #157, because the restart files from standalone FESOM2 don't contain ice_albedo and ice_temp.

@tsemmler05
Collaborator

O.k., I did try to just manually copy the restart files into the directory before submitting the job, but that didn't work out either (it didn't even get as far as trying to read any restart files, and didn't do anything whatsoever in the pre-existing D3fsp2 experiment). I will give it another try in a fresh directory to verify that it will be looking for ice_albedo and ice_temp.

@tsemmler05
Collaborator

@hegish, @JanStreffing: Do I need to switch to the fesom branch awicm-3-frontiers_parallel-restart? Maybe only the restart dumps are merged into awicm-3-frontiers, but not the transposed restarts?

@hegish
Collaborator

hegish commented Jul 28, 2021

Only awicm-3-frontiers_parallel-restart contains the parallel and transposed restart read/write functionality. The raw restart dumps are also available via this branch.
I will push changes to the raw restart today so that only a single file is used per process. This speeds up copying the data by a factor of more than 10 (on Aleph), though you should not copy these files anyway. It will also help with the file count limit we have on some systems (e.g. Juwels).

@JanStreffing
Collaborator Author

Hey Jan, thanks for the update, I'm testing awicm-3-frontiers_parallel-restart out together with Dima Sein on Aleph right now.

@hegish
Collaborator

hegish commented Jul 28, 2021

Has reading the transposed restart files also been tested?

Good news: reading the portable restarts in parallel seems to work on Juwels with FESOM at np576, according to the run awicm-3.1//D3fsp3//run_19900101-19900101 done by @tsemmler05. Reading in parallel does not work on Aleph so far.
