FESOM2 writing restart files slowly on some machines #103

Closed

JanStreffing opened this issue Apr 19, 2021 · 16 comments

@JanStreffing
Collaborator

JanStreffing commented Apr 19, 2021

After running AWI-CM3 on Aleph for the first time, I found that we have an issue with restart files being written extremely slowly there. We are looking at a write speed of less than 3 MB/s. Some machines (e.g. Ollie and Mistral) apparently have their NetCDF IO layer configured differently and handle this for us, but on Juwels and Aleph we need to order the dimensions correctly ourselves.

I had previously found and fixed this for Juwels with these two commits:
9bce28a
d710062

Shortly after I made these changes, @hegish merged his substantial changes and improvements to the regular IO into the master branch. Merging what I did for Juwels became quite difficult after that, and we never attempted it. Now I think we have to.

To recap from the old gitlab issue: https://gitlab.dkrz.de/FESOM/fesom2/-/issues/19

We are writing restart files like this (now outdated):

        do lev=1, size1
           laux=id%var(i)%pt2(lev,:)
           t0=MPI_Wtime()
           if (size1==nod2D  .or. size2==nod2D)  call gather_nod (laux, aux)
           if (size1==elem2D .or. size2==elem2D) call gather_elem(laux, aux)
           t1=MPI_Wtime()
           if (mype==0) then
              id%error_status(c)=nf_put_vara_double(id%ncid, id%var(i)%code, (/lev, 1, id%rec_count/), (/1, size2, 1/), aux, 1); c=c+1
           end if
           t2=MPI_Wtime()
           if (mype==0 .and. size2==nod2D) write(*,*) 'nvar: ', i, 'lev: ', lev, 'gather_nod: ', t1-t0
           if (mype==0 .and. size2==nod2D) write(*,*) 'nvar: ', i, 'lev: ', lev, 'nf_put_var: ', t2-t1
        end do

We loop over the first dimension, lev. Since we are writing from Fortran, NetCDF reverses the dimension order in the output file compared to what we pass in here. The order as shown by ncdump is:
double u(time, elem, nz_1) ;
During writing this means we hold lev fixed, the dimension that changes fastest in the original data in memory, and NetCDF has to start a new write access for every nod2D block. In 2D terms: imagine you want to write one row of 100000 values, but instead you write 100000 columns of length one. This is confirmed with the darshan IO logging tool, which shows the number of seek and write accesses for writing a single restart file, as well as the size of those write accesses. You can find the full logfile attached.
Before: streffin_fesom.x_id2406075_6-29-66061-10726804095939863824_1.darshan_1_.pdf
[Screenshot from 2021-04-19: darshan access statistics before the fix]

As you can see, we are generating 84 million accesses with just a CORE2 mesh restart. On Juwels and Aleph this is painfully slow (~25 min for a CORE2 restart).
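
For illustration only (not from the FESOM2 code; the buffer and size names aux3 and nl are placeholders): the per-level loop above could in principle be replaced by a single gather of the full field and one write call per variable, e.g.:

        ! Sketch only: gather the whole field once, then issue one nf_put_vara_double
        ! call covering all levels instead of nl separate slab writes.
        if (mype==0) then
           id%error_status(c)=nf_put_vara_double(id%ncid, id%var(i)%code, &
                (/1, 1, id%rec_count/), &  ! start: first level, first node/elem, current record
                (/nl, size2, 1/),       &  ! count: all levels, all nodes/elems, one record
                aux3); c=c+1               ! aux3(nl,size2): full field gathered on rank 0
        end if

Whether gathering a full 3D field on rank 0 is affordable memory-wise is a separate question, see further down in this thread.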

For the old routines I had found a working solution on Juwels by chunking the NetCDF data in io_restart.F90:

     if (n==1) then
        id%error_status(c)=nf_def_var_chunking(id%ncid, id%var(j)%code, NF_CHUNKED, (/1/)); c=c+1
     elseif (n==2) then
        id%error_status(c)=nf_def_var_chunking(id%ncid, id%var(j)%code, NF_CHUNKED, (/1, id%dim(1)%size/)); c=c+1
     end if

and io_meandata.F90:

  if (entry%ndim==1) then
     entry%error_status(c) = nf_def_var_chunking(entry%ncid, entry%varID, NF_CHUNKED, (/1/)); c=c+1
  elseif (entry%ndim==2) then
     entry%error_status(c) = nf_def_var_chunking(entry%ncid, entry%varID, NF_CHUNKED, (/1,  entry%glsize(1)/)); c=c+1
  endif
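
(Side note, not from the original issue: the chunk layout that actually ends up in a file can be verified afterwards with ncdump's special attributes, e.g.

     ncdump -s -h <restart file>

which prints a _ChunkSizes attribute per variable, listed in the same reversed, C-style dimension order that ncdump uses for the dimensions. This makes it easy to confirm that the chunking requested above actually made it into the file.)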

After: streffin_fesom.x_id2412833_7-3-8403-13876901149393710826_1.darshan.pdf
[Screenshot from 2021-04-19: darshan access statistics after the fix]

This increased the output speed on a CORE2 restart on Juwels from 25 minutes to 40 seconds.

I think we have to revisit this issue and work on an implementation within the new IO scheme. @hegish @dsidoren

@JanStreffing JanStreffing added the bug Something isn't working label Apr 19, 2021
@koldunovn
Member

Is it maybe a good time to combine this with splitting the restarts into individual files?

@JanStreffing
Collaborator Author

Do you think splitting the restarts into individual files also has a timeline of a day or two? That's what I was hoping for with the chunking, since this issue essentially blocks us from doing anything but short scaling tests on Aleph.

@koldunovn
Member

This is a question to @hegish :) I think we agreed on this strategy, so eventually it will be implemented, but no idea on the timeline :)

@hegish
Collaborator

hegish commented Apr 19, 2021

We agreed to do it after the upcoming release.
I will have to finish the async part to make it faster, though. As this is unfinished work anyway, I really would like to finish it ASAP. We have to talk about priorities...

@trackow
Contributor

trackow commented Apr 26, 2021

I wanted to add that this was a serious issue and painfully slow on the ECMWF machine (Cray) as well, for the DYAMOND runs. It was still bearable with the CORE grid, but for the ORCA025 grid, reading/writing one (!) restart was on the order of 45+ minutes. I tried Jan's fixes once, but we could not get them to work at ECMWF when coupled to IFS.

The restart took 6500 seconds:

==========================================
MODEL SETUP took on mype=0 [seconds]
runtime setup total 6520.80273

runtime setup mesh 15.691925
runtime setup ocean 1.98124599
runtime setup forcing 2.488662
runtime setup ice 6.949400902E-2
runtime setup restart 6500.56738
runtime setup other 3.926753998E-3
============================================
MPI has been initialized in the atmospheric model

From an email from Nils Wedi: "I am not sure why you looped over 2D variables, but I recoded it and added openMP on the packing in the halo exchange, possibly I need the same for the write + gather now to do a 3D gather. you can have a look at my versions of

gen_halo_exchange.F90
io_restart.F90

I now get

MODEL SETUP took on mype=0 [seconds]
runtime setup total 122.529007

runtime setup mesh 15.724205
runtime setup ocean 1.51957893
runtime setup forcing 3.497099876E-2
runtime setup ice 8.305215836E-2
runtime setup restart 105.131508
runtime setup other 3.56900692E-2"

The reading of restarts is OK like this (https://gitlab.dkrz.de/FESOM/fesom2/-/commit/6619be5b3e9d4ad128eb9ec6ee192bdf1faa90e4#c62db497f6f45aec1ab97cf4bc52c005e29c68df ).

However, writing (every 3 hours, following the DYAMOND protocol) was still as slow as before. In the end, we wrote the whole 3D output at once as one big chunk (with @hegish: https://gitlab.dkrz.de/FESOM/fesom2/-/commit/0a61861a49c78e1d386951335ac2ffb343ddb799), which works for the 0.25° NEMO grid, but this probably won't work for the Rossby grid because it needs too much memory.
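
(For scale, with purely illustrative numbers that are not from this thread: gathering a single 3D double-precision field of 10 million elements × 50 levels onto the writing rank already takes 10e6 × 50 × 8 B ≈ 4 GB before any NetCDF buffering, so a handful of such fields can exhaust the memory of the writing node on a large mesh.)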

@JanStreffing
Collaborator Author

Hello @hegish,
I was talking to Dima Sein just now, and we installed AWI-CM3 on Aleph for him. We concluded that fixing the restarts is currently more pressing than having a simpler rmp_ generation workflow. Would it be possible to prioritize this issue over the workflow one?
Cheers, Jan

@JanStreffing
Collaborator Author

@JanStreffing Forwarded JSC communication.

@JanStreffing
Collaborator Author

The new transposed parallel NetCDF restart by @hegish has been successfully tested by @tsemmler05, running AWI-CM3 via esm_tools on Juwels. This greatly reduced the time needed for restarts, both from using internally transposed fields and from writing the 20 restart fields into separate files in parallel. A version 0.1 of the memory dump restarts is also functional, but will be improved further to reduce the number of files it produces. That can live in a separate issue though, as the transposed parallel NetCDF restarts already allow restarts to take seconds to minutes rather than hours.
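
For reference, a minimal sketch of the kind of write pattern this enables, assuming a parallel netcdf-fortran build and that each rank owns a contiguous range of points (a simplification; the call names follow the generic netcdf-fortran API, the actual FESOM2 implementation may differ):

     ! Sketch: all ranks create the file together and each writes its own
     ! slice of the variable, so no gather onto rank 0 is needed.
     ierr = nf_create_par('ssh.nc', ior(NF_NETCDF4, NF_MPIIO), MPI_COMM_WORLD, MPI_INFO_NULL, ncid)
     ! ... define dimensions and the variable varid, then nf_enddef ...
     ierr = nf_var_par_access(ncid, varid, NF_COLLECTIVE)
     ierr = nf_put_vara_double(ncid, varid, (/my_start, 1/), (/my_count, 1/), local_field)
     ierr = nf_close(ncid)

Here my_start, my_count and local_field are placeholders for the range of points owned by the rank and its local data.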

@hegish
Collaborator

hegish commented Jul 27, 2021

Has reading the transposed restart files also been tested?

@tsemmler05
Collaborator

Actually, the test that I have done so far was with the raw restart files (restart dumps). The warm start hasn't worked out, probably because the transposed restarts haven't been copied by the esm_tools:

/p/scratch/chhb19/semmler1/runtime/awicm-3.1/D3fsp2/scripts/D3fsp2_compute_19900101-19900101_4077997.log:

48: reading restart PARALLEL for ssh at /p/scratch/chhb19/semmler1/runtime/awicm-3.1//D3fsp2//run_19900101-19900101/work/fesom.1989.oce.restart/ssh.nc
48: error in line 254 io_netcdf_file_module.F90 No such file or directory
48: 1

@JanStreffing
Collaborator Author

Ah, I see. I would assume that for the warm start via parallel netcdf restart you also need #157, because the restart files from standalone FESOM2 don't contain ice_albedo and ice_temp.

@tsemmler05
Collaborator

O.k., I did try to just manually copy the restart files into the directory before submitting the job, but that didn't work out either (it didn't even get as far as trying to read any restart files, and didn't do anything whatsoever in the pre-existing D3fsp2 experiment). I will give it another try in a fresh directory to verify that it will be looking for ice_albedo and ice_temp.

@tsemmler05
Collaborator

@hegish, @JanStreffing: Do I need to switch to the fesom branch awicm-3-frontiers_parallel-restart? Maybe only the restart dumps are merged into awicm-3-frontiers, but not the transposed restarts?

@hegish
Collaborator

hegish commented Jul 28, 2021

Only awicm-3-frontiers_parallel-restart contains the parallel and transposed restart read/write functionality. The raw restart dumps are also available via this branch.
I will push changes to the raw restart today so that only a single file is used per process. This speeds up copying the data by a factor of more than 10 (on Aleph), though you should not copy these files anyway. It will also help with the file count limit we have on some systems (e.g. Juwels).

@JanStreffing
Collaborator Author

Hey Jan, thanks for the update, I'm testing awicm-3-frontiers_parallel-restart out together with Dima Sein on Aleph right now.

@hegish
Collaborator

hegish commented Jul 28, 2021

Has reading the transposed restart files also been tested?

Good news: reading the portable restarts in parallel seems to work on Juwels with FESOM at np576, according to the run awicm-3.1//D3fsp3//run_19900101-19900101 done by @tsemmler05. Reading in parallel does not work on Aleph so far.
