FESOM2 writing restart files slowly on some machines #103
Comments
Is it maybe a good time to combine it with splitting restarts into individual files? |
Do you think splitting restarts into individual files also has a timeline of a day or two? That's what I was hoping for with chunking, since the slow restarts essentially block us from doing anything but short scaling tests on Aleph. |
This is a question to @hegish :) I think we agreed on this strategy, so eventually it will be implemented, but no idea on the timeline :) |
We agreed to do it after the upcoming release. |
I wanted to add that this was a serious issue and painfully slow on the ECMWF machine (Cray) for the Dyamond runs as well. It was still bearable with the CORE grid, but for the ORCA025 grid, reading/writing one (!) restart was on the order of 45+ minutes. I tried the fixes from Jan once, but we could not get them to work at ECMWF when coupled to IFS; the restart took 6500 seconds.
From an email from Nils Wedi: "I am not sure why you looped over 2D variables, but I recoded it and added OpenMP on the packing in the halo exchange; possibly I need the same for the write + gather now to do a 3D gather. You can have a look at my versions of gen_halo_exchange.F90. I now get MODEL SETUP took on mype=0 [seconds] …"
The reading of restarts is OK like this (https://gitlab.dkrz.de/FESOM/fesom2/-/commit/6619be5b3e9d4ad128eb9ec6ee192bdf1faa90e4#c62db497f6f45aec1ab97cf4bc52c005e29c68df). However, writing (every 3 hours, following the Dyamond protocol) was still as slow as before. In the end, we wrote the whole 3D output at once as one big chunk (with @hegish: https://gitlab.dkrz.de/FESOM/fesom2/-/commit/0a61861a49c78e1d386951335ac2ffb343ddb799), which works for the 0.25° NEMO grid, but this won't work for the Rossby grid, possibly because it needs too much memory. |
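Roughly speaking, the "one big chunk" approach replaces many per-slice writes with a single `nf90_put_var` call over the full gathered field. The sketch below only illustrates that idea and is not the code from the linked commit; the subroutine name, the `(nz, elem2D)` field shape, and the file dimension order are assumptions.

```fortran
! Illustrative sketch only: write the whole gathered 3D field for one
! record in a single call. Assumes the file variable has Fortran
! dimension order (nz, elem2D, time).
subroutine write_field_bulk(ncid, varid, u, nz, elem2D, rec)
   use netcdf
   implicit none
   integer, intent(in) :: ncid, varid, nz, elem2D, rec
   real(kind=8), intent(in) :: u(nz, elem2D)  ! full gathered field on the writing task
   integer :: status

   ! One contiguous write covering the whole field for this time record.
   ! Fast, but the entire field must fit in memory on the writing task,
   ! which is the limitation mentioned above for very large grids.
   status = nf90_put_var(ncid, varid, u, &
                         start=[1, 1, rec], count=[nz, elem2D, 1])
   if (status /= nf90_noerr) print *, trim(nf90_strerror(status))
end subroutine write_field_bulk
```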
Hello @hegish, |
@JanStreffing Forwarding JSC communication. |
The new transposed parallel netcdf restart by @hegish has been successfully tested by @tsemmler05 running AWICM3 via esm_tools on juwels. This greatly reduced the time needed for restarts, both from using internally transposed fields and from writing the 20 restart fields into separate files in parallel. A version 0.1 of memory dump restarts is also functional, but it will be improved further to reduce the number of files it produces. This can live in a separate issue though, as the transposed parallel netcdf restarts already allow for restarts that take seconds to minutes rather than hours. |
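For context, the ingredients described here (one file per field, a decomposition-friendly on-disk layout, collective parallel writes) can be sketched as below. This is a hypothetical illustration, assuming netCDF-Fortran built with parallel HDF5/MPI-IO; the routine name, the dimension layout, and the contiguous-slab decomposition are assumptions and may differ from the actual implementation by @hegish.

```fortran
! Hypothetical sketch of a per-field, parallel restart write. Each field
! gets its own file; every MPI task writes its contiguous slab of nodes
! collectively. The exact transposition used in FESOM2 may differ.
subroutine write_restart_field_parallel(fname, vname, field, nz, my_nod, my_start, nod2D_total)
   use mpi
   use netcdf
   implicit none
   character(len=*), intent(in) :: fname, vname
   integer, intent(in) :: nz, my_nod, my_start, nod2D_total
   real(kind=8), intent(in) :: field(nz, my_nod)   ! local part of one restart field
   integer :: ncid, varid, dim_nod, dim_nz, status

   ! All tasks create the per-field file together (parallel netCDF-4).
   status = nf90_create(fname, ior(NF90_NETCDF4, NF90_MPIIO), ncid, &
                        comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)

   ! Layout chosen so that each task's node range [my_start, my_start+my_nod-1]
   ! is one contiguous slab in the file.
   status = nf90_def_dim(ncid, 'nz', nz, dim_nz)
   status = nf90_def_dim(ncid, 'nod2D', nod2D_total, dim_nod)
   status = nf90_def_var(ncid, vname, NF90_DOUBLE, [dim_nz, dim_nod], varid)
   status = nf90_enddef(ncid)

   ! Collective access: every task contributes one large write.
   status = nf90_var_par_access(ncid, varid, NF90_COLLECTIVE)
   status = nf90_put_var(ncid, varid, field, &
                         start=[1, my_start], count=[nz, my_nod])
   status = nf90_close(ncid)
end subroutine write_restart_field_parallel
```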
Has reading the transposed restart files also been tested? |
Actually, the test that I have done so far was with the raw restart files (restart dumps). The warm start hasn't worked out, probably because the transposed restarts haven't been copied by the esm tools: /p/scratch/chhb19/semmler1/runtime/awicm-3.1/D3fsp2/scripts/D3fsp2_compute_19900101-19900101_4077997.log: 48: reading restart PARALLEL for ssh at /p/scratch/chhb19/semmler1/runtime/aw |
Ah, I see. I would assume that for the warm start via parallel netcdf restart you also need #157, because the restart files from standalone FESOM2 don't contain ice_albedo and ice_temp. |
O.k., I did try to manually copy the restart files into the directory before submitting the job, but that didn't work out either (it didn't even get to reading any restart files; it didn't do anything whatsoever in the pre-existing D3fsp2 experiment). I will give it another try in a fresh directory to verify that it is indeed looking for ice_albedo and ice_temp. |
@hegish, @JanStreffing: Do I need to switch to fesom branch awicm-3-frontiers_parallel-restart? Maybe only the restart dumps are merged into awicm-3-frontiers but not the transposed restarts? |
Only awicm-3-frontiers_parallel-restart contains the parallel and transposed restart read/write functionality. The raw restart dumps are also available via this branch. |
Hey Jan, thanks for the update, I'm testing awicm-3-frontiers_parallel-restart out together with Dima Sein on Aleph right now. |
Good news: reading the portable restarts in parallel seems to work on Juwels with FESOM at np=576, according to run awicm-3.1//D3fsp3//run_19900101-19900101 done by @tsemmler05. Reading in parallel does not work on Aleph so far. |
After running AWI-CM3 on Aleph for the first time, I found that we have an issue with writing restart files extremely slowly there. We are looking at a write speed of less than 3 MB/s. Some machines (e.g. Ollie and Mistral) apparently have their NetCDF IO layer configured differently and effectively handle this for us, but on Juwels and Aleph we need to order the dimensions correctly ourselves.
I had previously found and fixed this for Juwels with these two commits:
9bce28a
d710062
Shortly after I made these changes, @hegish merged his substantial changes and improvements to the regular IO into the master branch. Merging what I did for Juwels became quite difficult thereafter, and we never attempted it. Now I think we have to.
To recap from the old gitlab issue: https://gitlab.dkrz.de/FESOM/fesom2/-/issues/19
We are writing restart files like this (now outdated):
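In outline, the pattern amounted to a per-level write loop along the lines of the following sketch (the subroutine and variable names are assumed for illustration and are not the literal io_restart.F90 code):

```fortran
! Illustrative reconstruction of the old access pattern: the level index
! is held fixed and the field is written one level at a time across all
! elements, for a file variable with Fortran dimension order
! (nz_1, elem, time).
subroutine write_field_per_level(ncid, varid, u, nz, elem2D, rec)
   use netcdf
   implicit none
   integer, intent(in) :: ncid, varid, nz, elem2D, rec
   real(kind=8), intent(in) :: u(nz, elem2D)
   integer :: k, status

   do k = 1, nz
      ! The elem2D values written here are nz entries apart in the file
      ! (nz_1 varies fastest on disk), so each level turns into a long
      ! series of tiny, strided accesses.
      status = nf90_put_var(ncid, varid, u(k, :), &
                            start=[k, 1, rec], count=[1, elem2D, 1])
   end do
end subroutine write_field_per_level
```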
We are holding the first dimension, lev, fixed. Since we are writing from Fortran, NetCDF reverses the dimension order in the output file compared to what we tell it to write here. The order as shown via ncdump:
double u(time, elem, nz_1) ;
During writing, this means we hold fixed lev, the dimension that changes most often in the original data in memory, so NetCDF has to start a new write access for every nod2D block. In 2D terms: imagine you want to write one row of 100000 values, but what you actually do is write 100000 columns of length one. This is confirmed by the darshan IO logging tool, which shows the number of seek and write accesses for writing a single restart file, as well as the size of the write accesses. You can find the full logfile attached.
Before: streffin_fesom.x_id2406075_6-29-66061-10726804095939863824_1.darshan_1_.pdf
As you can see, we are generating 84 million accesses with just a CORE2 mesh restart. On Juwels and Aleph this is painfully slow (~25 min for a CORE2 restart).
For the old routines I had found a working solution on Juwels by chunking the NetCDF data in io_restart.F90:
and io_meandata.F90:
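The exact diffs are in the two commits referenced above; in outline, the fix amounts to defining the restart variables with explicit chunk sizes so that a level-by-level write touches a few large chunks instead of millions of tiny strided regions. A minimal sketch, assuming a NETCDF4-format file and illustrative chunk sizes (the actual values are in the commits):

```fortran
! Sketch of the chunking approach; variable and dimension names are
! illustrative, not the literal io_restart.F90 / io_meandata.F90 code.
subroutine def_chunked_var(ncid, nz, chunk_elems, dim_nz, dim_elem, dim_time, varid)
   use netcdf
   implicit none
   integer, intent(in)  :: ncid, nz, chunk_elems, dim_nz, dim_elem, dim_time
   integer, intent(out) :: varid
   integer :: status

   status = nf90_def_var(ncid, 'u', NF90_DOUBLE, [dim_nz, dim_elem, dim_time], varid)
   ! Chunk over whole vertical columns and a block of elements (one time
   ! step per chunk), so writes map onto a small number of large chunks.
   status = nf90_def_var_chunking(ncid, varid, NF90_CHUNKED, [nz, chunk_elems, 1])
end subroutine def_chunked_var
```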
After: streffin_fesom.x_id2412833_7-3-8403-13876901149393710826_1.darshan.pdf
This increased the output speed on a CORE2 restart on Juwels from 25 minutes to 40 seconds.
I think we have to revisit this issue and work on an implementation within the new IO scheme. @hegish @dsidoren