
output_dft should use parallel I/O under MPI #1707

Open · FiodarM opened this issue Jul 27, 2021 · 8 comments
@FiodarM commented Jul 27, 2021

I use meep's Python interface under MPI on a cluster with 50-100 processes. With certain test parameters, memory usage during the simulation is ~25 GB and run() finishes successfully. After the simulation I try to save the frequency-domain fields to HDF5 using output_dft(). At that point memory usage rapidly increases to >100 GB, which exceeds the memory allocated for the job, so I lose the simulation results. I also tried get_dft_array() followed by saving to HDF5 manually with h5py and hit the same problem. Notably, with get_dft_array() the memory blow-up occurs during the get_dft_array() call itself, i.e. before anything is written to HDF5, so the issue does not seem to be related to HDF5. As I understand it, output_dft() just calls the corresponding C++ method and creates no Python objects, so the issue does not appear to be related to the Python interface either.
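Roughly, the workflow looks like this (a minimal sketch with placeholder cell size, source, and DFT parameters; the real simulation is much larger):

```python
import numpy as np
import h5py
import meep as mp

# Placeholder cell/source/DFT parameters, only to show the sequence of calls.
sim = mp.Simulation(
    cell_size=mp.Vector3(16, 16, 16),
    resolution=20,
    sources=[mp.Source(mp.GaussianSource(frequency=0.15, fwidth=0.1),
                       component=mp.Ez,
                       center=mp.Vector3())],
)

# Accumulate frequency-domain fields over the whole cell.
dft = sim.add_dft_fields([mp.Ez], 0.15, 0.1, 5,
                         where=mp.Volume(center=mp.Vector3(),
                                         size=mp.Vector3(16, 16, 16)))
sim.run(until_after_sources=100)

# Variant 1: built-in HDF5 output -- the memory spike happens here.
sim.output_dft(dft, "dft-fields")

# Variant 2: fetch the array and save it manually -- the spike already
# happens inside get_dft_array(), before h5py is involved.
ez = sim.get_dft_array(dft, mp.Ez, 0)
if mp.am_master():
    with h5py.File("dft-fields-manual.h5", "w") as f:
        f.create_dataset("ez", data=ez)
```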

Is such behavior normal when using many MPI processes or is this a problem with meep?

I built meep from source from the master branch.

Thanks in advance.

@FiodarM (Author) commented Jul 27, 2021

The recorded memory consumption vs. time chart:

[image: ram-use]

@FiodarM (Author) commented Jul 28, 2021

The same problem happens with sim.get_array(mp.Dielectric) as well.

@smartalecH (Collaborator)

cc @oskooi

I often see similar behavior (although never this extreme). Usually each "field dump" operation (e.g. when pulling fields after forward and adjoint runs) induces a spike in memory usage that's 25-35% of the current consumption (e.g. from 150 GB to 200 GB and then back down).

This might be due to the gathering that happens when the user makes a call like this. Simply put, every proc receives a copy of all the DFT fields (this is certainly the case when using the Python interface). The distributed-memory paradigm breaks down at these I/O junctions.
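Schematically, the scaling problem at such an I/O boundary looks like this (a generic mpi4py illustration of the gather pattern, not meep's actual code; the array size is made up):

```python
# Each rank owns one chunk of a distributed field; an all-gather leaves
# every rank holding the full array, multiplying the job-wide footprint
# by the number of ranks.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
chunk = np.zeros(1_000_000, dtype=np.complex128)            # ~16 MB owned by this rank
full = np.empty(comm.size * chunk.size, dtype=chunk.dtype)  # full array, allocated everywhere
comm.Allgather(chunk, full)
# After the gather, every rank holds ~16 MB * nranks, so the aggregate
# memory is roughly nranks times the distributed representation.
```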

We've talked about getting around this for adjoint optimization (where the forward and adjoint fields are always locally stored on each proc that owns them, and then the final recombination step is also performed locally). It might be nice to generalize that approach (e.g. when you just want to dump the fields and aren't necessarily doing adjoint optimization).

An even better solution (IMHO) is to use the hybrid multithreading/multiprocessing approach in #1628. This only requires one process per node.

@stevengj (Collaborator)

For get_dft_array, every process gets a copy of all the DFT data, so that will potentially be a huge amount of memory.

For output_dft, however, each process should ideally write only its portion of the data to the HDF5 file.

@stevengj (Collaborator)

It looks like the output_dft implementation (cc @HomerReid) currently does not use parallel I/O (meep/src/dft.cpp, line 1129 in ef2c9ca):

```cpp
file = new h5file(filename, h5file::WRITE, false /*parallel*/);
```

That means it gathers all of the data to the master process in order to write it.

This should really be fixed so that each process only computes and writes its own portion of the DFT data.
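For reference, a rough sketch of the per-process write pattern such a fix would need, shown here with generic h5py/mpi4py rather than meep's h5file class (the dataset name and sizes are hypothetical, and it assumes an MPI-enabled HDF5/h5py build):

```python
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_global = 1024                        # hypothetical global grid size
n_local = n_global // comm.size        # per-rank slab (assumes even division)
offset = comm.rank * n_local
local_slab = np.zeros(n_local)         # the portion this rank owns

with h5py.File("dft-fields.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("ez_r", shape=(n_global,), dtype="f8")
    # File and dataset creation are collective, but each rank writes only
    # its own hyperslab -- no gather to the master process.
    dset[offset:offset + n_local] = local_slab
```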

stevengj changed the title from "Huge memory consumption when using get_dft_array() or output_dft() under MPI" to "output_dft should use parallel I/O under MPI" on Jul 28, 2021
@FiodarM (Author) commented Jul 28, 2021

I also noticed another strange behavior. When I call get_epsilon() at the beginning of the simulation (only once):

  • the memory consumption increases and does not drop even after an explicit del of the array and gc.collect();
  • the simulation speed drops as if the run were single-core.

[image: ram-use]

The last two steps in the green line correspond to calls of get_dft_array() for two field components.

EDIT:
I struck through the first observation, as it was my incorrect interpretation of the measured RAM usage. The graphs show the maximum rather than the current RAM usage, because I use Slurm's sstat utility, which seems to only be able to log the maximum RAM occupied by a job.

@FiodarM (Author) commented Jul 28, 2021

@stevengj @HomerReid Is it crucial to set parallel to false for the DFT output? Will it work if one just changes it from false to true in this line?

meep/src/dft.cpp, line 1129 in ef2c9ca:

```cpp
file = new h5file(filename, h5file::WRITE, false /*parallel*/);
```

@stevengj (Collaborator)

No, simply changing false to true is not sufficient — one would have to also change the subsequent logic to ensure that each process only writes its "own" data to the file.
