
output_dft should use parallel I/O under MPI #1707

Open · FiodarM opened this issue Jul 27, 2021 · 8 comments
@FiodarM commented Jul 27, 2021

I use meep's Python interface under MPI on a cluster with 50-100 processes. With certain test parameters, memory usage during the simulation is ~25 GB and run() finishes successfully. After the simulation I try to save the frequency-domain fields to HDF5 using output_dft(). At that point memory usage rapidly increases to >100 GB, which exceeds the memory allocated for the job, so I lose the simulation results. I also tried get_dft_array() followed by saving to HDF5 manually with h5py and hit the same problem. Notably, with get_dft_array() the memory blow-up occurs during the get_dft_array() call itself, i.e. before anything is written to HDF5, so the issue does not seem to be related to HDF5. As I understand it, output_dft() just calls the corresponding C++ method and creates no Python objects, so the issue does not appear to be related to the Python interface either.
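Roughly, the workflow looks like this (a minimal sketch with placeholder cell size, source, and DFT parameters; the real simulation is much larger):

```python
import numpy as np
import h5py
import meep as mp

# Placeholder cell/source/DFT parameters, only to show the sequence of calls.
sim = mp.Simulation(
    cell_size=mp.Vector3(16, 16, 16),
    resolution=20,
    sources=[mp.Source(mp.GaussianSource(frequency=0.15, fwidth=0.1),
                       component=mp.Ez,
                       center=mp.Vector3())],
)

# Accumulate frequency-domain fields over the whole cell.
dft = sim.add_dft_fields([mp.Ez], 0.15, 0.1, 5,
                         where=mp.Volume(center=mp.Vector3(),
                                         size=mp.Vector3(16, 16, 16)))
sim.run(until_after_sources=100)

# Variant 1: built-in HDF5 output -- the memory spike happens here.
sim.output_dft(dft, "dft-fields")

# Variant 2: fetch the array and save it manually -- the spike already
# happens inside get_dft_array(), before h5py is involved.
ez = sim.get_dft_array(dft, mp.Ez, 0)
if mp.am_master():
    with h5py.File("dft-fields-manual.h5", "w") as f:
        f.create_dataset("ez", data=ez)
```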

Is such behavior normal when using many MPI processes or is this a problem with meep?

I built meep from source from the master branch.

Thanks in advance.

@FiodarM (Author) commented Jul 27, 2021

The recorded memory consumption vs. time chart:

[image: ram-use]

@FiodarM (Author) commented Jul 28, 2021

The same problem happens with sim.get_array(mp.Dielectric) as well.

@smartalecH (Collaborator)

cc @oskooi

I often see similar behavior (although never this extreme). Usually each "field dump" operation (e.g. when pulling fields after forward and adjoint runs) induces a spike in memory usage that's 25-35% of the current consumption (e.g. from 150 GB to 200 GB and then back down).

This might be due to the gathering that happens when the user makes a call like this. Simply put, every proc receives a copy of all the DFT fields (this is certainly the case when using the Python interface). The distributed-memory paradigm breaks down at these I/O junctions.
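Schematically, the scaling problem at such an I/O boundary looks like this (a generic mpi4py illustration of the gather pattern, not meep's actual code; the array size is made up):

```python
# Each rank owns one chunk of a distributed field; an all-gather leaves
# every rank holding the full array, multiplying the job-wide footprint
# by the number of ranks.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
chunk = np.zeros(1_000_000, dtype=np.complex128)            # ~16 MB owned by this rank
full = np.empty(comm.size * chunk.size, dtype=chunk.dtype)  # full array, allocated everywhere
comm.Allgather(chunk, full)
# After the gather, every rank holds ~16 MB * nranks, so the aggregate
# memory is roughly nranks times the distributed representation.
```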

We've talked about getting around this for adjoint optimization (where the forward and adjoint fields are always locally stored on each proc that owns them, and then the final recombination step is also performed locally). It might be nice to generalize that approach (e.g. when you just want to dump the fields and aren't necessarily doing adjoint optimization).

An even better solution (IMHO) is to use the hybrid multithreading/multiprocessing approach in #1628. This only requires one process per node.

@stevengj (Collaborator)

For get_dft_array, every process gets a copy of all the DFT data, so that will potentially be a huge amount of memory.

For output_dft, however, each process should ideally write only its portion of the data to the HDF5 file.

@stevengj (Collaborator)

It looks like the output_dft implementation (cc @HomerReid) currently does not use parallel I/O (meep/src/dft.cpp, line 1129 in ef2c9ca):

```cpp
file = new h5file(filename, h5file::WRITE, false /*parallel*/);
```

That means it gathers all of the data to the master process in order to write it.

This should really be fixed so that each process only computes and writes its own portion of the DFT data.
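For reference, a rough sketch of the per-process write pattern such a fix would need, shown here with generic h5py/mpi4py rather than meep's h5file class (the dataset name and sizes are hypothetical, and it assumes an MPI-enabled HDF5/h5py build):

```python
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_global = 1024                        # hypothetical global grid size
n_local = n_global // comm.size        # per-rank slab (assumes even division)
offset = comm.rank * n_local
local_slab = np.zeros(n_local)         # the portion this rank owns

with h5py.File("dft-fields.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("ez_r", shape=(n_global,), dtype="f8")
    # File and dataset creation are collective, but each rank writes only
    # its own hyperslab -- no gather to the master process.
    dset[offset:offset + n_local] = local_slab
```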

stevengj changed the title from "Huge memory consumption when using get_dft_array() or output_dft() under MPI" to "output_dft should use parallel I/O under MPI" on Jul 28, 2021
@FiodarM (Author) commented Jul 28, 2021

I also noticed another strange behavior. When I call get_epsilon() at the beginning of the simulation (only once):

  • the memory consumption increases and does not drop even after an explicit del of the array and gc.collect();
  • the simulation speed drops as if the run were single-core.

[image: ram-use]

The last two steps in the green line correspond to calls of get_dft_array() for two field components.

EDIT:
I struck through the first observation, as it was my incorrect interpretation of the measured RAM usage. The graphs show the maximum rather than the current RAM usage, because I use Slurm's sstat utility, which seems to only be able to log the maximum RAM occupied by a job.

@FiodarM (Author) commented Jul 28, 2021

@stevengj @HomerReid Is it crucial to set parallel to false for the DFT output? Will it work if one just changes it from false to true in this line?

meep/src/dft.cpp, line 1129 in ef2c9ca:

```cpp
file = new h5file(filename, h5file::WRITE, false /*parallel*/);
```

@stevengj (Collaborator)

No, simply changing false to true is not sufficient — one would have to also change the subsequent logic to ensure that each process only writes its "own" data to the file.
