
Fix MPI tag creation #661

Merged: 14 commits into develop from pgrete/fix_physid on Apr 28, 2022
Conversation

@pgrete (Collaborator) commented Mar 25, 2022

PR Summary

Fixes #660

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur (Collaborator) left a comment

Overall this looks right to me. At some point, I do think we should rethink how we do the boundary variables more deeply than this. But this at least gets us to a "correct" implementation.

Thanks for digging into this!

@lroberts36 (Collaborator) commented:

Doesn't this still leave open the possibility that the phys_id overruns the five bits reserved for it in BoundaryBase::CreateBvalsMPITag? A check for this should at least be added, as noted in the TODO at https://github.com/lanl/parthenon/blob/1d72227179ce2e34dcba9736587c0f167326d32b/src/bvals/bvals.cpp#L136.

@pgrete (Collaborator, Author) commented Mar 30, 2022

Doesn't this still leave open the possibility that the phys_id overruns the five bits reserved for it in BoundaryBase::CreateBvalsMPITag? A check for this should at least be added, as noted in the TODO at https://github.com/lanl/parthenon/blob/1d72227179ce2e34dcba9736587c0f167326d32b/src/bvals/bvals.cpp#L136.

Good point. I added an appropriate check with an error message.
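(For illustration, a minimal sketch of what such a guard could look like, assuming the pre-existing tag layout (lid << 11) | (bufid << 5) | phys and the PARTHENON_REQUIRE_THROWS macro; the exact signature and message in the PR may differ:)

int BoundaryBase::CreateBvalsMPITag(int lid, int bufid, int phys) {
  // phys occupies the lowest 5 bits of the tag, so it must stay below 32
  PARTHENON_REQUIRE_THROWS(phys < 32,
      "phys_id overflows the 5 bits reserved for it in the MPI tag.")
  return (lid << 11) | (bufid << 5) | phys;
}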

@pgrete (Collaborator, Author) commented Mar 30, 2022

While going through this PR again, I think this needs additional input from @brryan.

For BoundarySwarms, the id is always 0 (which is technically reserved for AMR), see
https://github.com/lanl/parthenon/blob/97f181b79f145946372a64b9ba2e0382601df632/src/bvals/bvals.hpp#L125
as also set here (for BoundarySwarms):
https://github.com/lanl/parthenon/blob/97f181b79f145946372a64b9ba2e0382601df632/src/bvals/bvals.cpp#L197
but
https://github.com/lanl/parthenon/pull/661/files#diff-05bb8e353c4c92e6744e925fad3f395d17ea4e7a98f61978e183873d09e5b11c
actually uses the pbval pointer and not the pbswarm pointer.

So it looks to me like some things are not properly aligned.
Any comments (also from others)?

@Yurlungur (Collaborator) commented:

Is the swarm thing not automatically fixed by just setting a bit for swarms? Or alternatively, making a new communicator for swarms?

@pgrete (Collaborator, Author) commented Mar 30, 2022

Is the swarm thing not automatically fixed by just setting a bit for swarms? Or alternatively, making a new communicator for swarms?

Yes, there are various straightforward fixes. I think it'd make sense to go with the one that @brryan considers most appropriate given the other open PRs -- especially thinking ahead in terms of supporting multiple Swarms.
So I could imagine that a separate communicator would make the most sense.

@brryan (Collaborator) commented Mar 31, 2022

Thanks for pinging me on this -- yes I would imagine this PR would potentially cause issues for swarms. I think @Yurlungur / @pgrete 's solution of one communicator per swarm is a good path forward. Right now all particle variables are always communicated simultaneously so there shouldn't be any additional subtleties to worry about at the moment. I am out of the office this week but I can think through this in detail on Monday. Probably the tracers example with MPI is the best place to test these changes, since that has both cc vars and a swarm, but I don't think that's included in our automated testing currently.

@@ -41,7 +41,7 @@ namespace parthenon {
BoundarySwarm::BoundarySwarm(std::weak_ptr<MeshBlock> pmb)
: bswarm_index(), pmy_block(pmb), pmy_mesh_(pmb.lock()->pmy_mesh) {
#ifdef MPI_PARALLEL
-  swarm_id_ = pmb.lock()->pbval->bvars_next_phys_id_;
+  swarm_id_ = pmb.lock()->pbval->AdvanceCounterPhysID(1);
Collaborator comment:

Suggested change:
-  swarm_id_ = pmb.lock()->pbval->AdvanceCounterPhysID(1);
+  swarm_id_ = pmb.lock()->pbswarm->AdvanceCounterPhysID(1);

Yes this is a bug, thanks for noticing this @pgrete!

@brryan (Collaborator) commented Apr 4, 2022

@pgrete Looking at this now, my thinking for moving forward with swarm phys ids is to:

  1. Change to pbswarm in src/bvals/bvals_swarm.cpp.
  2. Use the same logic in AdvanceCounterPhysID for swarms as for variables, except have num_phys always be 1, i.e., the counter is incremented once per swarm (void argument):
int BoundarySwarm::AdvanceCounterPhysID() {
#ifdef MPI_PARALLEL
  int start_id = bvars_next_phys_id_;
  bvars_next_phys_id_ += 1;
  PARTHENON_REQUIRE_THROWS(
      bvars_next_phys_id_ < 32,
      "Next phys_id would be >= 32, which is currently not supported as Parthenon only "
      "reserves 5 bits for phys_id. Please open an issue on GitHub if you see this "
      "message to discuss options.")
  return start_id;
#else
  return 0;
#endif
}

In this pattern each swarm has a specific phys_id. I don't anticipate needing large numbers of swarms, so a small bit limit might be fine here. To generalize, each swarm could have its own communicator later on.

One thing this prevents is different communication patterns for different subsets of a swarm's ParticleVariables. We always communicate all ParticleVariables currently and I think this is mostly reasonable but probably could lead to unnecessary communication in some cases. I think, though, that that is an issue best addressed later if necessary (dealing with that would probably necessitate a one-communicator-per-swarm model).

Does that seem reasonable to you? I certainly could be missing something.

@Yurlungur (Collaborator) commented:

@pgrete Looking at this now, my thinking for moving forward with swarm phys ids is to:

1. Change to pbswarm in src/bvals/bvals_swarm.cpp.

2. Use the same logic in AdvanceCounterPhysID for swarms as for variables, except have num_phys always be 1, i.e., the counter is incremented once per swarm (void argument). [...]

Does that seem reasonable to you? I certainly could be missing something.

Won't this fail unless there's a separate communicator for swarms vs grid variables?

@brryan (Collaborator) commented Apr 4, 2022

Won't this fail unless there's a separate communicator for swarms vs grid variables?

Oh yes, you're right... I suppose we now need either (1) a separate communicator for swarms or (2) a bit reserved in the tag for swarm vs. cc variables.

@Yurlungur (Collaborator) commented:

I am in favor of a separate communicator for swarms, which frees up more bits for variables in their communicator.
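(For reference, a minimal sketch of what a separate communicator means at the MPI level, assuming a duplicate of MPI_COMM_WORLD stored under an assumed name mpi_comm_swarm; the actual PR stores its communicators in a map keyed by name:)

MPI_Comm mpi_comm_swarm;
// Duplicating MPI_COMM_WORLD gives swarms their own tag space: a message posted on
// the duplicate can never match a receive posted on the original communicator.
MPI_Comm_dup(MPI_COMM_WORLD, &mpi_comm_swarm);
// BoundarySwarm::Send/Receive would then pass mpi_comm_swarm instead of MPI_COMM_WORLD.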

@pgrete (Collaborator, Author) commented Apr 8, 2022

I have now added a separate communicator (in addition to the tag fix, which is technically not required [any more]).
However, something goes wrong and I get a deadlock in Receive. Just replacing the three MPI_COMM_SWARM with MPI_COMM_WORLD in the BoundarySwarm::Send and BoundarySwarm::Receive methods makes everything work again.
If anyone has a quick idea about what I'm missing, I'm happy for any input.

@brryan (Collaborator) commented Apr 8, 2022

@pgrete Thanks for adding the swarm communicator! Unrelated (or not) I just ran into an issue today with particles hanging with MPI (on my heavily modified branches), I'm guessing due to Swarm::Receive. I'll take a look at your branch now and see if I notice anything.

@pgrete force-pushed the pgrete/fix_physid branch from a1188f0 to 5842a27 on April 8, 2022 at 20:53
@pgrete force-pushed the pgrete/fix_physid branch from 6b1f580 to 08090a7 on April 10, 2022 at 18:54
@pgrete requested review from Yurlungur and brryan on April 13, 2022 at 16:30
@pgrete (Collaborator, Author) commented Apr 13, 2022

Whether we need resetBoundary or not depends on a basic design decision.
For variables across multiple stages, the fluxes and boundary vars are reused.
So, depending on whether we want to allow sending data and fluxes from containers other than the base container, we need to repoint the data pointer or not.
AthenaPK currently works under the assumption that only the base stage contains data/fluxes to be sent.

@Yurlungur (Collaborator) commented:

AthenaPK currently works under the assumption that only the base stage contains data/fluxes to be sent.

I don't think Phoebus or rage make this assumption. But I think in pack_in_one, it should not matter, because we use restrict_in_one to do the restriction, right? In which case, the metadata in vbvar doesn't participate. Or do we still use per-meshblock restriction?

@lroberts36 (Collaborator) commented:

Yes, anywhere we are working with MeshData instead of MeshBlocks I think it is unnecessary.

@brryan (Collaborator) left a comment

LGTM! A long-overdue fix.

- ctest -R regression_mpi_test:output_hdf5
# and one that uses swarms
- make -j${J} particle-leapfrog
- ctest -R regression_mpi_test:particle_leapfrog
Collaborator comment:

Thanks!

@lroberts36 mentioned this pull request on Apr 19, 2022
@pgrete (Collaborator, Author) commented Apr 22, 2022

Quick update: the fix I tried didn't work after reintroducing the resetBoundary calls, as there is now a clash between (reused) bvars across stages.
I need to investigate further...

@pgrete (Collaborator, Author) commented Apr 25, 2022

@brryan @Yurlungur this is ready for a complete re-review now.
I think I got everything (including an index bug that is currently in develop and caused the pipeline to fail there [1]).
This PR actually fixes an additional bug around MPI comm (apart from the tags): the bval should now be properly shared among all shallow copies of each Variable (originally there were multiple extra copies because the PARTHENON_REQUIRE conditional was off).
I can now see how this could have messed up comms in some circumstances.

[1] I think it may be worthwhile to make sure we always trigger the full suite before merging any open PR in the future.
[edit] This reminds me that I need to fix/remove the coverage stage, as the original sparse PR resulted in the coverage stage taking hours and then timing out. [/edit]

@pgrete (Collaborator, Author) commented Apr 25, 2022

Also pinging @lroberts36, as this is now ready to be tested against the MPI issue you encountered for the sparse comm.

@Yurlungur (Collaborator) commented:

Thanks @pgrete, I'll try to look at this today.

[1] I think it may be worthwhile to make sure we always trigger the full suite before merging any open PR in the future.
[edit] This reminds me that I need to fix/remove the coverage stage, as the original sparse PR resulted in the coverage stage taking hours and then timing out. [/edit]

I think it depends. For infrastructure PRs I think that's for sure. For smaller changes probably not necessary. We can use our judgement.

@pgrete (Collaborator, Author) commented Apr 26, 2022

One more comment/conundrum regarding the failing regression test on Spock #659

The regression_mpi_test:output_hdf5 currently passes with develop but it still fails with this PR.
I just confirmed this in a single session (same node) with fresh binaries.
Even the "conserved" quantity is off, see advection_3d.hst of the regression test (third column):

  #  History data
  # [1]=time     [2]=dt       [3]=total_advected  [4]=max_advected  [5]=min_advected
    0.00000e+00  4.68750e-03  1.39160e-02  1.00000e+00  0.00000e+00
    2.53125e-01  4.68750e-03  1.39160e-02  9.38903e-01  0.00000e+00
    5.01563e-01  4.68750e-03  1.39160e-02  7.44601e-01  6.35003e-16
    7.54687e-01  4.68750e-03  1.39160e-02  5.89906e-01  1.27520e-09
    1.00000e+00  3.12500e-03  1.39275e-02  4.80914e-01  1.45889e-07

So I'm now wondering if there's something else we're missing (here).

@pgrete (Collaborator, Author) commented Apr 26, 2022

Update: develop is also broken (just in a slightly different way).
Running the test code for longer results in

cycle=288 time=1.3499999999999928e+00 dt=4.6874999999999998e-03 zone-cycles/wsec_step=4.06e+06 wsec_total=1.14e+01 wsec_step=2.31e-02 zone-cycles/wsec=3.84e+06 wsec_AMR=1.36e-03
terminate called after throwing an instance of 'std::runtime_error'
  what():  ### PARTHENON ERROR
  Condition:   !pvar_cc->IsAllocated()
  Message:     FinishRecvSameLevel: Received variable that was not allocated on sending block but it is allocated on receiving block
  File:        /ccs/home/pgrete/src/parthenon/src/mesh/amr_loadbalance.cpp
  Line number: 1313

in both cases. So something is definitely off with MPI comm (and it may or may not be system/env dependent).

@Yurlungur (Collaborator) commented:

Oof. I guess we should figure out what's going on here before merge. I also still need to run it through Phoebus to see if that breaks anything... though if develop is broken, it probably will.

@Yurlungur (Collaborator) left a comment

Overall looks reasonable to me, but I want to run it through phoebus and understand what is causing the tests to fail before merge.

Comment on lines +145 to +147
This means that any kind of communication (most prominently flux correction and ghost zone
exchange) of a given variable at a given stage should not be interleaved with any other
modifications/communication of said variable as it may result in undefined behavior.
Collaborator comment:

We can send and then modify before receiving, right? Although obviously you probably don't want to do that.

Collaborator (Author) comment:

It all depends on the driver design (and how the tasks are ordered in one or more task regions).
My intention with this comment was that people should be really mindful of the async nature of tasks in combination with shared data (flux, coarse, and ghost zone buffers) across stages/containers of a single variable.

Comment on lines +288 to +290
// Technically we could query MPI_TAG_UB here. However, we use a fixed (lower) bound
// guaranteed by the standard for now to avoid additional function calls or
// adding a new internal upper-limit variable.
Collaborator comment:

I am in favor of just using the standard-guaranteed value forever. This is better for portability. Otherwise I can imagine the code dying on some machines and not others...

Collaborator (Author) comment:

Technically, querying MPI_TAG_UB would provide the actual limit for the given MPI library.
Practically, I'm also in favor of keeping the hard-coded limit for now, until this actually becomes an issue.
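(For completeness, a minimal sketch of how the actual limit could be queried at runtime; MPI_TAG_UB is a predefined attribute of MPI_COMM_WORLD, and the standard only guarantees a minimum of 32767:)

int *tag_ub_ptr = nullptr;
int flag = 0;
MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
// Fall back to the standard-guaranteed minimum if the attribute is (unexpectedly) not set.
const int tag_ub = flag ? *tag_ub_ptr : 32767;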

Comment on lines +193 to +196
// TODO(someone): double check if following todo is still applicable with the
// "restrict-in-one" machinery. The original code added `vbvar` to
// `pmb->pbval->bvars`, but we already do this in the constructor. Also, we
// potentially don't iterate over `bvars` any more in the "restrict-in-one" case.
Collaborator comment:

RestrictBoundaries is called per meshblock because the physical boundary condition machinery is still per-meshblock. We need to move the boundary stuff to be "in-one" for this TODO to become non-relevant.

@@ -485,6 +478,11 @@ Mesh::Mesh(ParameterInput *pin, ApplicationInput *app_in, Packages_t &packages,

mesh_data.SetMeshPointer(this);

resolved_packages = ResolvePackages(packages);
Collaborator comment:

oops good catch

Comment on lines +1056 to +1057
// TODO(mpi people) do we still need the pbval part? Discuss also in the context of
// other than cellvariables, see comment above on communicators.
Collaborator comment:

Is there any non-"in one" communication we still do?

Collaborator (Author) comment:

If I understood your comment above correctly, then RestrictBoundaries may be the last place.
However, as far as I can tell, the following line can independently be removed because the vbvar ctor already enlists the vbvars in the pbval list.

Collaborator comment:

That seems reasonable.

@pgrete (Collaborator, Author) commented Apr 27, 2022

Oof. I guess we should figure out what's going on here before merge. I also still need to run it through Phoebus to see if that breaks anything... though if develop is broken, it probably will.

Overall looks reasonable to me, but I want to run it through phoebus and understand what is causing the tests to fail before merge.

Are you referring to the failed tests on Spock? I'm still not sure what's going on there but I was not able to reproduce the issue anywhere else so far.
Given that current develop has clear bugs (tags, vbars, and index) that are all (hopefully) fixed here, my preference is to merge this first and then tackle the other issues #659 #668

@Yurlungur (Collaborator) commented:

Given that current develop has clear bugs (tags, vbars, and index) that are all (hopefully) fixed here, my preference is to merge this first and then tackle the other issues #659 #668

That works for me. I'll try running a test with Phoebus today.

@Yurlungur (Collaborator) commented:

I ran a 2d blast wave in Phoebus with the updated machinery, and 24 MPI ranks. Things seemed to work just fine. Might be good to run a more strenuous test such as a radiation test though, if @brryan or @lroberts36 feel up to it.

@Yurlungur (Collaborator) left a comment

Since the updated machinery seems to not break Phoebus's integration, I'm calling this good.

@pgrete (Collaborator, Author) commented Apr 27, 2022

@brryan or @lroberts36 do you also want to give this PR another look before merging?

@lroberts36 (Collaborator) left a comment

LGTM. I also have this merged into #663.

return (lid << 11) | (bufid << 5) | phys;
int BoundaryBase::CreateBvalsMPITag(int lid, int bufid) {
PARTHENON_REQUIRE_THROWS(lid >= 0 && bufid >= 0, "Ids are expected to be positive.")
int tag = (lid << 6) | bufid;
Collaborator comment:

Should we be saving more bits for the bufid now that there are no physics bits?

Collaborator (Author) comment:

I don't think this is required at the moment, as we have a maximum of 57 buffers globally.
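(As a rough sanity check, assuming the standard-guaranteed minimum tag limit of 32767, i.e. 15 usable bits: tag = (lid << 6) | bufid leaves 6 bits for bufid (up to 63, comfortably above 57) and 9 bits for lid, so local block ids up to 511 fit before the guaranteed limit is hit:)

// Worst case under the minimum guaranteed MPI tag limit of 32767.
constexpr int max_bufid = 57;   // stated global maximum number of buffers
constexpr int max_lid = 511;    // largest lid that fits in the remaining 9 bits
static_assert(((max_lid << 6) | max_bufid) <= 32767,
              "tag stays within the standard-guaranteed MPI tag range");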

@@ -212,6 +212,11 @@ class Mesh {
// size of default MeshBlockPacks
int default_pack_size_;

#ifdef MPI_PARALLEL
// Global map of MPI comms for separate variables
std::map<std::string, MPI_Comm> mpi_comm_map_;
Collaborator comment:

Maybe should be unordered_map, although I doubt it matters.

Collaborator (Author) comment:

I had a misplaced const that did nothing and that I am about to fix. I will update this at the same time.
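(A minimal sketch of how such a per-label communicator map could be used, purely for illustration; the GetMPIComm helper name and the lazy-creation logic here are assumptions, not necessarily what the PR implements:)

// Hypothetical helper: return (and lazily create) the communicator for a given label.
MPI_Comm Mesh::GetMPIComm(const std::string &label) {
  auto it = mpi_comm_map_.find(label);
  if (it == mpi_comm_map_.end()) {
    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);  // each label gets its own tag space
    it = mpi_comm_map_.emplace(label, comm).first;
  }
  return it->second;
}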

@pgrete (Collaborator, Author) commented Apr 28, 2022

Thanks for double checking with Phoebus. Pulling the trigger now so that develop works again (pending sparse variables).

@pgrete enabled auto-merge (squash) on April 28, 2022 at 08:10
@pgrete disabled auto-merge on April 28, 2022 at 10:03
@pgrete enabled auto-merge (squash) on April 28, 2022 at 10:04
@pgrete merged commit f3d5432 into develop on Apr 28, 2022
@pgrete deleted the pgrete/fix_physid branch on May 17, 2022 at 10:19
Successfully merging this pull request may close: Are we properly incrementing phys_id? (#660)