
Fix MPI tag creation #661

Merged: 14 commits into develop from pgrete/fix_physid on Apr 28, 2022
Conversation

@pgrete (Collaborator) commented Mar 25, 2022

PR Summary

Fixes #660

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur (Collaborator) left a comment

Overall this looks right to me. At some point, I do think we should rethink how we do the boundary variables more deeply than this. But this at least gets us to a "correct" implementation.

Thanks for digging into this!

@lroberts36 (Collaborator) commented:

Doesn't this still leave open the possibility that the phys_id overruns the five bits reserved for it in BoundaryBase::CreateBvalsMPITag? A check for this should at least be added, as noted in the TODO at https://github.com/lanl/parthenon/blob/1d72227179ce2e34dcba9736587c0f167326d32b/src/bvals/bvals.cpp#L136.

@pgrete (Collaborator, Author) commented Mar 30, 2022

Doesn't this still leave open the possibility that the phys_id overruns the five bits reserved for it in BoundaryBase::CreateBvalsMPITag? A check for this should at least be added, as noted in the TODO at https://github.com/lanl/parthenon/blob/1d72227179ce2e34dcba9736587c0f167326d32b/src/bvals/bvals.cpp#L136.

Good point. I added an appropriate check with an error message.
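(For illustration, a minimal sketch of what such a guard could look like, assuming the pre-existing tag layout (lid << 11) | (bufid << 5) | phys and the PARTHENON_REQUIRE_THROWS macro; the exact signature and message in the PR may differ:)

int BoundaryBase::CreateBvalsMPITag(int lid, int bufid, int phys) {
  // phys occupies the lowest 5 bits of the tag, so it must stay below 32
  PARTHENON_REQUIRE_THROWS(phys < 32,
      "phys_id overflows the 5 bits reserved for it in the MPI tag.")
  return (lid << 11) | (bufid << 5) | phys;
}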

@pgrete (Collaborator, Author) commented Mar 30, 2022

While going through this PR again, I think this needs additional input from @brryan.

For BoundarySwarms, the id is always 0 (which is technically reserved for AMR), see
https://github.com/lanl/parthenon/blob/97f181b79f145946372a64b9ba2e0382601df632/src/bvals/bvals.hpp#L125
as also set here (for BoundarySwarms):
https://github.com/lanl/parthenon/blob/97f181b79f145946372a64b9ba2e0382601df632/src/bvals/bvals.cpp#L197
but
https://github.com/lanl/parthenon/pull/661/files#diff-05bb8e353c4c92e6744e925fad3f395d17ea4e7a98f61978e183873d09e5b11c
actually uses the pbval pointer and not the pbswarm pointer.

So it looks to me like some things are not properly aligned.
Any comments (also from others)?

@Yurlungur (Collaborator) commented:

Is the swarm thing not automatically fixed by just setting a bit for swarms? Or alternatively, making a new communicator for swarms?

@pgrete (Collaborator, Author) commented Mar 30, 2022

Is the swarm thing not automatically fixed by just setting a bit for swarms? Or alternatively, making a new communicator for swarms?

Yes, there are various straightforward fixes. I think it'd make sense to go with the one that @brryan considers most appropriate given the other open PRs -- especially thinking ahead in terms of supporting multiple Swarms.
So I could imagine that a separate communicator would make the most sense.

@brryan (Collaborator) commented Mar 31, 2022

Thanks for pinging me on this -- yes I would imagine this PR would potentially cause issues for swarms. I think @Yurlungur / @pgrete 's solution of one communicator per swarm is a good path forward. Right now all particle variables are always communicated simultaneously so there shouldn't be any additional subtleties to worry about at the moment. I am out of the office this week but I can think through this in detail on Monday. Probably the tracers example with MPI is the best place to test these changes, since that has both cc vars and a swarm, but I don't think that's included in our automated testing currently.

@@ -41,7 +41,7 @@ namespace parthenon {
BoundarySwarm::BoundarySwarm(std::weak_ptr<MeshBlock> pmb)
: bswarm_index(), pmy_block(pmb), pmy_mesh_(pmb.lock()->pmy_mesh) {
#ifdef MPI_PARALLEL
-  swarm_id_ = pmb.lock()->pbval->bvars_next_phys_id_;
+  swarm_id_ = pmb.lock()->pbval->AdvanceCounterPhysID(1);
Collaborator comment:

Suggested change:
-  swarm_id_ = pmb.lock()->pbval->AdvanceCounterPhysID(1);
+  swarm_id_ = pmb.lock()->pbswarm->AdvanceCounterPhysID(1);

Yes this is a bug, thanks for noticing this @pgrete!

@brryan (Collaborator) commented Apr 4, 2022

@pgrete Looking at this now, my thinking for moving forward with swarm phys ids is to:

  1. Change to pbswarm in src/bvals/bvals_swarm.cpp.
  2. Use the same logic in AdvanceCounterPhysID for swarms as for variables, except have num_phys always be 1, i.e., the counter is incremented once per swarm (void argument):
int BoundarySwarm::AdvanceCounterPhysID() {
#ifdef MPI_PARALLEL
  int start_id = bvars_next_phys_id_;
  bvars_next_phys_id_ += 1;
  PARTHENON_REQUIRE_THROWS(
      bvars_next_phys_id_ < 32,
      "Next phys_id would be >= 32, which is currently not supported as Parthenon only "
      "reserves 5 bits for phys_id. Please open an issue on GitHub if you see this "
      "message to discuss options.")
  return start_id;
#else
  return 0;
#endif
}

In this pattern each swarm has a specific phys_id. I don't anticipate needing large numbers of swarms, so a small bit limit might be fine here. To generalize, each swarm could have its own communicator later on.

One thing this prevents is different communication patterns for different subsets of a swarm's ParticleVariables. We always communicate all ParticleVariables currently and I think this is mostly reasonable but probably could lead to unnecessary communication in some cases. I think, though, that that is an issue best addressed later if necessary (dealing with that would probably necessitate a one-communicator-per-swarm model).

Does that seem reasonable to you? I certainly could be missing something.

@Yurlungur (Collaborator) commented:

@pgrete Looking at this now, my thinking for moving forward with swarm phys ids is to:

1. Change to pbswarm in src/bvals/bvals_swarm.cpp.

2. Use the same logic in AdvanceCounterPhysID for swarms as for variables, except have num_phys always be 1, i.e., the counter is incremented once per swarm (void argument). [...]

Does that seem reasonable to you? I certainly could be missing something.

Won't this fail unless there's a separate communicator for swarms vs grid variables?

@brryan (Collaborator) commented Apr 4, 2022

Won't this fail unless there's a separate communicator for swarms vs grid variables?

Oh yes, you're right... I suppose we now need either (1) a separate communicator for swarms or (2) a bit reserved in the tag for swarm vs. cc variables.

@Yurlungur (Collaborator) commented:

I am in favor of a separate communicator for swarms, which frees up more bits for variables in their communicator.
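(For reference, a minimal sketch of what a separate communicator means at the MPI level, assuming a duplicate of MPI_COMM_WORLD stored under an assumed name mpi_comm_swarm; the actual PR stores its communicators in a map keyed by name:)

MPI_Comm mpi_comm_swarm;
// Duplicating MPI_COMM_WORLD gives swarms their own tag space: a message posted on
// the duplicate can never match a receive posted on the original communicator.
MPI_Comm_dup(MPI_COMM_WORLD, &mpi_comm_swarm);
// BoundarySwarm::Send/Receive would then pass mpi_comm_swarm instead of MPI_COMM_WORLD.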

@pgrete (Collaborator, Author) commented Apr 8, 2022

I have now added a separate communicator (in addition to the tag fix, which is technically not required [any more]).
However, something goes wrong and I get a deadlock in Receive. Just replacing the three MPI_COMM_SWARM with MPI_COMM_WORLD in the BoundarySwarm::Send and BoundarySwarm::Receive methods makes everything work again.
If anyone has a quick idea about what I'm missing, I'm happy for any input.

@brryan (Collaborator) commented Apr 8, 2022

@pgrete Thanks for adding the swarm communicator! Unrelated (or not) I just ran into an issue today with particles hanging with MPI (on my heavily modified branches), I'm guessing due to Swarm::Receive. I'll take a look at your branch now and see if I notice anything.

@pgrete force-pushed the pgrete/fix_physid branch from a1188f0 to 5842a27 on April 8, 2022 at 20:53
@pgrete force-pushed the pgrete/fix_physid branch from 6b1f580 to 08090a7 on April 10, 2022 at 18:54
@pgrete requested review from Yurlungur and brryan on April 13, 2022 at 16:30
@pgrete (Collaborator, Author) commented Apr 13, 2022

Whether we need resetBoundary or not depends on a basic design decision.
For variables across multiple stages, the fluxes and boundary vars are reused.
So, depending on whether we want to allow sending data and fluxes from containers other than the base container, we need to repoint the data pointer or not.
AthenaPK currently works under the assumption that only the base stage contains data/fluxes to be sent.

@Yurlungur (Collaborator) commented:

AthenaPK currently works under the assumption that only the base stage contains data/fluxes to be sent.

I don't think Phoebus or rage make this assumption. But I think in pack_in_one, it should not matter, because we use restrict_in_one to do the restriction, right? In which case, the metadata in vbvar doesn't participate. Or do we still use per-meshblock restriction?

@lroberts36 (Collaborator) commented:

Yes, anywhere we are working with MeshData instead of MeshBlocks I think it is unnecessary.

@brryan (Collaborator) left a comment

LGTM! A long-overdue fix.

- ctest -R regression_mpi_test:output_hdf5
# and one that uses swarms
- make -j${J} particle-leapfrog
- ctest -R regression_mpi_test:particle_leapfrog
Collaborator comment:

Thanks!

@lroberts36 mentioned this pull request on Apr 19, 2022
@pgrete (Collaborator, Author) commented Apr 22, 2022

Quick update: the fix I tried didn't work after reintroducing the resetBoundary calls, as there is now a clash between (reused) bvars across stages.
I need to investigate further...

@pgrete (Collaborator, Author) commented Apr 25, 2022

@brryan @Yurlungur this is ready for a complete re-review now.
I think I got everything (including an index bug that is currently in develop and caused the pipeline to fail there [1]).
This PR actually fixes an additional bug around MPI comm (apart from the tags): the bval should now be properly shared among all shallow copies of each Variable (originally there were multiple extra copies because the PARTHENON_REQUIRE conditional was off).
I can now see how this could have messed up comms in some circumstances.

[1] I think it may be worthwhile to make sure we always trigger the full suite before merging any open PR in the future.
[edit] This reminds me that I need to fix/remove the coverage stage, as the original sparse PR resulted in the coverage stage taking hours and then timing out. [/edit]

@pgrete (Collaborator, Author) commented Apr 25, 2022

Also pinging @lroberts36, as this is now ready to be tested against the MPI issue you encountered for the sparse comm.

@Yurlungur (Collaborator) commented:

Thanks @pgrete, I'll try to look at this today.

[1] I think it may be worthwhile to make sure we always trigger the full suite before merging any open PR in the future.
[edit] This reminds me that I need to fix/remove the coverage stage, as the original sparse PR resulted in the coverage stage taking hours and then timing out. [/edit]

I think it depends. For infrastructure PRs I think that's for sure. For smaller changes probably not necessary. We can use our judgement.

@pgrete (Collaborator, Author) commented Apr 26, 2022

One more comment/conundrum regarding the failing regression test on Spock #659

The regression_mpi_test:output_hdf5 currently passes with develop but it still fails with this PR.
I just confirmed this in a single session (same node) with fresh binaries.
Even the "conserved" quantity is off, see advection_3d.hst of the regression test (third column):

  #  History data
  # [1]=time     [2]=dt       [3]=total_advected  [4]=max_advected  [5]=min_advected
    0.00000e+00  4.68750e-03  1.39160e-02  1.00000e+00  0.00000e+00
    2.53125e-01  4.68750e-03  1.39160e-02  9.38903e-01  0.00000e+00
    5.01563e-01  4.68750e-03  1.39160e-02  7.44601e-01  6.35003e-16
    7.54687e-01  4.68750e-03  1.39160e-02  5.89906e-01  1.27520e-09
    1.00000e+00  3.12500e-03  1.39275e-02  4.80914e-01  1.45889e-07

So I'm now wondering if there's something else we're missing (here).

@pgrete (Collaborator, Author) commented Apr 26, 2022

Update: develop is also broken (just in a slightly different way).
Running the test code for longer results in

cycle=288 time=1.3499999999999928e+00 dt=4.6874999999999998e-03 zone-cycles/wsec_step=4.06e+06 wsec_total=1.14e+01 wsec_step=2.31e-02 zone-cycles/wsec=3.84e+06 wsec_AMR=1.36e-03
terminate called after throwing an instance of 'std::runtime_error'
  what():  ### PARTHENON ERROR
  Condition:   !pvar_cc->IsAllocated()
  Message:     FinishRecvSameLevel: Received variable that was not allocated on sending block but it is allocated on receiving block
  File:        /ccs/home/pgrete/src/parthenon/src/mesh/amr_loadbalance.cpp
  Line number: 1313

in both cases. So something is definitely off with MPI comm (and it may or may not be system/env dependent).

@Yurlungur (Collaborator) commented:

Oof. I guess we should figure out what's going on here before merge. I also still need to run it through Phoebus to see if that breaks anything... though if develop is broken, it probably will.

@Yurlungur (Collaborator) left a comment

Overall looks reasonable to me, but I want to run it through phoebus and understand what is causing the tests to fail before merge.

Comment on lines +145 to +147
This means that any kind of communication (most prominently flux correction and ghost zone
exchange) of a given variable at a given stage should not be interleaved with any other
modifications/communication of said variable as it may result in undefined behavior.
Collaborator comment:

We can send and then modify before receiving, right? Although obviously you probably don't want to do that.

Collaborator (Author) comment:

It all depends on the driver design (and how the tasks are ordered in one or more task regions).
My intention with this comment was that people should be really mindful of the async nature of tasks in combination with shared data (flux, coarse, and ghost zone buffers) across stages/containers of a single variable.

Comment on lines +288 to +290
// Technically we could query MPI_TAG_UB here. However, we use a fixed (lower) bound
// guaranteed by the standard for now to avoid additional function calls or
// adding a new internal upper-limit variable.
Collaborator comment:

I am in favor of just using the standard-guaranteed value forever. This is better for portability. Otherwise I can imagine the code dying on some machines and not others...

Collaborator (Author) comment:

Technically, querying MPI_TAG_UB would provide the actual limit for the given MPI library.
Practically, I'm also in favor of keeping the hard-coded limit for now, until this actually becomes an issue.
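(For completeness, a minimal sketch of how the actual limit could be queried at runtime; MPI_TAG_UB is a predefined attribute of MPI_COMM_WORLD, and the standard only guarantees a minimum of 32767:)

int *tag_ub_ptr = nullptr;
int flag = 0;
MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
// Fall back to the standard-guaranteed minimum if the attribute is (unexpectedly) not set.
const int tag_ub = flag ? *tag_ub_ptr : 32767;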

Comment on lines +193 to +196
// TODO(someone): double check if following todo is still applicable with the
// "restrict-in-one" machinery. The original code added `vbvar` to
// `pmb->pbval->bvars`, but we already do this in the constructor. Also, we
// potentially don't iterate over `bvars` any more in the "restrict-in-one" case.
Collaborator comment:

RestrictBoundaries is called per meshblock because the physical boundary condition machinery is still per-meshblock. We need to move the boundary stuff to be "in-one" for this TODO to become non-relevant.

@@ -485,6 +478,11 @@ Mesh::Mesh(ParameterInput *pin, ApplicationInput *app_in, Packages_t &packages,

mesh_data.SetMeshPointer(this);

resolved_packages = ResolvePackages(packages);
Collaborator comment:

oops good catch

Comment on lines +1056 to +1057
// TODO(mpi people) do we still need the pbval part? Discuss also in the context of
// other than cellvariables, see comment above on communicators.
Collaborator comment:

Is there any non-"in one" communication we still do?

Collaborator (Author) comment:

If I understood your comment above correctly, then RestrictBoundaries may be the last place.
However, as far as I can tell, the following line can independently be removed because the vbvar ctor already enlists the vbvars in the pbval list.

Collaborator comment:

That seems reasonable.

@pgrete (Collaborator, Author) commented Apr 27, 2022

Oof. I guess we should figure out what's going on here before merge. I also still need to run it through Phoebus to see if that breaks anything... though if develop is broken, it probably will.

Overall looks reasonable to me, but I want to run it through phoebus and understand what is causing the tests to fail before merge.

Are you referring to the failed tests on Spock? I'm still not sure what's going on there but I was not able to reproduce the issue anywhere else so far.
Given that current develop has clear bugs (tags, vbars, and index) that are all (hopefully) fixed here, my preference is to merge this first and then tackle the other issues #659 #668

@Yurlungur (Collaborator) commented:

Given that current develop has clear bugs (tags, vbars, and index) that are all (hopefully) fixed here, my preference is to merge this first and then tackle the other issues #659 #668

That works for me. I'll try running a test with Phoebus today.

@Yurlungur (Collaborator) commented:

I ran a 2d blast wave in Phoebus with the updated machinery, and 24 MPI ranks. Things seemed to work just fine. Might be good to run a more strenuous test such as a radiation test though, if @brryan or @lroberts36 feel up to it.

@Yurlungur (Collaborator) left a comment

Since the updated machinery seems to not break Phoebus's integration, I'm calling this good.

@pgrete (Collaborator, Author) commented Apr 27, 2022

@brryan or @lroberts36 do you also want to give this PR another look before merging?

@lroberts36 (Collaborator) left a comment

LGTM. I also have this merged into #663.

return (lid << 11) | (bufid << 5) | phys;
int BoundaryBase::CreateBvalsMPITag(int lid, int bufid) {
PARTHENON_REQUIRE_THROWS(lid >= 0 && bufid >= 0, "Ids are expected to be positive.")
int tag = (lid << 6) | bufid;
Collaborator comment:

Should we be saving more bits for the bufid now that there are no physics bits?

Collaborator (Author) comment:

I don't think this is required at the moment, as we have a maximum of 57 buffers globally.
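(As a rough sanity check, assuming the standard-guaranteed minimum tag limit of 32767, i.e. 15 usable bits: tag = (lid << 6) | bufid leaves 6 bits for bufid (up to 63, comfortably above 57) and 9 bits for lid, so local block ids up to 511 fit before the guaranteed limit is hit:)

// Worst case under the minimum guaranteed MPI tag limit of 32767.
constexpr int max_bufid = 57;   // stated global maximum number of buffers
constexpr int max_lid = 511;    // largest lid that fits in the remaining 9 bits
static_assert(((max_lid << 6) | max_bufid) <= 32767,
              "tag stays within the standard-guaranteed MPI tag range");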

@@ -212,6 +212,11 @@ class Mesh {
// size of default MeshBlockPacks
int default_pack_size_;

#ifdef MPI_PARALLEL
// Global map of MPI comms for separate variables
std::map<std::string, MPI_Comm> mpi_comm_map_;
Collaborator comment:

Maybe should be unordered_map, although I doubt it matters.

Collaborator (Author) comment:

I had a misplaced const that did nothing and that I am about to fix. I will update this at the same time.
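(A minimal sketch of how such a per-label communicator map could be used, purely for illustration; the GetMPIComm helper name and the lazy-creation logic here are assumptions, not necessarily what the PR implements:)

// Hypothetical helper: return (and lazily create) the communicator for a given label.
MPI_Comm Mesh::GetMPIComm(const std::string &label) {
  auto it = mpi_comm_map_.find(label);
  if (it == mpi_comm_map_.end()) {
    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);  // each label gets its own tag space
    it = mpi_comm_map_.emplace(label, comm).first;
  }
  return it->second;
}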

@pgrete (Collaborator, Author) commented Apr 28, 2022

Thanks for double checking with Phoebus. Pulling the trigger now so that develop works again (pending sparse variables).

@pgrete enabled auto-merge (squash) on April 28, 2022 at 08:10
@pgrete disabled auto-merge on April 28, 2022 at 10:03
@pgrete enabled auto-merge (squash) on April 28, 2022 at 10:04
@pgrete merged commit f3d5432 into develop on Apr 28, 2022
@pgrete deleted the pgrete/fix_physid branch on May 17, 2022 at 10:19
Successfully merging this pull request may close: Are we properly incrementing phys_id? (#660)