Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart bugfix #311

Merged
merged 12 commits into from
Oct 26, 2020
Merged

Restart bugfix #311

merged 12 commits into from
Oct 26, 2020

Conversation

nmsriram
Copy link
Collaborator

@nmsriram nmsriram commented Sep 30, 2020

PR Summary

This merge fixes a parallel restart bug in the restart path. Parallel restarts with same (or different) number of processors now work for the advection example.

Also fixes a restart bug with reflective boundary conditions with an ugly hack in mesh/mesh_refinement.cpp to apply reflective boundary conditions. It is expected that @Yurlungur or @jonahm-LANL will fix this when doing the boundary condition refactor. As a result this should get merged in before the boundary condition fix.

We can now restart with a number of processors different from the original run, and even restart serial from parallel (or vice versa) and still get identical answers.

I believe #61 is fixed with this PR.

Known issue:

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md

@nmsriram nmsriram mentioned this pull request Sep 30, 2020
@JoshuaSBrown JoshuaSBrown added the bug Something isn't working label Sep 30, 2020
Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. I don't love the function name GetNbs as nb* is a cryptic name. But since that's what they're called internally, I think it's fine.

Copy link
Collaborator

@JoshuaSBrown JoshuaSBrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still getting errors when running the tests, which tests is this supposed to fix?

27/36 Test #27: regression_test:advection_performance .............................***Failed  336.32 sec
      Start 28: regression_mpi_test:advection_performance
28/36 Test #28: regression_mpi_test:advection_performance .........................***Failed  412.27 sec
      Start 29: regression_test:restart
29/36 Test #29: regression_test:restart ...........................................   Passed   69.50 sec
      Start 30: regression_mpi_test:restart
30/36 Test #30: regression_mpi_test:restart .......................................***Failed    1.38 sec

src/mesh/mesh.cpp Outdated Show resolved Hide resolved
src/mesh/mesh.hpp Outdated Show resolved Hide resolved
src/mesh/mesh_refinement.hpp Outdated Show resolved Hide resolved
return std::vector<int>(
{refine_flag_, neighbor_rflag_, deref_count_, deref_threshold_});
}
void SetInternals(std::vector<int> a) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a should be const reference.

@@ -116,6 +116,7 @@ class RestartReader {
std::vector<T> ReadDataset(const char *name, size_t *count = nullptr) {
// Returns entire 1D array.
// status, never checked. We should...
std::vector<T> data;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Oh nevermind, I guess this didn't actually really change anything...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ifdef behavior actually what we want? (and/or is this a leftover from Athena++?)
I'm asking as having the code fail seems like a preferable solution rather than returning an empty object (due to lack of implementation or similar).

src/parthenon_manager.cpp Outdated Show resolved Hide resolved
src/mesh/mesh.cpp Outdated Show resolved Hide resolved
@JoshuaSBrown
Copy link
Collaborator

It looks like the test will run but something is still throwing an error:

/projects/opt/ppc64le/p9/openmpi/4.0.2-gcc_7.4.0/bin/mpiexec -n 2 '/home/joshuasbrown/Software/parthenon_x86_64/parthenon/build/example/advection/advection-example' -i '/home/joshuasbrown/Software/parthenon_x86_64/parthenon/tst/regression/test_suites/restart/parthinput.restart' 'parthenon/job/problem_id=gold'
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   cn2020
  Local device: mlx5_1
--------------------------------------------------------------------------
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.

Setup complete, executing driver...

cycle=0 time=0.0000000000000000e+00 dt=1.7578125000000000e-03 zone-cycles/wsec = 0.00e+00
cycle=1 time=1.7578125000000000e-03 dt=1.7578125000000000e-03 zone-cycles/wsec = 1.85e+04
cycle=2 time=3.5156250000000001e-03 dt=1.7578125000000000e-03 zone-cycles/wsec = 1.36e+05
cycle=3 time=5.2734375000000003e-03 dt=1.7578125000000000e-03 zone-cycles/wsec = 6.90e+04
[cn2020:30009] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[cn2020:30009] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Copy link
Collaborator

@JoshuaSBrown JoshuaSBrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than my optional improvements this is good. The error I was running into is located in the python regression testing framework so this has my green light.

@Yurlungur
Copy link
Collaborator

@nmsriram is this ready for merge?

src/mesh/mesh.cpp Outdated Show resolved Hide resolved
src/mesh/mesh_refinement.hpp Outdated Show resolved Hide resolved
@@ -76,6 +76,16 @@ class MeshRefinement {
// and/or in BoundaryValues::ProlongateBoundaries() (for SMR and AMR)
int AddToRefinement(ParArrayND<Real> pvar_cc, ParArrayND<Real> pcoarse_cc);
int AddToRefinement(FaceField *pvar_fc, FaceField *pcoarse_fc);
std::vector<int> GetInternals() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that int32_t is used above to read, I suggest to also be more specific here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@@ -116,6 +116,7 @@ class RestartReader {
std::vector<T> ReadDataset(const char *name, size_t *count = nullptr) {
// Returns entire 1D array.
// status, never checked. We should...
std::vector<T> data;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ifdef behavior actually what we want? (and/or is this a leftover from Athena++?)
I'm asking as having the code fail seems like a preferable solution rather than returning an empty object (due to lack of implementation or similar).

@codecov
Copy link

codecov bot commented Oct 2, 2020

Codecov Report

Merging #311 into develop will increase coverage by 0.00%.
The diff coverage is 76.81%.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop     #311   +/-   ##
========================================
  Coverage    55.82%   55.82%           
========================================
  Files          107      107           
  Lines        11103    11128   +25     
========================================
+ Hits          6198     6212   +14     
- Misses        4905     4916   +11     
Impacted Files Coverage Δ
src/mesh/mesh_refinement.cpp 27.21% <30.43%> (+0.10%) ⬆️
src/interface/variable_pack.hpp 82.70% <100.00%> (+0.26%) ⬆️
src/mesh/mesh.cpp 53.86% <100.00%> (+0.12%) ⬆️
src/outputs/parthenon_hdf5.cpp 96.61% <100.00%> (+2.34%) ⬆️
src/outputs/restart.cpp 92.73% <100.00%> (+0.06%) ⬆️
src/outputs/restart.hpp 85.18% <100.00%> (ø)
src/parthenon_manager.cpp 80.15% <100.00%> (+0.65%) ⬆️
src/mesh/amr_loadbalance.cpp 75.21% <0.00%> (-0.15%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 66afc3e...194dc5d. Read the comment docs.

@nmsriram
Copy link
Collaborator Author

@pgrete the reason to return an empty array is that we may change the physics on restart and I didn't want the request for those physics to kill the code. I could go either way on this, perhaps with a query function.

@pgrete
Copy link
Collaborator

pgrete commented Oct 23, 2020

@nmsriram there are a couple of comments left. It'd be great if they could be addressed (or replied to) before we go ahead and merge this.

Also (unrelated to this PR): I just noticed that quite a bit of the restart functionality has been implemented using macros.
What's the motivation behind those macros and could we potentially replace those (not in this PR) with standard functions?

@nmsriram nmsriram force-pushed the bugfix/restart_parallel branch from 2f76ee2 to 0e20618 Compare October 23, 2020 19:13
@nmsriram
Copy link
Collaborator Author

@pgrete the reason they are macros is because I couldn't figure out how to do it different. About a week ago @AndrewGaspar commented essentially the same thing and showed a std::function way. I'm a bit hesitant to change this code at this point and will do it in a subsequent PR.

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work. this one was a doozy.

@@ -154,7 +154,7 @@ void FillVarView(const vpack_types::VarList<T> &vars, PackIndexMap *vmap,
auto host_sp = Kokkos::create_mirror_view(Kokkos::HostSpace(), sparse_assoc);

int vindex = 0;
int sparse_start;
int sparse_start = -22000; // Initialize to an obviously wrong number
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should perhaps formalize this throughout the code and make a habit of initializing to signal NaNs. That's for another time, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create an issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes good idea

@@ -38,6 +38,52 @@

namespace parthenon {

// TODO(Yurlunger) Identify an idiomatic way of applying reflective
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO(Yurlunger) Identify an idiomatic way of applying reflective
// TODO(JMM) Identify an idiomatic way of applying reflective

src/mesh/mesh_refinement.cpp Show resolved Hide resolved
tst/regression/test_suites/restart/restart.py Show resolved Hide resolved
src/parthenon_manager.cpp Show resolved Hide resolved
@@ -192,7 +192,9 @@ ParthenonManager::ProcessPackagesDefault(std::unique_ptr<ParameterInput> &pin) {
void ParthenonManager::RestartPackages(Mesh &rm, RestartReader &resfile) {
// Restart packages with information for blocks in ids from the restart file
// Assumption: blocks are contiguous in restart file, may have to revisit this.
const IndexDomain interior = IndexDomain::interior;
// const IndexDomain interior = IndexDomain::interior;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// const IndexDomain interior = IndexDomain::interior;

@@ -129,6 +175,7 @@ void MeshRefinement::RestrictCellCenteredValues(const ParArrayND<Real> &fine,
(fine(n, 0, j, i + 1) * vol01 + fine(n, 0, j + 1, i + 1) * vol11)) /
tvol;
});
applyBounds(pmb, coarse, cib, cjb);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we only have a function like this for the 2d case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. I think this should be for all 3 cases, @nmsriram

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh brother. yes. of course. @JoshuaSBrown / @Yurlungur Should I put in another PR to fix this, or should I just depend on @Yurlungur to fix it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoever, gets to it first I suppose.

Comment on lines +201 to +202
const IndexDomain theDomain =
(output_params.include_ghost_zones ? IndexDomain::entire : IndexDomain::interior);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Comment on lines +243 to +246
auto nx1 = out_ib.e - out_ib.s + 1; // SS mb.block_size.nx1;
auto nx2 = out_jb.e - out_jb.s + 1; // SS mb.block_size.nx2;
auto nx3 = out_kb.e - out_kb.s + 1; // SS mb.block_size.nx3;
int bsize[3] = {mb.block_size.nx1, mb.block_size.nx2, mb.block_size.nx3};
Copy link
Collaborator

@JoshuaSBrown JoshuaSBrown Oct 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are confusing to me, what does the SS mean, this kind of looks like you are saying nx1 is the same as mb.block_size.nx1. If so why are you using mb.block_size.nx1 to define bsize and not the more concise nx1 nx2 and nx3 variables?

Ok, I see, nx1 etc will change depending on theDomain.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, SS are my initials and I use // SS or //SS where I think I may have screwed up and may need to go back.

@@ -180,6 +179,9 @@ class RestartReader {
// perhaps belongs in a destructor?
void Close();

// Does file have ghost cells?
int hasGhost;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why int instead of bool?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is read in from the HDF file and this requires fewer lines of code

@JoshuaSBrown
Copy link
Collaborator

    Start  1: unit_tests:Just check everything
 1/37 Test  #1: unit_tests:Just check everything ..................................   Passed    0.27 sec
      Start  2: unit_tests:Task Object Lifecycle
 2/37 Test  #2: unit_tests:Task Object Lifecycle ..................................   Passed    0.27 sec
      Start  3: unit_tests:Can create a vector-valued face-variable
 3/37 Test  #3: unit_tests:Can create a vector-valued face-variable ...............   Passed    0.25 sec
      Start  4: unit_tests:Add and Get is called
 4/37 Test  #4: unit_tests:Add and Get is called ..................................   Passed    0.26 sec
      Start  5: unit_tests:reset is called
 5/37 Test  #5: unit_tests:reset is called ........................................   Passed    0.26 sec
      Start  6: unit_tests:when hasKey is called
 6/37 Test  #6: unit_tests:when hasKey is called ..................................   Passed    0.29 sec
      Start  7: unit_tests:Physical constants
 7/37 Test  #7: unit_tests:Physical constants .....................................   Passed    0.29 sec
      Start  8: unit_tests:Checking IndexShape indices
 8/37 Test  #8: unit_tests:Checking IndexShape indices ............................   Passed    0.25 sec
      Start  9: unit_tests:Checking IndexShape cell counts
 9/37 Test  #9: unit_tests:Checking IndexShape cell counts ........................   Passed    0.24 sec
      Start 10: unit_tests:par_for loops
10/37 Test #10: unit_tests:par_for loops ..........................................   Passed    0.61 sec
      Start 11: unit_tests:nested par_for loops
11/37 Test #11: unit_tests:nested par_for loops ...................................   Passed    0.29 sec
      Start 12: unit_tests:Overlapping SpaceInstances
12/37 Test #12: unit_tests:Overlapping SpaceInstances .............................   Passed    0.34 sec
      Start 13: unit_tests:Built-in flags are registered
13/37 Test #13: unit_tests:Built-in flags are registered ..........................   Passed    0.34 sec
      Start 14: unit_tests:A Metadata flag is allocated
14/37 Test #14: unit_tests:A Metadata flag is allocated ...........................   Passed    0.26 sec
      Start 15: unit_tests:A Metadata struct is created
15/37 Test #15: unit_tests:A Metadata struct is created ...........................   Passed    0.25 sec
      Start 16: unit_tests:ParArrayND
16/37 Test #16: unit_tests:ParArrayND .............................................   Passed    0.31 sec
      Start 17: unit_tests:ParArrayND with LayoutLeft
17/37 Test #17: unit_tests:ParArrayND with LayoutLeft .............................   Passed    0.35 sec
      Start 18: unit_tests:Time simple stencil operations
18/37 Test #18: unit_tests:Time simple stencil operations .........................   Passed    0.27 sec
      Start 19: unit_tests:Check registry pressure
19/37 Test #19: unit_tests:Check registry pressure ................................   Passed    0.99 sec
      Start 20: unit_tests:Check many arrays
20/37 Test #20: unit_tests:Check many arrays ......................................   Passed    0.26 sec
      Start 21: unit_tests:Swarm memory management
21/37 Test #21: unit_tests:Swarm memory management ................................   Passed    0.31 sec
      Start 22: unit_tests:Can pull variables from containers based on Metadata
22/37 Test #22: unit_tests:Can pull variables from containers based on Metadata ...   Passed    0.67 sec
      Start 23: unit_tests:Test required/desired checking from inputs
23/37 Test #23: unit_tests:Test required/desired checking from inputs .............   Passed    0.26 sec
      Start 24: unit_tests:Parthenon Error Checking
24/37 Test #24: unit_tests:Parthenon Error Checking ...............................   Passed    0.27 sec
      Start 25: unit_tests:Check that integer division, rounded up, works
25/37 Test #25: unit_tests:Check that integer division, rounded up, works .........   Passed    0.27 sec
      Start 26: unit_tests:Check that partitioning a container works
26/37 Test #26: unit_tests:Check that partitioning a container works ..............   Passed    0.26 sec
      Start 27: performance_tests:Catch2 Container Iterator Performance
27/37 Test #27: performance_tests:Catch2 Container Iterator Performance ...........   Passed   11.80 sec
      Start 28: regression_test:advection_performance
28/37 Test #28: regression_test:advection_performance .............................   Passed  334.83 sec
      Start 29: regression_mpi_test:advection_performance

29/37 Test #29: regression_mpi_test:advection_performance .........................   Passed  1095.79 sec
      Start 30: regression_test:restart
30/37 Test #30: regression_test:restart ...........................................   Passed   69.44 sec
      Start 31: regression_mpi_test:restart
31/37 Test #31: regression_mpi_test:restart .......................................   Passed  112.76 sec
      Start 32: regression_test:calculate_pi
32/37 Test #32: regression_test:calculate_pi ......................................   Passed    2.77 sec
      Start 33: regression_mpi_test:calculate_pi
33/37 Test #33: regression_mpi_test:calculate_pi ..................................   Passed    2.92 sec
      Start 34: regression_test:advection_convergence
34/37 Test #34: regression_test:advection_convergence .............................   Passed  376.76 sec
      Start 35: regression_mpi_test:advection_convergence
35/37 Test #35: regression_mpi_test:advection_convergence .........................   Passed  682.34 sec
      Start 36: regression_test:output_hdf5
36/37 Test #36: regression_test:output_hdf5 .......................................   Passed  241.76 sec
      Start 37: regression_mpi_test:output_hdf5
37/37 Test #37: regression_mpi_test:output_hdf5 ...................................   Passed  368.85 sec

Congrats! It works.

@Yurlungur Yurlungur mentioned this pull request Oct 26, 2020
Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why my approval didn't stick. I approve.

@nmsriram nmsriram merged commit 3920633 into develop Oct 26, 2020
@Yurlungur Yurlungur deleted the bugfix/restart_parallel branch December 7, 2020 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants