Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crusher fix #1205

Closed
wants to merge 6 commits into from
Closed

Crusher fix #1205

wants to merge 6 commits into from

Conversation

psakievich
Copy link
Contributor

@psakievich psakievich commented Aug 29, 2023

Fixes solver convergence issues in run on crusher ROCM run. Not sure I understand the subtly of why yet. Below is the output of the first timestep in Nalu-Wind. This ran with no issues for the entire 30 min of wall clock time I allocated in the queue.

*******************************************************
Time Step Count: 1 Current Time: 0.00344353
 dtN: 0.00344353 dtNm1: 0.00344353 gammas: 1 -1 0
 DualNodalVolume min: 8.74301e-12 max: 0.737549 total: 119470
 Max Courant: 4073.99 Max Reynolds: 278166 (realm_1)
realm_1::advance_time_step() 
NLI    Name           Linear Iter      Linear Res     NLinear Res    Scaled NLR
---    ----           -----------      ----------     -----------    ----------
Aero models - Execute
1/4    Equation System Iteration
 1/1      myLowMach
        MomentumEQS_X          6        0.00219027       0.339946             1
        MomentumEQS_Y          6        0.00122359       0.208249             1
        MomentumEQS_Z          5       0.000648655      0.0273747             1
        MomentumEQS           17        0.00259137         0.3996             1
        ContinuityEQS         21          0.177085         7.0387             1
 1/1ShearStressTransportWrap
        TurbKineticEnergyEQS   5        0.00675613        0.70956             1
        SpecDissRateEQS        6           7257.94    1.08703e+06             1
2/4    Equation System Iteration
 1/1      myLowMach
        MomentumEQS_X          6          0.277873        14.9757       44.0533
        MomentumEQS_Y          6           0.12425        6.02642       28.9385
        MomentumEQS_Z          6          0.109949        3.21674       117.508
        MomentumEQS           18          0.323636        16.4602       41.1917
        ContinuityEQS         13           17.0319        584.207       82.9992
 1/1ShearStressTransportWrap
        TurbKineticEnergyEQS  98        0.00140091      0.0383274     0.0540157
        SpecDissRateEQS        9           21002.5    1.00507e+06      0.924603
3/4    Equation System Iteration
 1/1      myLowMach
        MomentumEQS_X         41          0.219607        6.31151       18.5662
        MomentumEQS_Y         37          0.112566        3.25961       15.6524
        MomentumEQS_Z         48         0.0387412        1.29877       47.4441
        MomentumEQS          126          0.249799        7.22129       18.0713
        ContinuityEQS         19           10.2296        319.018       45.3234
 1/1ShearStressTransportWrap
        TurbKineticEnergyEQS  13        0.00435818       0.122828      0.173104
        SpecDissRateEQS        1        1.1994e+06     8.6938e+10       79977.7
4/4    Equation System Iteration
 1/1      myLowMach
        MomentumEQS_X         10          0.100901        3.03511       8.92823
        MomentumEQS_Y         11         0.0316671        1.98288       9.52169
        MomentumEQS_Z          9         0.0137491       0.567972       20.7481
        MomentumEQS           30          0.106643        3.66965       9.18331
        ContinuityEQS         23           5.35502        145.578       20.6825
 1/1ShearStressTransportWrap
        TurbKineticEnergyEQS  11        0.00190954      0.0766999      0.108095
        SpecDissRateEQS        1            955315    3.46896e+11        319124
Aero models - Predict model time step

src/gcl/MeshVelocityAlg.C Outdated Show resolved Hide resolved
@@ -79,8 +79,7 @@ MeshVelocityAlg<AlgTraits>::MeshVelocityAlg(Realm& realm, stk::mesh::Part* part)
NUM_IP, isoParCoords_, isoCoordsShapeFcnHostView_.data());
Kokkos::deep_copy(isoCoordsShapeFcnDeviceView_, isoCoordsShapeFcnHostView_);

auto scsFaceNodeMapHostView =
Kokkos::create_mirror(scsFaceNodeMapDeviceView_);
Kokkos::View<int[12][4], Kokkos::HostSpace> scsFaceNodeMapHostView;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the change from create_mirror to a direct definition reverse a previous commit? I wonder if int[12][4] is unstable across platforms and should be int** instead. But if this gets Frontier up and running, great catch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree with these questions. It would seem like [12][4] is specific to a single element/face topology. Is this algorithm only going to run on a single kind of elements/faces? Generally create_mirror is considered safer because it automatically creates the same type/shape as the device view.
But like James says, if this makes it work then it's hard to argue against.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alanw0 It is hard coded to a single element type right now. The code throws if it is not hex.

@overfelt The unstable one was int**. I tested this one, but create_mirror was suggested in a code review. I must have forgot to test it.

Chatting with Kokkos people this seems to be wrong still and surprising that it works. I'm wondering if our loop is hardcoded to a layout right execution, and we are going out of index bounds in the device loop. It may still work if the memory is all adjacent. Let's dig into this a little more before merging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the CI failure indicates something is amiss here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any unit tests for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This algorithm gets used only when FSI is activated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the test that is failing CI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If James' solution of making it into a const int array is feasible, that sounds good.
But I agree with Paul, it should be easy to query the sizes and then that should guarantee correct looping. You can get those sizes with .extent(0) and .extent(1), I think. Would need to verify exact syntax...

Copy link
Contributor

@ashesh2512 ashesh2512 Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@psakievich Might be worthwhile to query the sizes just as a sanity check, even though this is hard coded for one element type. Might reveal our underlying issue, although I would be really surprised if it is not [12][4], considering we hard code scsFaceNodeMap_.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay. the gcl unit test doesn't run on device. need to find out why. We probably need to spin up a parallel test suite of this on summit to make sure we don't break cuda too

@psakievich
Copy link
Contributor Author

For reference #1193 was the PR where we made these changes.

@@ -71,8 +71,7 @@ MeshVelocityEdgeAlg<AlgTraits>::MeshVelocityEdgeAlg(
19, &isoParCoords_[0], isoCoordsShapeFcnHostView_.data());
Kokkos::deep_copy(isoCoordsShapeFcnDeviceView_, isoCoordsShapeFcnHostView_);

auto scsFaceNodeMapHostView =
Kokkos::create_mirror(scsFaceNodeMapDeviceView_);
Kokkos::View<int[12][4], Kokkos::HostSpace> scsFaceNodeMapHostView;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not initialize the view, the pointer will be a null ptr. Needs a string as argument to the constructor.

Copy link
Contributor

@ashesh2512 ashesh2512 Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so Kokkos::View<int[12][4], Kokkos::HostSpace> scsFaceNodeMapHostView (<some string label>) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but that will not allow for the deep_copy below because device and host view will have different layouts.

@psakievich
Copy link
Contributor Author

This is incorrect behavior. Summary from slack conversation

So I have a hypothesis for what is going on in our code. Currently my hypothesis is the deep copy doesn't happen but doesn't fail and the scsFaceNodeMapDeviceView_ is initialized to all zeros. This would make our source term zero which it is at the start of the simulation so everything would look fine.
However, when we create_mirror we populate it with the correct values, but it likely exposes another bug.

This hypothesis was verified by printf statements inside the device kernels.

@psakievich psakievich closed this Aug 30, 2023
@PaulMullowney
Copy link
Contributor

https://github.com/Exawind/nalu-wind/blob/master/src/gcl/MeshVelocityAlg.C#L84

Shouldn't these hardcoded loop dimensions be fixed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants