-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crusher fix #1205
Crusher fix #1205
Conversation
@@ -79,8 +79,7 @@ MeshVelocityAlg<AlgTraits>::MeshVelocityAlg(Realm& realm, stk::mesh::Part* part) | |||
NUM_IP, isoParCoords_, isoCoordsShapeFcnHostView_.data()); | |||
Kokkos::deep_copy(isoCoordsShapeFcnDeviceView_, isoCoordsShapeFcnHostView_); | |||
|
|||
auto scsFaceNodeMapHostView = | |||
Kokkos::create_mirror(scsFaceNodeMapDeviceView_); | |||
Kokkos::View<int[12][4], Kokkos::HostSpace> scsFaceNodeMapHostView; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the change from create_mirror to a direct definition reverse a previous commit? I wonder if int[12][4] is unstable across platforms and should be int** instead. But if this gets Frontier up and running, great catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree with these questions. It would seem like [12][4]
is specific to a single element/face topology. Is this algorithm only going to run on a single kind of elements/faces? Generally create_mirror
is considered safer because it automatically creates the same type/shape as the device view.
But like James says, if this makes it work then it's hard to argue against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alanw0 It is hard coded to a single element type right now. The code throws if it is not hex.
@overfelt The unstable one was int**
. I tested this one, but create_mirror
was suggested in a code review. I must have forgot to test it.
Chatting with Kokkos people this seems to be wrong still and surprising that it works. I'm wondering if our loop is hardcoded to a layout right execution, and we are going out of index bounds in the device loop. It may still work if the memory is all adjacent. Let's dig into this a little more before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the CI failure indicates something is amiss here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any unit tests for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This algorithm gets used only when FSI is activated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the test that is failing CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If James' solution of making it into a const int array is feasible, that sounds good.
But I agree with Paul, it should be easy to query the sizes and then that should guarantee correct looping. You can get those sizes with .extent(0)
and .extent(1)
, I think. Would need to verify exact syntax...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@psakievich Might be worthwhile to query the sizes just as a sanity check, even though this is hard coded for one element type. Might reveal our underlying issue, although I would be really surprised if it is not [12][4], considering we hard code scsFaceNodeMap_
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay. the gcl unit test doesn't run on device. need to find out why. We probably need to spin up a parallel test suite of this on summit to make sure we don't break cuda too
For reference #1193 was the PR where we made these changes. |
@@ -71,8 +71,7 @@ MeshVelocityEdgeAlg<AlgTraits>::MeshVelocityEdgeAlg( | |||
19, &isoParCoords_[0], isoCoordsShapeFcnHostView_.data()); | |||
Kokkos::deep_copy(isoCoordsShapeFcnDeviceView_, isoCoordsShapeFcnHostView_); | |||
|
|||
auto scsFaceNodeMapHostView = | |||
Kokkos::create_mirror(scsFaceNodeMapDeviceView_); | |||
Kokkos::View<int[12][4], Kokkos::HostSpace> scsFaceNodeMapHostView; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not initialize the view, the pointer will be a null ptr. Needs a string as argument to the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so Kokkos::View<int[12][4], Kokkos::HostSpace> scsFaceNodeMapHostView (<some string label>)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, but that will not allow for the deep_copy below because device and host view will have different layouts.
This is incorrect behavior. Summary from slack conversation
This hypothesis was verified by |
https://github.com/Exawind/nalu-wind/blob/master/src/gcl/MeshVelocityAlg.C#L84 Shouldn't these hardcoded loop dimensions be fixed? |
Fixes solver convergence issues in run on crusher ROCM run. Not sure I understand the subtly of why yet. Below is the output of the first timestep in Nalu-Wind. This ran with no issues for the entire 30 min of wall clock time I allocated in the queue.