Matrix update engines direct inversion #3470
Conversation
The failing test is breaking CI.
Test this please
remove extra resize_fill_constant; small cleanup; formatting files; separate timer for DDB inverse and batched inverse
Force-pushed from 13821fa to 78253d2
Investigating a test failure
Mixed precision (MP) passes test_wavefunction_trial but full precision (FP) doesn't.
Related to point 2.
Not a change request: a_mats are not padded while inv_a_mats are. What is our logic for this (memory usage?), or is it a historical thing?
It is more of a historical thing. When orbitals are produced by the SPOSet, they are not padded, thus a_mats are not padded.
Any batched call you think is useless is there for unit testing. They aren't useless unless you trust NVIDIA and AMD completely.
I will just drop the compute_mask from the API. My original direction on this was that all determinant data should be in per-crowd data structures and not stored in the determinant, engine, or matrix objects. It didn't make much difference for async CUDA transfers, so it's premature optimization. I just didn't want to change it, in order to get this code in and stop diverging. Since this has already been delayed for at least a couple of months, I might as well cut it.
So what is wrong with the unit test for test_DiracMatrixComputeCUDA? I don't think there is anything wrong with mw_invertTranspose at full precision. It looks to me like full-precision mw_invertTranspose works perfectly, as does the batched calculation of the log determinants. I will go over those unit tests again; I don't think it's likely that is broken. So either something is up with the input, or the higher-level code expects additional state in the individual walkers to be updated in some way.
I don't get what you mean here. We should have batching in unit tests. I don't even trust NVIDIA. |
I see what you mean. I was mixing two issues here, and really those unit tests should be moved to the Platform unit tests. The computeGetrf_batched is only in the API for the unit test; otherwise it could just be in the .cu. The computeGetri_batched call only exists at all, in either file, for the unit test; nothing is done in the function, so the unit test should just call the cuBLAS wrapper itself. I'll make an issue for moving the cuBLAS wrapper unit tests.
Force-pushed from b71988f to 555d3d8
…sion-Ye: fixes test_wavefunction_trial failing in the full precision build.
After a build in a fresh repo, this seems good.
FYI, I ran one NiO a64 benchmark in mixed precision and saw inversion taking 7% of total time. Before this PR, the inversion time was 3x that. Encouraging. I will do more tests once the non-determinism fix in upstream is merged.
Force-pushed from 0a34412 to 22cc717
OK, by propagating the changes from #3514, I believe everything slated to possibly be const in the API is now const. So this is ready to be reviewed again.
Test this please
What is the status on this? What is the status of the nondeterminism in develop?
I need to do a performance check. The additional transfer after matrix inversion concerns me, but as long as it doesn't slow down the full run too much (<20%), we will merge and improve later. I will let you know this week.
Start testing in-house |
Pros: the acceleration is real.
Cons: it needs additional device memory.
I have prepared a PR to allow selecting the inverter from input, so users have a larger degree of freedom.
Proposed changes
Connect the CUDA direct inversion with minimum change to both develop and my design for performant and flexible acceleration of DiracDeterminantBatched.
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
Leconte
Checklist
Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.