Implement vjps of vjps for HPE and SAP (CPU) #52
Conversation
This looks good overall, but I don't understand what's going on in the actual implementations.

Regarding the test failure: this could be caused by a data race in OpenMP. You could try running the tests without OpenMP to check whether that's the case, and/or compiling the code with the thread sanitizer, which should catch some issues at runtime.
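The two checks suggested above could look something like the following. This is a sketch, not the project's actual build instructions: the build directory names, the use of `CMAKE_DISABLE_FIND_PACKAGE_OpenMP` to drop OpenMP, and the exact sanitizer flags are assumptions.

```shell
# Hypothetical commands; adapt to the actual mops build setup.

# 1) Build and test without OpenMP, to see whether the failure disappears
cmake -S . -B build-serial -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_DISABLE_FIND_PACKAGE_OpenMP=ON
cmake --build build-serial
ctest --test-dir build-serial --output-on-failure

# 2) Build with ThreadSanitizer (GCC/Clang) to catch data races at runtime
cmake -S . -B build-tsan -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_CXX_FLAGS="-fsanitize=thread -g -O1"
cmake --build build-tsan
ctest --test-dir build-tsan --output-on-failure
```

If the serial build passes consistently while the OpenMP build fails intermittently, a data race is the likely culprit, and the TSan build should point at the offending accesses.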
An update on the bug: thanks @Luthaf for the suggestion to look into parallelism and use the thread sanitizer. That was indeed the issue: removing OpenMP parallelism from HPE eliminates the test failure. However, I found it nearly impossible to test the code with the thread sanitizer: on my laptop (Ubuntu 23.10), a known bug prevents the thread sanitizer from working properly ("unknown memory mapping"); on CI (Ubuntu 20.04), another known bug means the compiled thread sanitizer library is missing (the linker fails); and on Jed it is very difficult to build mops in the first place, because it requires a workaround (due to the conflict of Jed having a CUDA compiler and

To that, I should add that the double backward of HPE is virtually useless for us: HPE cannot be used during training of equivariant models (where SAP would take its place). However, if one uses ACE/WK, the trained model can be re-expressed in terms of a few calls to HPE (rather than many to SAP), and that is its main use case. As a result, the double backward for HPE is there just for completeness, but it is virtually useless in our field (and, I suspect, in general: who is going to need the VJP of the VJP of homogeneous polynomials?).

As a result, I disabled parallelism for the VJP of VJP of HPE, and I'm going to open a (very low-priority) issue to fix it. I will also open an issue to document the workaround to build mops-torch without CUDA in environments where a CUDA compiler is available (this PR is a good start, because it adds a MOPS_CUDA option rather than just looking at the availability of a CUDA compiler).

Regarding this PR, I will try to address the review from @Luthaf, and then he will be able to review again.
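For context, an explicit `MOPS_CUDA` switch could be wired up along these lines. Only the option name comes from this PR; the rest of the fragment (the `CUDAToolkit` lookup, the default value) is illustrative, not the PR's actual CMake code.

```
# Sketch of a CMakeLists.txt fragment: build CUDA kernels only when
# explicitly requested, instead of whenever a CUDA compiler is found.
option(MOPS_CUDA "Build the CUDA kernels" OFF)
if(MOPS_CUDA)
    enable_language(CUDA)
    find_package(CUDAToolkit REQUIRED)
endif()
```

With such an option, environments that happen to have `nvcc` installed (like Jed) can still build a CPU-only mops with `-DMOPS_CUDA=OFF`.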
Let's keep this moving, we can improve the docs further later!
This is needed to train models on forces and stresses, for example.
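To illustrate why force training needs a VJP of a VJP: forces are themselves a first derivative of the energy, F = -dE/dx, so differentiating a force-matching loss with respect to the model parameters requires a second (mixed) derivative of E. The sketch below uses a toy one-term polynomial energy in plain NumPy, with the double backward written out by hand; the model and function names are illustrative, not the mops API.

```python
# Toy illustration (assumptions: made-up quadratic "energy" model,
# plain NumPy instead of the real mops kernels) of the double backward
# needed to train on forces.
import numpy as np

def energy(theta, x):
    # Toy homogeneous polynomial energy: E = sum_i theta_i * x_i**2
    return np.sum(theta * x**2)

def forces(theta, x):
    # First backward (VJP of E w.r.t. positions): F = -dE/dx
    return -2.0 * theta * x

def force_loss(theta, x, f_ref):
    # Force-matching loss, as used when training on forces
    return np.sum((forces(theta, x) - f_ref) ** 2)

def force_loss_grad_theta(theta, x, f_ref):
    # Second backward (VJP of the VJP): since dF_i/dtheta_i = -2 x_i,
    # dL/dtheta_i = 2 * (F_i - f_ref_i) * (-2 x_i)
    return 2.0 * (forces(theta, x) - f_ref) * (-2.0 * x)

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
x = rng.normal(size=5)
f_ref = rng.normal(size=5)

# Check the hand-written double backward against finite differences
g = force_loss_grad_theta(theta, x, f_ref)
eps = 1e-6
g_fd = np.array([
    (force_loss(theta + eps * np.eye(5)[i], x, f_ref)
     - force_loss(theta - eps * np.eye(5)[i], x, f_ref)) / (2 * eps)
    for i in range(5)
])
print(np.allclose(g, g_fd, atol=1e-5))  # True
```

In an autodiff framework the same thing happens implicitly: computing forces calls the VJP of the energy, and backpropagating the force loss through that call invokes the VJP of the VJP, which is what this PR implements for HPE and SAP on CPU.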