Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement vjps of vjps for HPE and SAP (CPU) #52

Merged
merged 16 commits into from
Apr 29, 2024
Merged

Conversation

frostedoyster
Copy link
Collaborator

This is needed to train models on forces and stresses, for example

@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch 3 times, most recently from f0d2ec8 to d0cef09 Compare April 14, 2024 09:43
@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch from d0cef09 to 142bcac Compare April 14, 2024 12:13
@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch from a346353 to fa07621 Compare April 14, 2024 15:41
@frostedoyster frostedoyster changed the title Implement vjps of vjps on CPU Implement vjps of vjps for HPE and SAP (CPU) Apr 16, 2024
@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch from 40cfcec to 531b23b Compare April 16, 2024 06:26
@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch from 4b4abed to 9bc1555 Compare April 16, 2024 07:22
@frostedoyster frostedoyster requested a review from Luthaf April 16, 2024 07:29
@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch 3 times, most recently from 8fd61df to ff1eab3 Compare April 16, 2024 10:16
@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch from ff1eab3 to 5f1a125 Compare April 16, 2024 10:25
Copy link
Contributor

@Luthaf Luthaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall, but I don't understand what's going on for the actual implementations

@Luthaf
Copy link
Contributor

Luthaf commented Apr 17, 2024

Regarding the test failure: this could be caused by a data race in OpenMP. You could try running the tests without OpenMP to check if that's the case, and/or compiling the code with the threads sanitizer, which should catch some issues at runtime.

@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch 2 times, most recently from 1abe770 to 9791119 Compare April 19, 2024 05:30
@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch 3 times, most recently from 7912129 to a86c3ca Compare April 19, 2024 09:55
@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch from a86c3ca to d28b1fe Compare April 19, 2024 09:55
@frostedoyster
Copy link
Collaborator Author

frostedoyster commented Apr 19, 2024

An update on the bug:

Thanks @Luthaf for the suggestion to look into parallelism and use thread sanitizer. That was indeed the issue because removing OpenMP parallelism from HPE indeed eliminates the test failure. However, I found it to be nearly impossible to test the code with thread sanitizer: on my laptop (Ubuntu 23.10), there is a known bug that prevents thread sanitizer to work properly ("unknown memory mapping"), on CI (ubuntu 20.04) there is another known bug for which the compiled thread sanitizer library is missing (linker fails), on Jed it is very difficult to build mops in the first place because it requires a workaround (due to the conflict of Jed having a cuda compiler and libtorch_cuda.so missing).

To that, I should add that the double backward of HPE is virtually useless for us: HPE cannot be used during training of equivariant models (where SAP would take its place). However, if one uses ACE/WK, the trained model can be re-expressed in terms of a few calls to HPE (rather than many to SAP), and that is its main use case. As a result, the double backward for HPE is there just for completeness, but it is virtually useless in our field (and I suspect in general... who's going to need VJP of VJP of homogeneous polynomials?).

As a result, I disabled parallelism for the VJP of VJP of HPE and I'm going to open an issue to fix it (which will be very low-priority). I will also open an issue to document the workaround to build mops-torch without CUDA in environments where a CUDA compiler is available (this PR is a good start, because it adds a MOPS_CUDA option rather than just looking at the availability of a CUDA compiler).

Regarding this PR, I will try to address the review from @Luthaf and then he will be able to review again.

@frostedoyster frostedoyster marked this pull request as ready for review April 19, 2024 15:14
@frostedoyster frostedoyster requested a review from Luthaf April 19, 2024 15:14
@frostedoyster frostedoyster force-pushed the second-derivatives-cpu branch from b8b1989 to 2947ab4 Compare April 19, 2024 19:59
@frostedoyster frostedoyster requested a review from Luthaf April 28, 2024 13:26
@Luthaf
Copy link
Contributor

Luthaf commented Apr 29, 2024

Let's keep this moving, we can improve the docs further later!

@Luthaf Luthaf merged commit ca81398 into main Apr 29, 2024
4 checks passed
@Luthaf Luthaf deleted the second-derivatives-cpu branch April 29, 2024 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants