8.1.1 #14
base: master
Conversation
Bytes written is sometimes less than the original ptx.size(), and hipModuleLoad throws a "string too long" exception. Setting binary output mode writes all the bytes.
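For illustration, a minimal sketch of the fix described above (the helper name writeModuleFile is mine, not the plugin's actual function): the point is simply that the output stream must be opened with std::ios::binary so the byte count on disk matches ptx.size().

```cpp
// Minimal sketch (hypothetical helper name): write compiler output in binary
// mode so no bytes are translated or dropped on the way to disk.
#include <fstream>
#include <stdexcept>
#include <string>

void writeModuleFile(const std::string& path, const std::string& code) {
    // std::ios::binary prevents newline translation that can make the number
    // of bytes written differ from code.size().
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out)
        throw std::runtime_error("Failed to open " + path);
    out.write(code.data(), static_cast<std::streamsize>(code.size()));
}
```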
- Port optimization from openmm/openmm#4070 to HIP for compatibility with upcoming OpenMM 8.1 release
- It may be possible to revert some of the changes in amd@08c967d, which was optimizing for small systems as well
…p into develop_stream
Related to amd#7
…hip into develop_stream
The nonbonded kernel uses USE_NEIGHBOR_LIST (useNeighborList), so the host code must also check it instead of useCutoff. See also openmm/openmm#3462
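To make the distinction concrete, here is a hedged illustration (the struct and function names are mine, not the plugin's code) of why the host-side condition should mirror the kernel's USE_NEIGHBOR_LIST define rather than useCutoff:

```cpp
// Illustrative sketch only -- the real plugin code differs; this just shows the
// host-side condition matching the kernel's USE_NEIGHBOR_LIST define.
struct NonbondedConfig {
    bool useCutoff;        // a cutoff may be used with or without a neighbor list
    bool useNeighborList;  // what the kernel is actually compiled with
};

// Hypothetical helper: decide whether the host must build/update the neighbor
// list before launching the nonbonded kernel.
bool needsNeighborListUpdate(const NonbondedConfig& cfg) {
    // Checking useCutoff here would be wrong: the kernel keys its code paths
    // off USE_NEIGHBOR_LIST, so the host must test the same flag.
    return cfg.useNeighborList;
}
```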
…xed bug in large blocks optimization with triclinic boxes" openmm/openmm@796ffaa openmm/openmm@4c10732
Large blocks
- hipModuleLoad sometimes fails to load modules for unknown reasons; use manual loading from the output file and hipModuleLoadDataEx.
- Use amdclang++ directly instead of hipcc.
- Use --offload-device-only instead of --genco.
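A rough sketch of the first bullet, under the assumption that the code object is already on disk (the function name and error handling are mine, not the plugin's exact code): read the file into memory and hand the bytes to hipModuleLoadDataEx instead of calling hipModuleLoad(path).

```cpp
// Sketch of the "load the code object manually" approach (assumed shape, not
// the plugin's exact code).
#include <hip/hip_runtime.h>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

hipModule_t loadModuleFromFile(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in)
        throw std::runtime_error("Cannot open code object: " + path);
    std::vector<char> image((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    hipModule_t module;
    // No JIT options are passed; the last three arguments are unused here.
    hipError_t err = hipModuleLoadDataEx(&module, image.data(), 0, nullptr, nullptr);
    if (err != hipSuccess)
        throw std::runtime_error(std::string("hipModuleLoadDataEx failed: ") +
                                 hipGetErrorString(err));
    return module;
}
```

The second and third bullets concern the offline compilation step itself (producing the code object with amdclang++ and --offload-device-only), which is independent of this loading path.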
Prepare 8.1.1
Thank you! As I said, hipRTC in ROCm 6.0.0 has issues with built-in vector and complex types; according to the commit log they seem to be fixed, so hopefully the fixes will be included in the next minor release of ROCm. Regarding the first error (without hipRTC), I have no ideas yet. It looks like the ROCm installation is not ok.
FWIW, hipRTC is also failing on Windows with the newly released 5.7.1 SDK when compiling vector operations. It worked fine with the 5.5.1 SDK. hipcc works properly in 5.7.1 and performance is similar to 5.5.1.
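For anyone trying to narrow this down, a minimal standalone repro sketch (mine, not from this thread) that pushes a kernel using built-in vector types through hipRTC and prints the compile log:

```cpp
// Minimal hipRTC repro sketch: compile a tiny kernel that touches float4 /
// make_float4 and report whether compilation succeeded.
#include <hip/hiprtc.h>
#include <iostream>
#include <string>

static const char* kSource = R"(
extern "C" __global__ void scale(float4* data, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 v = data[i];
    data[i] = make_float4(v.x * s, v.y * s, v.z * s, v.w * s);
}
)";

int main() {
    hiprtcProgram prog;
    hiprtcCreateProgram(&prog, kSource, "scale.hip", 0, nullptr, nullptr);
    hiprtcResult res = hiprtcCompileProgram(prog, 0, nullptr);

    // Print the compile log, which is where the vector-type errors show up.
    size_t logSize = 0;
    hiprtcGetProgramLogSize(prog, &logSize);
    if (logSize > 1) {
        std::string log(logSize, '\0');
        hiprtcGetProgramLog(prog, &log[0]);
        std::cout << log << std::endl;
    }
    std::cout << (res == HIPRTC_SUCCESS ? "compiled" : "failed") << std::endl;
    hiprtcDestroyProgram(&prog);
    return res == HIPRTC_SUCCESS ? 0 : 1;
}
```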
hipconfig content of 6.0.0:
For comparison, the working 5.3.0:
Diff shows these differences between 5.3.0 and 6.0.0:
But 5.5.1, which works, shows no differences compared to 6.0.0. So I think there should be no problem there, except that 6.0.0 and 5.7.1 (which has the same problem) should be fixed. With regard to the rest, I will give an answer in the next 30 minutes.
No, I mean what hipconfig prints when you run it. (OpenMM-HIP doesn't use hipconfig; I just want to be sure that your ROCm installation is not broken, because it looks like something is not right there.)
Something is definitely not right with your ROCm. Perhaps it's worth uninstalling it and installing it from scratch (if you have admin privileges, of course).
I opened a ticket to repair it. It's weird, because the company states "we did regression testing on the machine with the new ROCm and didn't find any issues".
Managed to solve the problem with the administrators, but the graphics card drivers were updated. This resulted in an inability to run ROCm versions older than 5.7.1, so the oldest I can run at the moment is 5.7.1.

In the comments where I mention segfaults, they refer to systems that are either missing from the graph or marked with 0 performance. Usually it is the amber benchmark and more than 1 GPU.

Here is the comparison: ROCm 5.7.1 (OpenMM 8.0, with OpenMM HIP for 8.0) vs ROCm 6.0.0 (OpenMM 8.1.1 with OpenMM HIP 8.1.1). The graphics card is an MI250, single GPU (so only 1/2 of an MI250 in fact). I will update this post with different numbers of GPUs as time goes on.

2 GPUs (full MI250)
- Comment: With 2 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown." for the whole amber suite.

3 GPUs (3/4 MI250)
- Comment: With 4 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown." for the whole amber suite.
- Comment: With 4 GPUs, OpenMM 8.1.1 / ROCm 6.0.0 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x562454718330) on address 0x562472e00000. Reason: Unknown."

4 GPUs (2 MI250)
- Comment: With 4 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown." for the whole amber suite.

8 GPUs (4 full MI250)
- Comment: With 8 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown." for the whole amber suite.
- Comment: With 8 GPUs, OpenMM 8.1.1 / ROCm 6.0.0 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x562454718330) on address 0x562472e00000. Reason: Unknown."

Generally the same applies to the STMV simulation.
Thank you! I need to analyze the results, especially these crashes in multi-GPU simulations.
There is a chance that some of the tests may freeze occasionally. I suspect a bug in
Hello. I tested it; I can't comment on the multi-GPU failures. AToM-OpenMM uses multiple GPUs, but in a distributed asynchronous mode rather than a parallel mode. Thanks.
@egallicc Thank you! @DanielWicz I wonder if something may be wrong with your XGMI configuration. Could you post here what rocm-smi reports?
Hi! I'm especially interested in multi-GPU stability. Thanks!
I will retest with regard to gbsa. Generally there were some updates on our nodes and it now gets 1213.11 ns/day. But I still get these segmentation faults for multiple GPUs. Here is the output of rocm-smi:
Why do you recommend single precision? Usually OpenMM uses mixed (single is not recommended in most cases).
No reason, except that single is the default precision of benchmark.py. But I assume you get crashes with all three precisions in multi-GPU, right?
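Tangentially, for anyone reproducing this outside benchmark.py: a hedged C++ sketch of how precision and device selection map to platform properties. The property names used here ("Precision", "DeviceIndex") are an assumption that the HIP platform mirrors the CUDA platform; check Platform::getPropertyNames() on your build.

```cpp
// Hedged sketch: request mixed precision and two devices through platform
// properties when creating an OpenMM context on the HIP platform.
#include "OpenMM.h"
#include <map>
#include <string>

void runWithHip(OpenMM::System& system, OpenMM::Integrator& integrator) {
    OpenMM::Platform& platform = OpenMM::Platform::getPlatformByName("HIP");
    std::map<std::string, std::string> props;
    props["Precision"] = "mixed";    // "single", "mixed", or "double" (assumed name)
    props["DeviceIndex"] = "0,1";    // comma-separated list -> multi-GPU run (assumed name)
    OpenMM::Context context(system, integrator, platform, props);
    // ... set positions and integrate as usual ...
}
```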
I tried single precision and I still get the same segmentation faults. Maybe I should try some environment variable related to memory allocation or PME? Some people have reported a similar problem. Can I somehow graph VRAM usage over time? I have a strong suspicion that the VRAM is not released between runs inside the Python file.
I asked my colleague with access to an MI200 to run multi-GPU benchmarks: 2 GPUs and 4 GPUs ran without issues. The system has ROCm 5.7, so I think something is configured incorrectly on your system. Did you try to run …? Regarding your question about vmem: you can use … . Even if there were VRAM leaks, they are unlikely to have caused the crash, considering that most of the benchmarks are very small and the MI200 has a lot of memory. You can also try to run with …
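Since the specific tool recommended in the reply above was lost in the quoted text, here is one hedged in-process option (my suggestion, not necessarily what was meant): poll hipMemGetInfo periodically and log used VRAM over time.

```cpp
// Hedged suggestion: track VRAM usage over time from inside the process by
// polling hipMemGetInfo once per second and printing used/total memory.
#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>

void logVramUsage(int seconds) {
    for (int t = 0; t < seconds; ++t) {
        size_t freeBytes = 0, totalBytes = 0;
        if (hipMemGetInfo(&freeBytes, &totalBytes) == hipSuccess) {
            std::printf("t=%ds used=%zu MiB of %zu MiB\n", t,
                        (totalBytes - freeBytes) >> 20, totalBytes >> 20);
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```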
@jdmaia Did you have a chance to run tests and benchmarks? I think the PR should be merged. |
This PR supersedes PRs #7, #8, #9.
Closes #11, closes #12, closes #16.
Known issues:
- TestHipCompiler fails and OPENMM_USE_HIPRTC cannot be used.

Whoever has the opportunity, please build and run the tests.
I've also prepared a new conda package openmm-hip==8.1.1beta (https://anaconda.org/StreamHPC/openmm-hip/files). The package is built on ROCm 6.0.0; due to binary incompatibilities between 5.* and 6.*, it won't work on ROCm 5.*.
I don't know if it's worth supporting old ROCm versions, or how to do it properly (upload packages with different labels like rocm-5.7 and rocm-6.0 so the user can choose the correct version?). I'm open to suggestions.
If everything is ok, we can merge it.