8.1.1 #14
base: master
Conversation
Bytes written is sometimes less than the original ptx.size(), and hipModuleLoad throws a "string too long" exception. Setting binary output mode writes all the bytes.
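For illustration, a minimal sketch of the fix described above (the helper name writeModuleFile is mine, not the plugin's actual function): the point is simply that the output stream must be opened with std::ios::binary so the byte count on disk matches ptx.size().

```cpp
// Minimal sketch (hypothetical helper name): write compiler output in binary
// mode so no bytes are translated or dropped on the way to disk.
#include <fstream>
#include <stdexcept>
#include <string>

void writeModuleFile(const std::string& path, const std::string& code) {
    // std::ios::binary prevents newline translation that can make the number
    // of bytes written differ from code.size().
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out)
        throw std::runtime_error("Failed to open " + path);
    out.write(code.data(), static_cast<std::streamsize>(code.size()));
}
```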
- Port optimization from openmm/openmm#4070 to HIP for compatibility with upcoming OpenMM 8.1 release
- It may be possible to revert some of the changes in amd@08c967d, which was optimizing for small systems as well
…p into develop_stream
Related to amd#7
…hip into develop_stream
The nonbonded kernel uses USE_NEIGHBOR_LIST (useNeighborList), so the host code must also check it instead of useCutoff. See also openmm/openmm#3462
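To make the distinction concrete, here is a hedged illustration (the struct and function names are mine, not the plugin's code) of why the host-side condition should mirror the kernel's USE_NEIGHBOR_LIST define rather than useCutoff:

```cpp
// Illustrative sketch only -- the real plugin code differs; this just shows the
// host-side condition matching the kernel's USE_NEIGHBOR_LIST define.
struct NonbondedConfig {
    bool useCutoff;        // a cutoff may be used with or without a neighbor list
    bool useNeighborList;  // what the kernel is actually compiled with
};

// Hypothetical helper: decide whether the host must build/update the neighbor
// list before launching the nonbonded kernel.
bool needsNeighborListUpdate(const NonbondedConfig& cfg) {
    // Checking useCutoff here would be wrong: the kernel keys its code paths
    // off USE_NEIGHBOR_LIST, so the host must test the same flag.
    return cfg.useNeighborList;
}
```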
…xed bug in large blocks optimization with triclinic boxes" openmm/openmm@796ffaa openmm/openmm@4c10732
Large blocks
- hipModuleLoad sometimes fails to load modules for unknown reasons; use manual loading from the output file and hipModuleLoadDataEx.
- Use amdclang++ directly instead of hipcc.
- Use --offload-device-only instead of --genco.
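A rough sketch of the first bullet, under the assumption that the code object is already on disk (the function name and error handling are mine, not the plugin's exact code): read the file into memory and hand the bytes to hipModuleLoadDataEx instead of calling hipModuleLoad(path).

```cpp
// Sketch of the "load the code object manually" approach (assumed shape, not
// the plugin's exact code).
#include <hip/hip_runtime.h>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

hipModule_t loadModuleFromFile(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in)
        throw std::runtime_error("Cannot open code object: " + path);
    std::vector<char> image((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    hipModule_t module;
    // No JIT options are passed; the last three arguments are unused here.
    hipError_t err = hipModuleLoadDataEx(&module, image.data(), 0, nullptr, nullptr);
    if (err != hipSuccess)
        throw std::runtime_error(std::string("hipModuleLoadDataEx failed: ") +
                                 hipGetErrorString(err));
    return module;
}
```

The second and third bullets concern the offline compilation step itself (producing the code object with amdclang++ and --offload-device-only), which is independent of this loading path.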
Prepare 8.1.1
Thank you! As I said, hipRTC in ROCm 6.0.0 has issues with built-in vector and complex types; according to the commit log they seem to be fixed, so hopefully the fixes will be included in the next minor release of ROCm. Regarding the first error (without hipRTC), I have no ideas yet. It looks like the ROCm installation is not ok.
FWIW, hipRTC is also failing on Windows with the newly released 5.7.1 SDK when compiling vector operations. It worked fine with the 5.5.1 SDK. hipcc works properly in 5.7.1 and performance is similar to 5.5.1.
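For anyone trying to narrow this down, a minimal standalone repro sketch (mine, not from this thread) that pushes a kernel using built-in vector types through hipRTC and prints the compile log:

```cpp
// Minimal hipRTC repro sketch: compile a tiny kernel that touches float4 /
// make_float4 and report whether compilation succeeded.
#include <hip/hiprtc.h>
#include <iostream>
#include <string>

static const char* kSource = R"(
extern "C" __global__ void scale(float4* data, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 v = data[i];
    data[i] = make_float4(v.x * s, v.y * s, v.z * s, v.w * s);
}
)";

int main() {
    hiprtcProgram prog;
    hiprtcCreateProgram(&prog, kSource, "scale.hip", 0, nullptr, nullptr);
    hiprtcResult res = hiprtcCompileProgram(prog, 0, nullptr);

    // Print the compile log, which is where the vector-type errors show up.
    size_t logSize = 0;
    hiprtcGetProgramLogSize(prog, &logSize);
    if (logSize > 1) {
        std::string log(logSize, '\0');
        hiprtcGetProgramLog(prog, &log[0]);
        std::cout << log << std::endl;
    }
    std::cout << (res == HIPRTC_SUCCESS ? "compiled" : "failed") << std::endl;
    hiprtcDestroyProgram(&prog);
    return res == HIPRTC_SUCCESS ? 0 : 1;
}
```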
hipconfig content of 6.0.0:
For comparison, the working 5.3.0:
Diff shows these differences between 5.3.0 and 6.0.0:
But 5.5.1, which works, shows no differences compared to 6.0.0. So I think there should be no problem there, except that 6.0.0 and 5.7.1 (which has the same problem) should be fixed. With regard to the rest, I will give an answer in the next 30 minutes.
No, I mean what hipconfig prints when you run it. (OpenMM-HIP doesn't use hipconfig; I just want to be sure that your ROCm installation is not broken, because it looks like something is not right there.)
Something is definitely not right with your ROCm. Perhaps it's worth uninstalling it and installing it from scratch (if you have admin privileges, of course).
I opened a ticket to repair it. It's weird, because the company states "we did regression testing on the machine with the new ROCm and didn't find any issues".
Managed to solve the problem with the administrators, but the graphics card drivers were updated. This resulted in an inability to run ROCm versions older than 5.7.1, so the oldest I can run at the moment is 5.7.1.

In the comments where I mention segfaults, they refer to systems that are either missing from the graph or marked with 0 performance. Usually it is the amber benchmark and more than 1 GPU.

Here is the comparison: ROCm 5.7.1 (OpenMM 8.0, with OpenMM HIP for 8.0) vs ROCm 6.0.0 (OpenMM 8.1.1 with OpenMM HIP 8.1.1). The graphics card is an MI250, single GPU (so only 1/2 of an MI250 in fact). I will update this post with different numbers of GPUs as time goes on.

2 GPUs (full MI250)
- Comment: With 2 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown." for the whole amber suite.

3 GPUs (3/4 MI250)
- Comment: With 4 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown." for the whole amber suite.
- Comment: With 4 GPUs, OpenMM 8.1.1 / ROCm 6.0.0 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x562454718330) on address 0x562472e00000. Reason: Unknown."

4 GPUs (2 MI250)
- Comment: With 4 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown." for the whole amber suite.

8 GPUs (4 full MI250)
- Comment: With 8 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown." for the whole amber suite.
- Comment: With 8 GPUs, OpenMM 8.1.1 / ROCm 6.0.0 is segfaulting with "Memory access fault by GPU node-4 (Agent handle: 0x562454718330) on address 0x562472e00000. Reason: Unknown."

Generally the same applies to the STMV simulation.
Thank you! I need to analyze the results, especially these crashes in multi-GPU simulations.
There is a chance that some of the tests may freeze occasionally. I suspect a bug in
Hello. I tested it; I can't comment on the multi-GPU failures. AToM-OpenMM uses multiple GPUs, but in a distributed asynchronous mode rather than a parallel mode. Thanks.
@egallicc Thank you! @DanielWicz I wonder if something may be wrong with your XGMI configuration. Could you post here what rocm-smi reports?
Hi! I'm especially interested in multi-GPU stability. Thanks!
I will retest with regard to gbsa. Generally there were some updates on our nodes and it now gets 1213.11 ns/day. But I still get these segmentation faults for multiple GPUs. Here is the output of rocm-smi:
Why do you recommend single precision? Usually OpenMM uses mixed (single is not recommended in most cases).
No reason, except that single is the default precision of benchmark.py. But I assume you get crashes with all three precisions in multi-GPU, right?
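Tangentially, for anyone reproducing this outside benchmark.py: a hedged C++ sketch of how precision and device selection map to platform properties. The property names used here ("Precision", "DeviceIndex") are an assumption that the HIP platform mirrors the CUDA platform; check Platform::getPropertyNames() on your build.

```cpp
// Hedged sketch: request mixed precision and two devices through platform
// properties when creating an OpenMM context on the HIP platform.
#include "OpenMM.h"
#include <map>
#include <string>

void runWithHip(OpenMM::System& system, OpenMM::Integrator& integrator) {
    OpenMM::Platform& platform = OpenMM::Platform::getPlatformByName("HIP");
    std::map<std::string, std::string> props;
    props["Precision"] = "mixed";    // "single", "mixed", or "double" (assumed name)
    props["DeviceIndex"] = "0,1";    // comma-separated list -> multi-GPU run (assumed name)
    OpenMM::Context context(system, integrator, platform, props);
    // ... set positions and integrate as usual ...
}
```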
I tried single precision and I still get the same segmentation faults. Maybe I should try some environment variable related to memory allocation or PME? Some people have reported a similar problem. Can I somehow graph VRAM usage over time? I have a strong suspicion that the VRAM is not released between runs inside the Python file.
I asked my colleague with access to an MI200 to run multi-GPU benchmarks: 2 GPUs and 4 GPUs ran without issues. The system has ROCm 5.7, so I think something is configured incorrectly on your system. Did you try to run …? Regarding your question about vmem: you can use … . Even if there were VRAM leaks, they are unlikely to have caused the crash, considering that most of the benchmarks are very small and the MI200 has a lot of memory. You can also try to run with …
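Since the specific tool recommended in the reply above was lost in the quoted text, here is one hedged in-process option (my suggestion, not necessarily what was meant): poll hipMemGetInfo periodically and log used VRAM over time.

```cpp
// Hedged suggestion: track VRAM usage over time from inside the process by
// polling hipMemGetInfo once per second and printing used/total memory.
#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>

void logVramUsage(int seconds) {
    for (int t = 0; t < seconds; ++t) {
        size_t freeBytes = 0, totalBytes = 0;
        if (hipMemGetInfo(&freeBytes, &totalBytes) == hipSuccess) {
            std::printf("t=%ds used=%zu MiB of %zu MiB\n", t,
                        (totalBytes - freeBytes) >> 20, totalBytes >> 20);
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```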
@jdmaia Did you have a chance to run tests and benchmarks? I think the PR should be merged. |
This PR supersedes PRs #7, #8, #9.
Closes #11, closes #12, closes #16.
Known issues:
- TestHipCompiler fails and OPENMM_USE_HIPRTC cannot be used.

Whoever has the opportunity, please build and run the tests.
I've also prepared a new conda package openmm-hip==8.1.1beta (https://anaconda.org/StreamHPC/openmm-hip/files). The package is built on ROCm 6.0.0; due to binary incompatibilities between 5.* and 6.*, it won't work on ROCm 5.*.
I don't know if it's worth supporting old ROCm versions, or how to do it properly (upload packages with different labels like rocm-5.7 and rocm-6.0 so the user can choose the correct version?). I'm open to suggestions.
If everything is ok, we can merge it.