Port large blocks enhancement #11
I did some experiments with porting these changes some time ago (including half precision for block sizes), but I couldn't see any measurable difference in performance on CDNA (MI100 and MI210) for the standard benchmarks. Now that I have access to RDNA, let me check how it performs there. I'll also push a new branch which you can try on your GPU. Btw, do you have a script for those synthetic simulations (water boxes)? I couldn't find it in my files, though I'm sure I tried it before.
There is definitely a trade-off, as mentioned in the comment in the source code. The cutoffs of 90K (CUDA) and 100K (OpenCL) atoms seemed correct, but I later noticed that 250K might be better for a wider range of simulations. Here's the water box script:

```python
from openmm import *
from openmm.app import *
from openmm.unit import *
import sys

# Box edge length (nm) and number of MD steps from the command line.
width = float(sys.argv[1])
steps = int(sys.argv[2])

# Build a water box of the requested size with the TIP3P-FB force field.
ff = ForceField('tip3pfb.xml')
modeller = Modeller(Topology(), [])
modeller.addSolvent(ff, boxSize=Vec3(width, width, width)*nanometers)
system = ff.createSystem(modeller.topology, nonbondedMethod=PME, nonbondedCutoff=1.0)
print(f'{width} {system.getNumParticles()}')

integrator = LangevinMiddleIntegrator(300, 1.0, 0.004)
platform = Platform.getPlatformByName('HIP')
simulation = Simulation(modeller.topology, system, integrator, platform)
simulation.reporters.append(StateDataReporter(sys.stdout, int(steps/5), step=True, elapsedTime=True, speed=True))
simulation.context.setPositions(modeller.positions)
simulation.minimizeEnergy(tolerance=100*kilojoules_per_mole/nanometer)
simulation.step(steps)
```

I call it from a script like this:

```
python run-water-box.py 3 5000
python run-water-box.py 4 5000
python run-water-box.py 5 5000
```
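If it helps, here is a minimal Python sketch of the same sweep, assuming the script above is saved as `run-water-box.py`; this driver is only an illustration, not part of the repository:

```python
# Hypothetical sweep driver for the water box benchmark above.
# Widths and step count mirror the shell commands in the comment.
import subprocess
import sys

widths = [3, 4, 5]   # box edge lengths in nm
steps = 5000

for width in widths:
    # The child's StateDataReporter output goes straight to our stdout.
    subprocess.run([sys.executable, 'run-water-box.py', str(width), str(steps)],
                   check=True)
```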
I have pushed the current stage of porting: https://github.com/StreamHPC/openmm-hip/commits/large-blocks

I collected benchmarking and profiling results for all standard benchmarks and the water box test, and I don't understand these results yet:

I understand that the water box is a very "synthetic" system, but I didn't expect such a big difference in behavior. If possible, please run it on your 6600.
Gains on large systems are definitely smaller than expected compared to OpenCL on my RX 6600 on Windows 10. I suspect this is partly due to the flat kernel: there is less of a long tail of declining utilization with HIP than with OpenCL, so it takes a lot more atoms to make the extra computation of the large blocks worthwhile. My results are in line with yours for 250K+ atoms, except that I didn't see a decline in amber20-stmv.

Water boxes:
For running FAH projects, I have the following script:

```python
from simtk import openmm, unit
import time
import os

nsteps = 5000
runs = ['12271']
runs.sort()

platform = openmm.Platform.getPlatformByName('HIP')
#platform.setPropertyDefaultValue('DisablePmeStream', '1')
print(platform.getOpenMMVersion())
platform.setPropertyDefaultValue('Precision', 'mixed')

def load(run, filename):
    # Deserialize a system/state/integrator XML file from the project folder.
    with open(os.path.join(run, filename), 'rt') as infile:
        return openmm.XmlSerializer.deserialize(infile.read())

for run in runs:
    folder = run + "/01/"
    system = load(folder, 'system.xml')
    state = load(folder, 'state.xml')
    integrator = load(folder, 'integrator.xml')
    context = openmm.Context(system, integrator, platform)
    context.setState(state)
    # A few warm-up steps before timing.
    integrator.step(10)
    state = context.getState(getEnergy=True)
    # print("Start simulation")
    initial_time = time.time()
    integrator.step(nsteps)
    state = context.getState(getEnergy=True)
    elapsed_time = (time.time() - initial_time) * unit.seconds
    time_per_step = elapsed_time / nsteps
    ns_per_day = (nsteps * integrator.getStepSize()) / elapsed_time / (unit.nanoseconds/unit.day)
    print(f'{run} {system.getNumParticles()} particles : {ns_per_day:.4f} ns/day')
```

I collected various projects over the years by doing the following:
I've pushed a final (so far) commit. The limit is not changed (90000); I'm not sure how to tune it, considering how much it depends on the parameters of the simulation, the GPU (its number of compute units), etc. There is one idea I want to check: large blocks cannot be used together with blocks sorted by size (which is essential for good performance without large blocks). But what if we sort the large blocks themselves by size? Blocks within each large block stay unsorted.
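As a rough host-side illustration of that idea (plain Python, not the actual HIP sorting code; the group width of 32 blocks per large block and the max-size sort key are assumptions):

```python
# Illustrative sketch only: order whole large blocks by a representative size
# while leaving the ordinary blocks inside each large block in their original
# order. BLOCKS_PER_LARGE_BLOCK and the max-size key are assumed values.
BLOCKS_PER_LARGE_BLOCK = 32

def order_blocks(block_sizes):
    """Return a block ordering where large blocks are sorted by size
    but the members of each large block stay unsorted."""
    n = len(block_sizes)
    large_blocks = [list(range(i, min(i + BLOCKS_PER_LARGE_BLOCK, n)))
                    for i in range(0, n, BLOCKS_PER_LARGE_BLOCK)]
    # Sort the large blocks themselves, largest representative size first.
    large_blocks.sort(key=lambda members: max(block_sizes[b] for b in members),
                      reverse=True)
    return [b for members in large_blocks for b in members]
```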
This version is throwing NaNs more often on my RX 6600. I would wait for the outcome of openmm/openmm#4334 to see if there's a way to improve sorting in combination with large blocks.
Thanks! I'll debug it, because it is definitely not right. By "the previous version" do you mean the previous version of the large-blocks branch or of the develop_stream branch?
"Previous version" is the first implementation of large blocks. The first time I benchmarked it there was a NaN on one of the water boxes. A few months ago I tested the develop_stream branch and overall it just seemed less stable. I don't recall exactly what the issues were. This is on Windows so only one version of ROCm available, 5.5.1, which is based on 5.5.0. I'm hoping ROCm 6.0 for Windows is included in AMD's December 6 event.
Singe precision test failures are sporadic. TestHipFFTImplHipFFTSingle and TestHipVariableLangevinIntegratorSingle fail consistently in Release builds.
TestHipVariableLangevinIntegratorSingle passed a few times in Debug mode. TestHipFFTImplHipFFTSingle fails on the first
|
I ran the standard benchmarks for 600 seconds and the water box for 50,000 steps: no NaNs. Could the failures be caused by the runtime/driver? I hope the new version of ROCm for Windows will be released soon so we can see whether it's related to these stability issues.
This is interesting. But kernels are always compiled with -O3; hipcc adds it by default unless a different optimization level is passed.
(Meanwhile I'm experimenting with openmm/openmm#4343, but it seems I also need to port one of the changes in CudaSort: the new keys are even less uniform than the old block sizes, and this makes the sorting kernel a bottleneck.)
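To illustrate why a skewed key distribution hurts a bucket-style sort (a standalone Python illustration only, not the CudaSort/openmm-hip code; the bucket count and distributions are arbitrary):

```python
# With a bucket-style sort, a skewed key distribution puts most keys into a
# few buckets, so per-bucket work is unbalanced and the largest bucket
# dominates the runtime. This toy measures that imbalance.
import random
from collections import Counter

def bucket_imbalance(keys, num_buckets=64):
    lo, hi = min(keys), max(keys)
    width = (hi - lo) / num_buckets or 1.0
    counts = Counter(min(int((k - lo) / width), num_buckets - 1) for k in keys)
    avg = len(keys) / num_buckets
    return max(counts.values()) / avg   # 1.0 would be perfectly uniform

uniform_keys = [random.uniform(0.0, 1.0) for _ in range(100_000)]
skewed_keys = [random.expovariate(10.0) for _ in range(100_000)]
print(bucket_imbalance(uniform_keys), bucket_imbalance(skewed_keys))
```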
I've pushed to https://github.com/StreamHPC/openmm-hip/commits/large-blocks. These changes provide a significant improvement for large systems and no or negligible regressions for small systems (tested on gfx908 and gfx1030, ROCm 5.7). For example, stmv reaches 39 ns/day on MI100 and 38 ns/day on V620. Regarding the stability issues: benchmarks and tests run without NaNs.
But Single and Mixed pass.
OpenMM 8.1 has a performance improvement for larger systems: openmm/openmm#4147.
I attempted to port it myself but was unsuccessful. findInteractingBlocks.hip has diverged a lot from findInteractingBlocks.cu, and it has also been modified to execute as a flat kernel. Is there any chance this enhancement could be ported to HIP?
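For context, here is a rough host-side paraphrase of the general idea behind the large-blocks enhancement (a Python/NumPy sketch, not the actual findInteractingBlocks kernel logic; the block size of 32, the group size, all function names, and the lack of periodic-boundary handling are simplifications/assumptions):

```python
# Sketch: merge the bounding boxes of groups of atom blocks into coarser
# "large block" boxes, test the coarse boxes against the cutoff first, and
# only examine individual block pairs inside coarse pairs that survive.
import numpy as np

def block_boxes(positions, block_size=32):
    """Axis-aligned bounding boxes (min, max corners) of consecutive atom blocks."""
    mins, maxs = [], []
    for i in range(0, len(positions), block_size):
        chunk = positions[i:i + block_size]
        mins.append(chunk.min(axis=0))
        maxs.append(chunk.max(axis=0))
    return np.array(mins), np.array(maxs)

def merge_boxes(mins, maxs, group=32):
    """Merge the boxes of `group` consecutive blocks into one large-block box."""
    big_mins = [mins[i:i + group].min(axis=0) for i in range(0, len(mins), group)]
    big_maxs = [maxs[i:i + group].max(axis=0) for i in range(0, len(maxs), group)]
    return np.array(big_mins), np.array(big_maxs)

def boxes_within_cutoff(min_a, max_a, min_b, max_b, cutoff):
    """True if the gap between two axis-aligned boxes is within the cutoff
    (periodic boundary conditions are ignored here)."""
    gap = np.maximum(0.0, np.maximum(min_a - max_b, min_b - max_a))
    return float(np.dot(gap, gap)) <= cutoff * cutoff
```

The point of the coarse pass is that for very large systems most large-block pairs fail the cutoff test, so far fewer fine-grained block pairs need to be examined, which is why the benefit only appears above some system size.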