Performance analysis on small problem sizes part 1
1) Timing of the legacy CUDA DMC driver with dmc-a32-e384-gpu-w56 running on Summit, 56 walkers per GPU. On a Summit node, we place 6 MPI ranks, each with 7 threads. All times are in seconds.
```
Timer                                       Inclusive_time  Exclusive_time  Calls   Time_per_call
DMCcuda                                     6.2804          0.3358          1       6.280354462
  DMCcuda::Branch                           0.1751          0.1751          25      0.007002299
  DMCcuda::Drift_Diffuse                    4.9510          2.1270          25      0.198041360
    WaveFunction::SlaterDet_VGL             1.4216          1.4216          28800   0.000049361
    WaveFunction::SlaterDet_accept          0.2762          0.2762          9600    0.000028766
    WaveFunction::jastrow_V                 0.5673          0.5673          38400   0.000014773
    WaveFunction::jastrow_VGL               0.0009          0.0009          9600    0.000000097
    WaveFunction::jastrow_derivs            0.5580          0.5580          28800   0.000019376
  DMCcuda::Hamiltonian                      0.6136          0.0001          25      0.024542427
    Hamiltonian::ElecElec                   0.1417          0.1417          25      0.005666318
    Hamiltonian::IonIon                     0.0000          0.0000          25      0.000000818
    Hamiltonian::Kinetic                    0.0015          0.0015          25      0.000059324
    Hamiltonian::LocalECP                   0.0757          0.0757          25      0.003028199
    Hamiltonian::NonLocalECP                0.3946          0.0941          25      0.015784679
      WaveFunction::SlaterDet_NLratio       0.2534          0.2534          50      0.005068591
      WaveFunction::jastrow_VGL             0.0157          0.0157          50      0.000313796
      WaveFunction::jastrow_accept          0.0314          0.0314          50      0.000627142
  DMCcuda::resize                           0.0029          0.0029          25      0.000117462
  WaveFunction::SlaterDet_VGL               0.0207          0.0207          50      0.000414914
  WaveFunction::SlaterDet_accept            0.0054          0.0054          25      0.000214380
  WaveFunction::SlaterDet_recompute         0.0316          0.0316          1       0.031569478
  WaveFunction::jastrow_NLratio             0.0000          0.0000          1       0.000000209
  WaveFunction::jastrow_V                   0.1297          0.1297          75      0.001729709
  WaveFunction::jastrow_VGL                 0.0000          0.0000          25      0.000000124
  WaveFunction::jastrow_accept              0.0000          0.0000          1       0.000000311
  WaveFunction::jastrow_derivs              0.0146          0.0146          50      0.000291362
```
2) Timing of the unified DMC driver with dmc-a32-e384-cpu-XL-batch-w56 running on Summit, 7 batches of 8 walkers per batch (7 x 8 = 56 walkers per rank, matching run 1). On a Summit node, we place 6 MPI ranks, each with 7 threads.
```
Timer                                       Inclusive_time  Exclusive_time  Calls   Time_per_call
DMCBatched::RunSteps                        18.7519         0.2868          25      0.750077640
  DMCBatched::Hamiltonian                   10.8243         0.0002          25      0.432971240
    Hamiltonian::ElecElec                   0.2055          0.2055          25      0.008218260
    Hamiltonian::IonIon                     0.0001          0.0001          25      0.000002132
    Hamiltonian::Kinetic                    0.0001          0.0001          25      0.000005911
    Hamiltonian::LocalECP                   0.0400          0.0400          25      0.001600872
    Hamiltonian::NonLocalECP                10.5784         0.0769          25      0.423137931
      ParticleSet::update                   1.9177          1.9177          34999   0.000054793
      WaveFunction::J1OrbitalSoA_NLratio    0.0716          0.0716          34999   0.000002045
      WaveFunction::J2OrbitalSoA_NLratio    1.0805          1.0805          34999   0.000030873
      WaveFunction::SlaterDet_NLratio       7.4317          0.0128          34999   0.000212341
        DiracDeterminantBase::spoval        7.4189          7.4189          34999   0.000211976
  DMCBatched::MovePbyP                      7.4942          0.0734          25      0.299766485
    ParticleSet::computeNewPosDT            0.3596          0.3596          9600    0.000037462
    ParticleSet::donePbyP                   2.3471          2.3471          200     0.011735551
    ParticleSet::setActive                  0.3623          0.3623          9600    0.000037744
    WaveFunction::J1OrbitalSoA_VGL          0.0571          0.0571          19200   0.000002976
    WaveFunction::J1OrbitalSoA_accept       0.0113          0.0113          9800    0.000001151
    WaveFunction::J2OrbitalSoA_VGL          0.5086          0.5086          19200   0.000026490
    WaveFunction::J2OrbitalSoA_accept       0.6719          0.6719          9800    0.000068564
    WaveFunction::SlaterDet_VGL             1.9953          0.0434          19200   0.000103922
      DiracDeterminantBase::ratio           0.3203          0.3203          153600  0.000002085
      DiracDeterminantBase::spovgl          1.6316          1.6316          9600    0.000169960
    WaveFunction::SlaterDet_accept          1.1074          0.0326          9800    0.000113000
      DiracDeterminantBase::update          1.0748          1.0748          77105   0.000013939
```
3) Timing of the CPU DMC driver with dmc-a32-e384-cpu-intel-ref-w56 running on a dual-socket Xeon 8180, 56 walkers per MPI rank. On this node, we place 8 MPI ranks, each with 7 threads.
```
Timer                                       Inclusive_time  Exclusive_time  Calls   Time_per_call
DMC                                         7.7408          0.0686          1       7.740798950
  DMCUpdatePbyP::Hamiltonian                4.0683          0.0015          200     0.020341556
    Hamiltonian::ElecElec                   0.2101          0.2101          200     0.001050303
    Hamiltonian::IonIon                     0.0001          0.0001          200     0.000000632
    Hamiltonian::Kinetic                    0.0005          0.0005          200     0.000002314
    Hamiltonian::LocalECP                   0.0503          0.0503          200     0.000251600
    Hamiltonian::NonLocalECP                3.8059          0.1174          200     0.019029355
      ParticleSet::update                   0.4503          0.4503          35335   0.000012745
      WaveFunction::SlaterDet_NLratio       2.9691          0.0174          35335   0.000084028
        DiracDeterminantBase::spoval        2.9517          2.9517          35335   0.000083536
      WaveFunction::jastrow_NLratio         0.2690          0.2690          70670   0.000003807
  DMCUpdatePbyP::movePbyP                   3.2414          0.1897          200     0.016206890
    ParticleSet::computeNewPosDT            0.0840          0.0840          76800   0.000001094
    ParticleSet::donePbyP                   0.0740          0.0740          200     0.000370136
    ParticleSet::setActive                  0.1053          0.1053          76800   0.000001371
    WaveFunction::SlaterDet_VGL             1.7837          0.0651          153600  0.000011612
      DiracDeterminantBase::ratio           0.1867          0.1867          153600  0.000001216
      DiracDeterminantBase::spovgl          1.5319          1.5319          76800   0.000019946
    WaveFunction::SlaterDet_accept          0.3866          0.0361          76896   0.000005028
      DiracDeterminantBase::update          0.3505          0.3505          77096   0.000004546
    WaveFunction::jastrow_VGL               0.3361          0.3361          307200  0.000001094
    WaveFunction::jastrow_accept            0.2819          0.2819          153792  0.000001833
```
Bottleneck 1: the Hamiltonian spends most of its time in NonLocalECP spoval. Currently there is only quadrature-point virtual-move batching, no walker batching; this is the reason for the 34999 call counts, and the time is basically spent in GPU runtime/driver overhead. Fix: implement NLPP with walker batching, namely propagate the batched interfaces down to NonLocalECPComponent and connect them to the batched interfaces of the TWF. The call counts should drop by 1 to 2 orders of magnitude and the time should drop from 7.4189 s to below 1 s.
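As a rough illustration of why walker batching helps here, the toy C++ sketch below counts SPO evaluations for the current per-walker pattern versus a gathered, walker-batched pattern. All names in it (QuadraturePointSet, evaluate_spo, walker_batched) are made up for the illustration; they only mimic the structure of NonLocalECPComponent driving SPO evaluations through the TWF, not the actual QMCPACK interfaces.

```cpp
// Toy illustration of walker batching for the NLPP ratio evaluation. All names here
// are hypothetical; they only mimic the structure of the real code, not its interfaces.
#include <cstdio>
#include <vector>

struct QuadraturePointSet
{
  std::vector<double> positions; // virtual moves of one walker around one electron
};

static int num_spo_calls = 0; // stands in for the kernel-launch / driver-overhead count

// One SPO evaluation over an arbitrary number of positions. For small inputs the cost
// is dominated by per-call overhead, which is the bottleneck observed above.
std::vector<double> evaluate_spo(const std::vector<double>& positions)
{
  ++num_spo_calls;
  return std::vector<double>(positions.size(), 1.0); // dummy ratios
}

// Current pattern: one small call per walker (the tens of thousands of tiny calls above).
void per_walker(const std::vector<QuadraturePointSet>& walkers)
{
  for (const auto& w : walkers)
    evaluate_spo(w.positions);
}

// Proposed pattern: gather the quadrature points of all walkers, evaluate once, scatter back.
void walker_batched(const std::vector<QuadraturePointSet>& walkers)
{
  std::vector<double> gathered;
  for (const auto& w : walkers)
    gathered.insert(gathered.end(), w.positions.begin(), w.positions.end());
  evaluate_spo(gathered); // one large evaluation instead of walkers.size() small ones
}

int main()
{
  // 8 walkers per batch, 12 quadrature points each (illustrative numbers).
  std::vector<QuadraturePointSet> walkers(8, QuadraturePointSet{std::vector<double>(12, 0.0)});
  per_walker(walkers);
  std::printf("per-walker SPO calls:     %d\n", num_spo_calls); // 8
  num_spo_calls = 0;
  walker_batched(walkers);
  std::printf("walker-batched SPO calls: %d\n", num_spo_calls); // 1
  return 0;
}
```

The point is only that the same amount of numerical work gets issued in one large call instead of many tiny ones, so the per-call GPU runtime/driver overhead stops dominating.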
Bottleneck 2: in ParticleSet::donePbyP, the structure factor calculation is slow. This has probably already been solved by introducing MASSV; we need to make it work with the XL compiler. We can probably save about 2 seconds.
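For context, here is a minimal sketch of the kind of rho_k accumulation the structure factor update performs; it is illustrative only, not the actual StructFact/donePbyP code. Each particle contributes cos(k.r) and sin(k.r) for every k vector, so the cost is dominated by scalar sin/cos calls unless a vectorized math library (MASS/MASSV under the XL compiler) or a vectorized sincos handles the inner loop.

```cpp
// Rough sketch of a structure-factor (rho_k) accumulation. Names and layout are
// illustrative, not the real QMCPACK StructFact code.
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

void accumulate_rhok(const std::vector<std::array<double, 3>>& R,    // particle positions
                     const std::vector<std::array<double, 3>>& kpts, // k-point list
                     std::vector<double>& rhok_re,
                     std::vector<double>& rhok_im)
{
  rhok_re.assign(kpts.size(), 0.0);
  rhok_im.assign(kpts.size(), 0.0);
  for (const auto& r : R)
    for (std::size_t ik = 0; ik < kpts.size(); ++ik)
    {
      const double phase = kpts[ik][0] * r[0] + kpts[ik][1] * r[1] + kpts[ik][2] * r[2];
      // The scalar sin/cos here is the hot spot; a vector math library (e.g. MASS/MASSV
      // with XL) can evaluate a whole inner loop worth of sin/cos per call.
      rhok_re[ik] += std::cos(phase);
      rhok_im[ik] += std::sin(phase);
    }
}

int main()
{
  // Illustrative sizes only: 384 electrons and a few hundred k points.
  std::vector<std::array<double, 3>> R(384, {0.1, 0.2, 0.3});
  std::vector<std::array<double, 3>> kpts(600, {1.0, 0.0, 0.0});
  std::vector<double> re, im;
  accumulate_rhok(R, kpts, re, im);
  return 0;
}
```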
Bottleneck 3: at this small problem size, the distance table computation can probably be reduced by 2x even without offload, so we may gain about 1 second.
After solving bottlenecks 1-3, we should be able to run at least 2x faster than the current code, and gain even higher throughput by increasing the walker count. For small problems like this, I expect to run more than 50 walkers per GPU.
When solving a 512-atom problem, there is only 1 walker per batch. To add more walker batching, the NLPP (Hamiltonian) evaluation needs to be taken out of the movePbyP crowd scope and walker batching needs to happen at the population level. This requires quite a few steps (a sketch follows the list):
- Split updateBuffer into evaluateGL and copyToBuffer. updateBuffer currently handles some final computation as well as buffer handling; these two roles must be separated.
- Move copyFromBuffer and copyToBuffer out of the crowd scope and into the population scope. At the same time, move H.evaluate to the population scope. This should provide enough computation for walker batching in NLPP.
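A schematic of the reorganized step under these two changes; the class and method names below are placeholders that follow the wording of the steps above, not the actual DMCBatched driver code. Per-particle moves and evaluateGL stay in the crowd scope, while copyToBuffer/copyFromBuffer and the Hamiltonian evaluation run at the population scope, so NLPP sees every walker of the rank at once.

```cpp
// Schematic only: placeholder types and methods following the wording of the steps above.
#include <vector>

struct Walker {};
struct Crowd { std::vector<Walker*> walkers; };

struct TrialWaveFunction
{
  void evaluateGL(Walker&) {}      // compute part split out of updateBuffer (crowd scope)
  void copyToBuffer(Walker&) {}    // pure buffer handling (population scope)
  void copyFromBuffer(Walker&) {}  // likewise population scope
};

struct Hamiltonian
{
  // Evaluated over the whole population so NLPP can batch over all walkers of the rank.
  void mw_evaluate(const std::vector<Walker*>& all_walkers) { (void)all_walkers; }
};

void dmc_step(std::vector<Crowd>& crowds, TrialWaveFunction& psi, Hamiltonian& ham)
{
  std::vector<Walker*> population;

  // Crowd scope: per-particle moves plus the computational tail of the former updateBuffer.
  for (auto& crowd : crowds)
    for (Walker* w : crowd.walkers)
    {
      // ... movePbyP for this walker ...
      psi.evaluateGL(*w);
      population.push_back(w);
    }

  // Population scope: buffer handling and Hamiltonian evaluation over every walker of the
  // rank, which provides enough concurrent work for walker batching in NLPP.
  for (Walker* w : population)
    psi.copyToBuffer(*w);
  ham.mw_evaluate(population);
  for (Walker* w : population)
    psi.copyFromBuffer(*w); // exact placement is schematic; it also leaves the crowd scope
}

int main()
{
  // e.g. 7 crowds of 8 walkers on one rank (illustrative numbers from run 2 above)
  std::vector<Walker> storage(56);
  std::vector<Crowd> crowds(7);
  for (int i = 0; i < 56; ++i)
    crowds[i / 8].walkers.push_back(&storage[i]);
  TrialWaveFunction psi;
  Hamiltonian ham;
  dmc_step(crowds, psi, ham);
  return 0;
}
```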
With the PPPG boundary condition, the distance table computation is quite heavy in the 128- and 256-atom cells.
2020-01-07 update: NLPP walker batching is completed and MASS is connected, so bottlenecks 1 and 2 are both resolved. This is the same run as 2) above on Summit.
```
Timer                                       Inclusive_time  Exclusive_time  Calls   Time_per_call
DMCBatched::RunSteps                        9.8830          0.2178          25      0.395320220
  DMCBatched::Hamiltonian                   3.9805          0.0002          25      0.159221144
    Hamiltonian::ElecElec                   0.2110          0.2110          25      0.008440138
    Hamiltonian::IonIon                     0.0000          0.0000          25      0.000001949
    Hamiltonian::Kinetic                    0.0002          0.0002          25      0.000006393
    Hamiltonian::LocalECP                   0.0453          0.0453          25      0.001812433
    Hamiltonian::NonLocalECP                3.7239          0.0545          25      0.148954065
      ParticleSet::update                   1.8335          1.8335          33443   0.000054826
      WaveFunction::J1OrbitalSoA_NLratio    0.0695          0.0695          4673    0.000014876
      WaveFunction::J2OrbitalSoA_NLratio    1.0767          1.0767          4673    0.000230406
      WaveFunction::SlaterDet_NLratio       0.6896          0.0071          4673    0.000147569
        DiracDeterminantBase::ratio         0.0584          0.0584          4673    0.000012499
        DiracDeterminantBase::spoval        0.6241          0.6241          4673    0.000133552
  DMCBatched::MovePbyP                      5.5434          0.0590          25      0.221735064
    ParticleSet::acceptMove                 0.0292          0.0292          76710   0.000000381
    ParticleSet::computeNewPosDT            0.7353          0.7353          9600    0.000076589
    ParticleSet::donePbyP                   0.2715          0.2715          200     0.001357680
    WaveFunction::J1OrbitalSoA_VGL          0.0577          0.0577          19200   0.000003005
    WaveFunction::J1OrbitalSoA_accept       0.0114          0.0114          9800    0.000001159
    WaveFunction::J2OrbitalSoA_VGL          0.5101          0.5101          19200   0.000026568
    WaveFunction::J2OrbitalSoA_accept       0.6705          0.6705          9800    0.000068423
    WaveFunction::SlaterDet_VGL             2.1688          0.0481          19200   0.000112956
      DiracDeterminantBase::ratio           0.3256          0.3256          153600  0.000002120
      DiracDeterminantBase::spovgl          1.7950          1.7950          9600    0.000186982
    WaveFunction::SlaterDet_accept          1.0299          0.0309          9800    0.000105090
      DiracDeterminantBase::update          0.9989          0.9989          77110   0.000012955
```
NonLocalECP spoval now needs only 0.6241 seconds, and ParticleSet::donePbyP goes down to 0.27 seconds.