Performance analysis on small problem sizes part 2
Ye Luo edited this page Jul 14, 2020 · 5 revisions
Timing of the DMC unified driver for dmc-a32-e384-cpu-Clang-batch-w1344 running on Summit, with 7 batches of 192 walkers each (7 × 192 = 1344 walkers, matching the w1344 suffix). Each Summit node runs 6 MPI ranks with 7 threads per rank. All times are in seconds.
Timer                                     Inclusive_time  Exclusive_time    Calls  Time_per_call
DMCBatched::RunSteps                            139.6967          1.3720       25    5.587869120
  DMCBatched::Hamiltonian                        76.8406          0.0103       25    3.073624086
    Hamiltonian::ElecElec                         4.8134          4.8134       25    0.192537746
    Hamiltonian::IonIon                           0.0009          0.0009       25    0.000035486
    Hamiltonian::Kinetic                          0.0048          0.0048       25    0.000191803
    Hamiltonian::LocalECP                         1.2211          1.2211       25    0.048843880
    Hamiltonian::NonLocalECP                     70.7901          1.8457       25    2.831604805
      ParticleSet::update                        31.6759         31.6759     5308    0.005967586
      WaveFunction::J1OrbitalSoA_NLratio          3.3165          3.3165     5308    0.000624810
      WaveFunction::J2OrbitalSoA_NLratio         26.5605         26.5605     5308    0.005003858
      WaveFunction::SlaterDet_NLratio             7.3915          0.1761     5308    0.001392525
        DiracDeterminantBase::ratio               0.3734          0.3734     5308    0.000070344
        DiracDeterminantBase::spoval              6.8421          0.4537     5308    0.001289007
          SplineC2ROMP::offload                   6.3883          6.3883     5308    0.001203532
  DMCBatched::MovePbyP                           59.5932          1.7929       25    2.383727245
    ParticleSet::acceptMove                       2.3101          2.3101  1840858    0.000001255
    ParticleSet::computeNewPosDT                 13.6748         13.6748     9600    0.001424464
    ParticleSet::donePbyP                         3.9055          3.9055     4800    0.000813638
    WaveFunction::J1OrbitalSoA_VGL                3.4353          3.4353    19200    0.000178924
    WaveFunction::J1OrbitalSoA_accept             0.3011          0.3011     9625    0.000031284
    WaveFunction::J2OrbitalSoA_VGL                9.7676          9.7676    19200    0.000508727
    WaveFunction::J2OrbitalSoA_accept            12.5381         12.5381     9625    0.001302659
    WaveFunction::SlaterDet_VGL                   5.0797          0.2779    19200    0.000264568
      DiracDeterminantBase::ratio                 1.2942          1.2942     9600    0.000134814
      DiracDeterminantBase::spovgl                3.5076          0.2003     9600    0.000365374
        SplineC2ROMP::offload                     3.3073          3.3073     9600    0.000344506
    WaveFunction::SlaterDet_accept                6.7881          0.2510     9625    0.000705258
      DiracDeterminantBase::update                1.1774          1.1774     9650    0.000122007
      DiracDeterminantBatched::D2H                5.3597          5.3597       50    0.107194562
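To help read the rows above: the four numeric columns are taken here to be inclusive time, exclusive time, call count, and inclusive time per call, which is the usual layout of QMCPACK's stack timer output. The parent/child grouping used below is an assumption inferred from the numbers. Under that reading, a parent's inclusive time should equal its own exclusive time plus the inclusive times of its children. A minimal Python sketch of that check, using the Hamiltonian::NonLocalECP rows copied from the table:

```python
# Rows copied from the table above: (name, inclusive s, exclusive s, calls).
# Assumption: these four timers are the direct children of NonLocalECP;
# the grouping is inferred from the numbers, not stated on the page.
parent = ("Hamiltonian::NonLocalECP", 70.7901, 1.8457, 25)
children = [
    ("ParticleSet::update", 31.6759, 31.6759, 5308),
    ("WaveFunction::J1OrbitalSoA_NLratio", 3.3165, 3.3165, 5308),
    ("WaveFunction::J2OrbitalSoA_NLratio", 26.5605, 26.5605, 5308),
    ("WaveFunction::SlaterDet_NLratio", 7.3915, 0.1761, 5308),
]


def reconstructed_inclusive(parent_row, child_rows):
    """Exclusive time of the parent plus the inclusive time of each child."""
    _, _, exclusive, _ = parent_row
    return exclusive + sum(inclusive for _, inclusive, _, _ in child_rows)


def time_per_call(row):
    """Inclusive time divided by the number of calls."""
    _, inclusive, _, calls = row
    return inclusive / calls


total = reconstructed_inclusive(parent, children)
print(f"reconstructed inclusive time: {total:.4f} s (table: {parent[1]:.4f} s)")
print(f"time per call: {time_per_call(parent):.6f} s")
```

The reconstructed value agrees with the tabulated inclusive time to the printed precision, which supports reading the rows as a nested call tree.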
ParticleSet::update and the WaveFunction::J2OrbitalSoA timers account for about 60% of the run time: 84.4 seconds out of a total of 139.7 seconds.
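The page does not say exactly which timers were added up to reach 84.4 seconds. The Python sketch below simply sums the rows whose names match, using values copied from the table. As an assumption for illustration only, it also tries adding ParticleSet::donePbyP, since that happens to reproduce the quoted figure.

```python
# Inclusive times in seconds, copied from the first timing table.
total_run_steps = 139.6967  # DMCBatched::RunSteps

named_timers = {
    "ParticleSet::update": 31.6759,
    "WaveFunction::J2OrbitalSoA_NLratio": 26.5605,
    "WaveFunction::J2OrbitalSoA_VGL": 9.7676,
    "WaveFunction::J2OrbitalSoA_accept": 12.5381,
}
# Assumption: ParticleSet::donePbyP may have been counted as well;
# the page does not list the timers behind the 84.4 s figure.
done_pbyp = 3.9055


def share(seconds, total):
    """Percentage of the total run time."""
    return 100.0 * seconds / total


named_sum = sum(named_timers.values())
with_done = named_sum + done_pbyp

print(f"update + J2 timers:       {named_sum:.1f} s "
      f"({share(named_sum, total_run_steps):.1f}%)")
print(f"... plus donePbyP:        {with_done:.1f} s "
      f"({share(with_done, total_run_steps):.1f}%)")
```

The four named timers alone come to roughly 80.5 seconds (about 58%), and including donePbyP gives roughly 84.4 seconds (about 60%). Either way, the conclusion that these routines dominate the run is unchanged.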
Timing of the DMC unified driver for the same dmc-a32-e384-cpu-Clang-batch-w1344 case, run on Summit without the electron-electron (e-e) distance table, again with 7 batches of 192 walkers each. Each Summit node runs 6 MPI ranks with 7 threads per rank.
Timer                                     Inclusive_time  Exclusive_time    Calls  Time_per_call
DMCBatched::RunSteps                             84.0625          1.1251       25    3.362500429
  DMCBatched::Hamiltonian                        50.4087          0.0099       25    2.016346045
    Hamiltonian::IonIon                           0.0008          0.0008       25    0.000032129
    Hamiltonian::Kinetic                          0.0053          0.0053       25    0.000211229
    Hamiltonian::LocalECP                         1.2310          1.2310       25    0.049241915
    Hamiltonian::NonLocalECP                     49.1617          1.7872       25    1.966466036
      ParticleSet::update                         3.8670          3.8670     6532    0.000592009
      WaveFunction::J1OrbitalSoA_NLratio          2.9536          2.9536     6532    0.000452171
      WaveFunction::SlaterDet_NLratio            40.5539          0.1461     6532    0.006208494
        DiracDeterminantBase::ratio               0.3096          0.3096     6532    0.000047397
        DiracDeterminantBase::spoval             40.0982          0.6757     6532    0.006138737
          SplineC2ROMP::offload                  39.4225         39.4225     6532    0.006035293
  DMCBatched::MovePbyP                           29.7778          1.1524       25    1.191113958
    ParticleSet::acceptMove                       0.3582          0.3582  1839049    0.000000195
    ParticleSet::computeNewPosDT                  1.1062          1.1062     9600    0.000115233
    ParticleSet::donePbyP                         3.9309          3.9309     4800    0.000818944
    WaveFunction::J1OrbitalSoA_VGL                1.6935          1.6935    19200    0.000088204
    WaveFunction::J1OrbitalSoA_accept             0.1444          0.1444     9625    0.000014999
    WaveFunction::SlaterDet_VGL                  14.9466          0.2291    19200    0.000778468
      DiracDeterminantBase::ratio                 6.5763          6.5763     9600    0.000685036
      DiracDeterminantBase::spovgl                8.1412          0.1294     9600    0.000848041
        SplineC2ROMP::offload                     8.0118          8.0118     9600    0.000834567
    WaveFunction::SlaterDet_accept                6.4456          0.1648     9625    0.000669676
      DiracDeterminantBase::update                0.9459          0.9459     9650    0.000098017
      DiracDeterminantBatched::D2H                5.3350          5.3350       50    0.106699076
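To put the two runs side by side, the following Python sketch tabulates the change in inclusive time for a handful of timers. The values are copied from the two tables on this page; the choice of timers and the dictionary names are illustrative, not part of QMCPACK. Timers that do not appear in the second table (such as Hamiltonian::ElecElec) are reported as missing rather than treated as zero.

```python
# Inclusive times in seconds, copied from the two timing tables on this page.
with_ee_table = {
    "DMCBatched::RunSteps": 139.6967,
    "DMCBatched::Hamiltonian": 76.8406,
    "Hamiltonian::ElecElec": 4.8134,
    "Hamiltonian::NonLocalECP": 70.7901,
    "ParticleSet::update": 31.6759,
    "WaveFunction::SlaterDet_NLratio": 7.3915,
    "DMCBatched::MovePbyP": 59.5932,
    "ParticleSet::computeNewPosDT": 13.6748,
    "WaveFunction::SlaterDet_VGL": 5.0797,
}
without_ee_table = {
    "DMCBatched::RunSteps": 84.0625,
    "DMCBatched::Hamiltonian": 50.4087,
    "Hamiltonian::NonLocalECP": 49.1617,
    "ParticleSet::update": 3.8670,
    "WaveFunction::SlaterDet_NLratio": 40.5539,
    "DMCBatched::MovePbyP": 29.7778,
    "ParticleSet::computeNewPosDT": 1.1062,
    "WaveFunction::SlaterDet_VGL": 14.9466,
}


def compare(first, second):
    """Return (name, first_s, second_s, change_s) rows.

    second_s and change_s are None when a timer is absent from the second run.
    """
    rows = []
    for name, t_first in first.items():
        t_second = second.get(name)
        change = None if t_second is None else t_second - t_first
        rows.append((name, t_first, t_second, change))
    return rows


for name, t_first, t_second, change in compare(with_ee_table, without_ee_table):
    if t_second is None:
        print(f"{name:34s} {t_first:9.4f}   (not reported)")
    else:
        print(f"{name:34s} {t_first:9.4f} {t_second:9.4f} {change:+9.4f}")
```

The timers under Hamiltonian::NonLocalECP were called 5308 times in the first run and 6532 times in the second, so the differences for those rows are not strictly like-for-like.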