
Performance analysis on small problem sizes part 1


1) Timing of the DMC legacy CUDA driver with dmc-a32-e384-gpu-w56 running on Summit, 56 walkers per GPU. On a Summit node, we put 6 MPI ranks, each with 7 threads. In the timer output below, the columns are inclusive time (s), exclusive time (s), number of calls, and time per call (s).

  DMCcuda                                   6.2804     0.3358              1       6.280354462
    DMCcuda::Branch                         0.1751     0.1751             25       0.007002299
    DMCcuda::Drift_Diffuse                  4.9510     2.1270             25       0.198041360
      WaveFunction::SlaterDet_VGL           1.4216     1.4216          28800       0.000049361
      WaveFunction::SlaterDet_accept        0.2762     0.2762           9600       0.000028766
      WaveFunction::jastrow_V               0.5673     0.5673          38400       0.000014773
      WaveFunction::jastrow_VGL             0.0009     0.0009           9600       0.000000097
      WaveFunction::jastrow_derivs          0.5580     0.5580          28800       0.000019376
    DMCcuda::Hamiltonian                    0.6136     0.0001             25       0.024542427
      Hamiltonian::ElecElec                 0.1417     0.1417             25       0.005666318
      Hamiltonian::IonIon                   0.0000     0.0000             25       0.000000818
      Hamiltonian::Kinetic                  0.0015     0.0015             25       0.000059324
      Hamiltonian::LocalECP                 0.0757     0.0757             25       0.003028199
      Hamiltonian::NonLocalECP              0.3946     0.0941             25       0.015784679
        WaveFunction::SlaterDet_NLratio     0.2534     0.2534             50       0.005068591
        WaveFunction::jastrow_VGL           0.0157     0.0157             50       0.000313796
        WaveFunction::jastrow_accept        0.0314     0.0314             50       0.000627142
    DMCcuda::resize                         0.0029     0.0029             25       0.000117462
    WaveFunction::SlaterDet_VGL             0.0207     0.0207             50       0.000414914
    WaveFunction::SlaterDet_accept          0.0054     0.0054             25       0.000214380
    WaveFunction::SlaterDet_recompute       0.0316     0.0316              1       0.031569478
    WaveFunction::jastrow_NLratio           0.0000     0.0000              1       0.000000209
    WaveFunction::jastrow_V                 0.1297     0.1297             75       0.001729709
    WaveFunction::jastrow_VGL               0.0000     0.0000             25       0.000000124
    WaveFunction::jastrow_accept            0.0000     0.0000              1       0.000000311
    WaveFunction::jastrow_derivs            0.0146     0.0146             50       0.000291362

2) Timing of the DMC unified driver with dmc-a32-e384-cpu-XL-batch-w56 running on Summit, 7 batches and 8 walkers per batch. On a Summit node, we put 6 MPI ranks, each with 7 threads.

    DMCBatched::RunSteps                        18.7519     0.2868             25       0.750077640
      DMCBatched::Hamiltonian                   10.8243     0.0002             25       0.432971240
        Hamiltonian::ElecElec                    0.2055     0.2055             25       0.008218260
        Hamiltonian::IonIon                      0.0001     0.0001             25       0.000002132
        Hamiltonian::Kinetic                     0.0001     0.0001             25       0.000005911
        Hamiltonian::LocalECP                    0.0400     0.0400             25       0.001600872
        Hamiltonian::NonLocalECP                10.5784     0.0769             25       0.423137931
          ParticleSet::update                    1.9177     1.9177          34999       0.000054793
          WaveFunction::J1OrbitalSoA_NLratio     0.0716     0.0716          34999       0.000002045
          WaveFunction::J2OrbitalSoA_NLratio     1.0805     1.0805          34999       0.000030873
          WaveFunction::SlaterDet_NLratio        7.4317     0.0128          34999       0.000212341
            DiracDeterminantBase::spoval         7.4189     7.4189          34999       0.000211976
      DMCBatched::MovePbyP                       7.4942     0.0734             25       0.299766485
        ParticleSet::computeNewPosDT             0.3596     0.3596           9600       0.000037462
        ParticleSet::donePbyP                    2.3471     2.3471            200       0.011735551
        ParticleSet::setActive                   0.3623     0.3623           9600       0.000037744
        WaveFunction::J1OrbitalSoA_VGL           0.0571     0.0571          19200       0.000002976
        WaveFunction::J1OrbitalSoA_accept        0.0113     0.0113           9800       0.000001151
        WaveFunction::J2OrbitalSoA_VGL           0.5086     0.5086          19200       0.000026490
        WaveFunction::J2OrbitalSoA_accept        0.6719     0.6719           9800       0.000068564
        WaveFunction::SlaterDet_VGL              1.9953     0.0434          19200       0.000103922
          DiracDeterminantBase::ratio            0.3203     0.3203         153600       0.000002085
          DiracDeterminantBase::spovgl           1.6316     1.6316           9600       0.000169960
        WaveFunction::SlaterDet_accept           1.1074     0.0326           9800       0.000113000
          DiracDeterminantBase::update           1.0748     1.0748          77105       0.000013939

3) Timing of the DMC CPU driver with dmc-a32-e384-cpu-intel-ref-w56 running on a dual-socket Xeon 8180 node, 56 walkers per MPI rank. On this node, we put 8 MPI ranks, each with 7 threads.

  DMC                                       7.7408     0.0686              1       7.740798950
    DMCUpdatePbyP::Hamiltonian              4.0683     0.0015            200       0.020341556
      Hamiltonian::ElecElec                 0.2101     0.2101            200       0.001050303
      Hamiltonian::IonIon                   0.0001     0.0001            200       0.000000632
      Hamiltonian::Kinetic                  0.0005     0.0005            200       0.000002314
      Hamiltonian::LocalECP                 0.0503     0.0503            200       0.000251600
      Hamiltonian::NonLocalECP              3.8059     0.1174            200       0.019029355
        ParticleSet::update                 0.4503     0.4503          35335       0.000012745
        WaveFunction::SlaterDet_NLratio     2.9691     0.0174          35335       0.000084028
          DiracDeterminantBase::spoval      2.9517     2.9517          35335       0.000083536
        WaveFunction::jastrow_NLratio       0.2690     0.2690          70670       0.000003807
    DMCUpdatePbyP::movePbyP                 3.2414     0.1897            200       0.016206890
      ParticleSet::computeNewPosDT          0.0840     0.0840          76800       0.000001094
      ParticleSet::donePbyP                 0.0740     0.0740            200       0.000370136
      ParticleSet::setActive                0.1053     0.1053          76800       0.000001371
      WaveFunction::SlaterDet_VGL           1.7837     0.0651         153600       0.000011612
        DiracDeterminantBase::ratio         0.1867     0.1867         153600       0.000001216
        DiracDeterminantBase::spovgl        1.5319     1.5319          76800       0.000019946
      WaveFunction::SlaterDet_accept        0.3866     0.0361          76896       0.000005028
        DiracDeterminantBase::update        0.3505     0.3505          77096       0.000004546
      WaveFunction::jastrow_VGL             0.3361     0.3361         307200       0.000001094
      WaveFunction::jastrow_accept          0.2819     0.2819         153792       0.000001833

bottleneck 1

The Hamiltonian spends most of its time in NonLocalECP spoval. Currently only the quadrature-point virtual moves are batched; there is no walker batching. This explains the 34999 call count, and the time is basically spent in GPU runtime/driver overhead. Fix: implement NLPP with walker batching, namely propagate the batched interfaces down to NonLocalECPComponent and then connect them to the batched interfaces of the TWF. The call count should drop by 1 to 2 orders of magnitude and the time should drop from 7.4189 to below 1 second.
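
For illustration, a minimal C++ sketch of walker-batched NLPP ratio evaluation follows. The types (Pos, VirtualMoves, BatchedTWF) and the mw_ratios call are hypothetical placeholders, not the actual QMCPACK batched interfaces; the point is that the per-quadrature-point, per-walker calls collapse into one large call.

    #include <vector>

    struct Pos { double x, y, z; };

    struct VirtualMoves                 // quadrature-point moves gathered over all walkers
    {
      std::vector<int> walker_id;       // which walker each move belongs to
      std::vector<int> electron_id;     // electron being displaced
      std::vector<Pos> new_pos;         // proposed quadrature-point position
    };

    struct BatchedTWF
    {
      // One call evaluates psi(new)/psi(old) for every gathered move.
      void mw_ratios(const VirtualMoves& moves, std::vector<double>& ratios)
      {
        // The real code would launch one (or a few) large batched GPU kernels;
        // this placeholder only keeps the sketch self-contained.
        ratios.assign(moves.new_pos.size(), 1.0);
      }
    };

    // Per step: gather all quadrature points of all walkers, then one batched call.
    void evaluateNLPP(BatchedTWF& twf, const std::vector<VirtualMoves>& per_walker)
    {
      VirtualMoves all;
      for (const auto& w : per_walker)
      {
        all.walker_id.insert(all.walker_id.end(), w.walker_id.begin(), w.walker_id.end());
        all.electron_id.insert(all.electron_id.end(), w.electron_id.begin(), w.electron_id.end());
        all.new_pos.insert(all.new_pos.end(), w.new_pos.begin(), w.new_pos.end());
      }
      std::vector<double> ratios;
      twf.mw_ratios(all, ratios); // one call instead of ~35000 tiny ones
      // ... combine ratios with the quadrature weights to accumulate the NLPP energy ...
    }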

bottleneck 2

In ParticleSet::donePbyP, the structure factor calculation is slow. This has probably already been addressed by introducing MASSV; we need to make it work with the XL compiler. We can probably save 2 seconds.
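
As a reference for where donePbyP spends its time, here is a hedged sketch of the structure-factor accumulation. The names (kPoints, R, rhok_r/rhok_i) are illustrative rather than the exact QMCPACK data members; the inner sin/cos loop is where a vectorized sincos (MASS/MASSV with XL) pays off.

    #include <cmath>
    #include <vector>

    struct Pos { double x, y, z; };

    void computeRhok(const std::vector<Pos>& kPoints, // k-vectors
                     const std::vector<Pos>& R,       // electron positions
                     std::vector<double>& rhok_r,     // Re[rho(k)]
                     std::vector<double>& rhok_i)     // Im[rho(k)]
    {
      const int nk = static_cast<int>(kPoints.size());
      const int ne = static_cast<int>(R.size());
      rhok_r.assign(nk, 0.0);
      rhok_i.assign(nk, 0.0);
      std::vector<double> phase(nk);
      for (int i = 0; i < ne; ++i)
      {
        for (int k = 0; k < nk; ++k)
          phase[k] = kPoints[k].x * R[i].x + kPoints[k].y * R[i].y + kPoints[k].z * R[i].z;
        // Hot loop: ne * nk scalar sin/cos calls. Replacing these with a
        // vectorized sincos from MASS/MASSV (or letting the compiler vectorize
        // the loop) is what recovers the ~2 seconds.
        for (int k = 0; k < nk; ++k)
        {
          rhok_r[k] += std::cos(phase[k]);
          rhok_i[k] += std::sin(phase[k]);
        }
      }
    }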

bottleneck 3

At this small problem size, the distance table computation can probably be reduced by 2x even without offload, so we may gain ~1 second.

After solving bottlenecks 1-3, we should be able to run at least 2x faster than the current code, and gain further throughput by increasing the walker count. For small problems of this kind, I expect to run >50 walkers per GPU.

Changes needed for all problem sizes.

NLPP walker batching is needed for all problem sizes, even 512 atoms.

When solving the 512-atom case, there is only 1 walker per batch. To add more walker batching, the NLPP (Hamiltonian) evaluation needs to be taken out of the movePbyP crowd scope and batched over walkers at the population level. This requires a few steps:

  1. Split updateBuffer into evaluateGL and copyToBuffer. updateBuffer currently handles both the last bit of computation and the buffer handling; these roles must be separated.
  2. Move copyFromBuffer and copyToBuffer out of the crowd scope to the population scope. At the same time, move H.evaluate to the population scope. This should give sufficient computation when walker batching the NLPP (see the sketch after this list).
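
A rough C++ sketch of the restructured step follows. Crowd, Population, and the helper functions are stand-ins for the real driver classes, and the exact placement of the buffer calls is an assumption about the intent of the two steps above.

    #include <vector>

    struct Walker {};
    struct Crowd { std::vector<Walker*> walkers; };

    struct Population
    {
      std::vector<Crowd> crowds;
      std::vector<Walker*> all_walkers; // union of all crowds
    };

    // Step 1: updateBuffer is split so the last bit of computation (evaluateGL)
    // stays in the crowd scope while pure buffer handling can be hoisted out.
    void evaluateGL(Crowd&) { /* finish gradients/laplacians for the crowd */ }
    void copyToBuffer(std::vector<Walker*>&) { /* buffer handling only */ }
    void copyFromBuffer(std::vector<Walker*>&) { /* buffer handling only */ }
    void evaluateHamiltonian(std::vector<Walker*>&) { /* NLPP batched over all walkers */ }
    void movePbyP(Crowd&) { /* drift-diffuse moves, crowd scope */ }

    void advanceOneStep(Population& pop)
    {
      // population scope: restore per-walker wavefunction state from the buffers
      copyFromBuffer(pop.all_walkers);
      // crowd scope: particle-by-particle moves and the final evaluateGL
      for (auto& crowd : pop.crowds)
      {
        movePbyP(crowd);
        evaluateGL(crowd);
      }
      // population scope (step 2): buffer handling and H.evaluate see the whole
      // population at once, so NLPP walker batching has enough work even when
      // each crowd holds only a single walker.
      copyToBuffer(pop.all_walkers);
      evaluateHamiltonian(pop.all_walkers);
    }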

Distance table computation needs to be offloaded to the GPU.

With the PPPG boundary condition, the distance table computation is quite heavy in the 128- and 256-atom cells.
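
A hedged sketch of what an offloaded distance-table row kernel could look like with OpenMP target is below. The flattened data layout and the 27-image minimum-image search are simplifications for illustration, not the actual QMCPACK kernel.

    #include <cmath>

    // Distances from one moved electron at (px,py,pz) to n target particles,
    // minimum-image over the 27 neighbor cells of a general (PPPG-like) cell.
    // lat is the 3x3 lattice matrix stored row-major in 9 doubles.
    void computeDistRow(const double* lat,
                        const double* tx, const double* ty, const double* tz,
                        double px, double py, double pz, double* dist, int n)
    {
    #pragma omp target teams distribute parallel for \
        map(to : lat[:9], tx[:n], ty[:n], tz[:n]) map(from : dist[:n])
      for (int j = 0; j < n; ++j)
      {
        const double dx = tx[j] - px, dy = ty[j] - py, dz = tz[j] - pz;
        double r2min = 1e300;
        for (int i = -1; i <= 1; ++i)
          for (int k = -1; k <= 1; ++k)
            for (int l = -1; l <= 1; ++l)
            {
              const double sx = dx + i * lat[0] + k * lat[3] + l * lat[6];
              const double sy = dy + i * lat[1] + k * lat[4] + l * lat[7];
              const double sz = dz + i * lat[2] + k * lat[5] + l * lat[8];
              const double r2 = sx * sx + sy * sy + sz * sz;
              r2min = r2 < r2min ? r2 : r2min;
            }
        dist[j] = std::sqrt(r2min);
      }
    }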

2020-01-07 update: NLPP walker batching is completed and MASS is connected. Both bottlenecks 1 and 2 are resolved. This is the same run as 2) above on Summit.

    DMCBatched::RunSteps                         9.8830     0.2178             25       0.395320220
      DMCBatched::Hamiltonian                    3.9805     0.0002             25       0.159221144
        Hamiltonian::ElecElec                    0.2110     0.2110             25       0.008440138
        Hamiltonian::IonIon                      0.0000     0.0000             25       0.000001949
        Hamiltonian::Kinetic                     0.0002     0.0002             25       0.000006393
        Hamiltonian::LocalECP                    0.0453     0.0453             25       0.001812433
        Hamiltonian::NonLocalECP                 3.7239     0.0545             25       0.148954065
          ParticleSet::update                    1.8335     1.8335          33443       0.000054826
          WaveFunction::J1OrbitalSoA_NLratio     0.0695     0.0695           4673       0.000014876
          WaveFunction::J2OrbitalSoA_NLratio     1.0767     1.0767           4673       0.000230406
          WaveFunction::SlaterDet_NLratio        0.6896     0.0071           4673       0.000147569
            DiracDeterminantBase::ratio          0.0584     0.0584           4673       0.000012499
            DiracDeterminantBase::spoval         0.6241     0.6241           4673       0.000133552
      DMCBatched::MovePbyP                       5.5434     0.0590             25       0.221735064
        ParticleSet::acceptMove                  0.0292     0.0292          76710       0.000000381
        ParticleSet::computeNewPosDT             0.7353     0.7353           9600       0.000076589
        ParticleSet::donePbyP                    0.2715     0.2715            200       0.001357680
        WaveFunction::J1OrbitalSoA_VGL           0.0577     0.0577          19200       0.000003005
        WaveFunction::J1OrbitalSoA_accept        0.0114     0.0114           9800       0.000001159
        WaveFunction::J2OrbitalSoA_VGL           0.5101     0.5101          19200       0.000026568
        WaveFunction::J2OrbitalSoA_accept        0.6705     0.6705           9800       0.000068423
        WaveFunction::SlaterDet_VGL              2.1688     0.0481          19200       0.000112956
          DiracDeterminantBase::ratio            0.3256     0.3256         153600       0.000002120
          DiracDeterminantBase::spovgl           1.7950     1.7950           9600       0.000186982
        WaveFunction::SlaterDet_accept           1.0299     0.0309           9800       0.000105090
          DiracDeterminantBase::update           0.9989     0.9989          77110       0.000012955

NonLocalECP spoval now needs only 0.6241 seconds and ParticleSet::donePbyP goes down to 0.27 seconds.
