Performance analysis on small problem sizes part 2

Ye Luo edited this page Jul 14, 2020 · 5 revisions

Timing of the DMC unified driver with dmc-a32-e384-cpu-Clang-batch-w1344 on Summit, using 7 batches of 192 walkers each. Each Summit node runs 6 MPI ranks with 7 threads per rank.

    Timer                                     Inclusive  Exclusive          Calls         Time/call
    DMCBatched::RunSteps                       139.6967     1.3720             25       5.587869120
      DMCBatched::Hamiltonian                   76.8406     0.0103             25       3.073624086
        Hamiltonian::ElecElec                    4.8134     4.8134             25       0.192537746
        Hamiltonian::IonIon                      0.0009     0.0009             25       0.000035486
        Hamiltonian::Kinetic                     0.0048     0.0048             25       0.000191803
        Hamiltonian::LocalECP                    1.2211     1.2211             25       0.048843880
        Hamiltonian::NonLocalECP                70.7901     1.8457             25       2.831604805
          ParticleSet::update                   31.6759    31.6759           5308       0.005967586
          WaveFunction::J1OrbitalSoA_NLratio     3.3165     3.3165           5308       0.000624810
          WaveFunction::J2OrbitalSoA_NLratio    26.5605    26.5605           5308       0.005003858
          WaveFunction::SlaterDet_NLratio        7.3915     0.1761           5308       0.001392525
            DiracDeterminantBase::ratio          0.3734     0.3734           5308       0.000070344
            DiracDeterminantBase::spoval         6.8421     0.4537           5308       0.001289007
              SplineC2ROMP::offload              6.3883     6.3883           5308       0.001203532
      DMCBatched::MovePbyP                      59.5932     1.7929             25       2.383727245
        ParticleSet::acceptMove                  2.3101     2.3101        1840858       0.000001255
        ParticleSet::computeNewPosDT            13.6748    13.6748           9600       0.001424464
        ParticleSet::donePbyP                    3.9055     3.9055           4800       0.000813638
        WaveFunction::J1OrbitalSoA_VGL           3.4353     3.4353          19200       0.000178924
        WaveFunction::J1OrbitalSoA_accept        0.3011     0.3011           9625       0.000031284
        WaveFunction::J2OrbitalSoA_VGL           9.7676     9.7676          19200       0.000508727
        WaveFunction::J2OrbitalSoA_accept       12.5381    12.5381           9625       0.001302659
        WaveFunction::SlaterDet_VGL              5.0797     0.2779          19200       0.000264568
          DiracDeterminantBase::ratio            1.2942     1.2942           9600       0.000134814
          DiracDeterminantBase::spovgl           3.5076     0.2003           9600       0.000365374
            SplineC2ROMP::offload                3.3073     3.3073           9600       0.000344506
        WaveFunction::SlaterDet_accept           6.7881     0.2510           9625       0.000705258
          DiracDeterminantBase::update           1.1774     1.1774           9650       0.000122007
          DiracDeterminantBatched::D2H           5.3597     5.3597             50       0.107194562

ParticleSet::update and the WaveFunction::J2OrbitalSoA timers account for about 60% of the run time: 84.4 of the 139.7 seconds in total.
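
The share quoted above can be checked with a few lines of Python. This is only a sketch: the rows are copied by hand from the table, the helper names `parse_timers` and `share` are made up for illustration, and which timers went into the 84.4 s figure is an assumption. Summing ParticleSet::update and the three J2OrbitalSoA timers gives about 80.5 s (roughly 58%); adding ParticleSet::donePbyP gives about 84.4 s (roughly 60%).

```python
# Rows copied by hand from the timer table above:
# name, inclusive s, exclusive s, calls, s per call.
TIMER_TEXT = """
DMCBatched::RunSteps                       139.6967     1.3720             25       5.587869120
ParticleSet::update                         31.6759    31.6759           5308       0.005967586
WaveFunction::J2OrbitalSoA_NLratio          26.5605    26.5605           5308       0.005003858
ParticleSet::donePbyP                        3.9055     3.9055           4800       0.000813638
WaveFunction::J2OrbitalSoA_VGL               9.7676     9.7676          19200       0.000508727
WaveFunction::J2OrbitalSoA_accept           12.5381    12.5381           9625       0.001302659
"""


def parse_timers(text):
    """Return a list of (timer name, inclusive seconds) pairs."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5:  # skip blank or malformed lines
            rows.append((fields[0], float(fields[1])))
    return rows


def share(rows, total_name, prefixes):
    """Sum the timers whose names start with any prefix; return (seconds, fraction of total)."""
    total = dict(rows)[total_name]
    selected = sum(t for name, t in rows if name.startswith(tuple(prefixes)))
    return selected, selected / total


rows = parse_timers(TIMER_TEXT)

# Assumed selection 1: only the timers named in the sentence above.
named = ["ParticleSet::update", "WaveFunction::J2OrbitalSoA"]
# Assumed selection 2: the same timers plus ParticleSet::donePbyP.
with_done = named + ["ParticleSet::donePbyP"]

for label, prefixes in [("named timers", named), ("plus donePbyP", with_done)]:
    seconds, fraction = share(rows, "DMCBatched::RunSteps", prefixes)
    print(f"{label:14s} {seconds:8.4f} s  {100 * fraction:5.1f}%")
```

The prefix match is deliberate so that `WaveFunction::J2OrbitalSoA` picks up the `_NLratio`, `_VGL` and `_accept` timers together.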

Timing of the DMC unified driver with dmc-a32-e384-cpu-Clang-batch-w1344 on Summit, run without the e-e distance table, using 7 batches of 192 walkers each. Each Summit node runs 6 MPI ranks with 7 threads per rank.

    Timer                                     Inclusive  Exclusive          Calls         Time/call
    DMCBatched::RunSteps                        84.0625     1.1251             25       3.362500429
      DMCBatched::Hamiltonian                   50.4087     0.0099             25       2.016346045
        Hamiltonian::IonIon                      0.0008     0.0008             25       0.000032129
        Hamiltonian::Kinetic                     0.0053     0.0053             25       0.000211229
        Hamiltonian::LocalECP                    1.2310     1.2310             25       0.049241915
        Hamiltonian::NonLocalECP                49.1617     1.7872             25       1.966466036
          ParticleSet::update                    3.8670     3.8670           6532       0.000592009
          WaveFunction::J1OrbitalSoA_NLratio     2.9536     2.9536           6532       0.000452171
          WaveFunction::SlaterDet_NLratio       40.5539     0.1461           6532       0.006208494
            DiracDeterminantBase::ratio          0.3096     0.3096           6532       0.000047397
            DiracDeterminantBase::spoval        40.0982     0.6757           6532       0.006138737
              SplineC2ROMP::offload             39.4225    39.4225           6532       0.006035293
      DMCBatched::MovePbyP                      29.7778     1.1524             25       1.191113958
        ParticleSet::acceptMove                  0.3582     0.3582        1839049       0.000000195
        ParticleSet::computeNewPosDT             1.1062     1.1062           9600       0.000115233
        ParticleSet::donePbyP                    3.9309     3.9309           4800       0.000818944
        WaveFunction::J1OrbitalSoA_VGL           1.6935     1.6935          19200       0.000088204
        WaveFunction::J1OrbitalSoA_accept        0.1444     0.1444           9625       0.000014999
        WaveFunction::SlaterDet_VGL             14.9466     0.2291          19200       0.000778468
          DiracDeterminantBase::ratio            6.5763     6.5763           9600       0.000685036
          DiracDeterminantBase::spovgl           8.1412     0.1294           9600       0.000848041
            SplineC2ROMP::offload                8.0118     8.0118           9600       0.000834567
        WaveFunction::SlaterDet_accept           6.4456     0.1648           9625       0.000669676
          DiracDeterminantBase::update           0.9459     0.9459           9650       0.000098017
          DiracDeterminantBatched::D2H           5.3350     5.3350             50       0.106699076
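
The two runs can be compared timer by timer. The sketch below assumes a handful of inclusive times copied by hand from the two tables; the dictionary names and the `compare` helper are hypothetical, and timers that appear in only one run (such as Hamiltonian::ElecElec) are skipped.

```python
# Inclusive times in seconds, copied by hand from the two tables above.
with_ee_table = {
    "DMCBatched::RunSteps": 139.6967,
    "DMCBatched::Hamiltonian": 76.8406,
    "Hamiltonian::ElecElec": 4.8134,
    "Hamiltonian::NonLocalECP": 70.7901,
    "DMCBatched::MovePbyP": 59.5932,
    "ParticleSet::computeNewPosDT": 13.6748,
}
without_ee_table = {
    "DMCBatched::RunSteps": 84.0625,
    "DMCBatched::Hamiltonian": 50.4087,
    "Hamiltonian::NonLocalECP": 49.1617,
    "DMCBatched::MovePbyP": 29.7778,
    "ParticleSet::computeNewPosDT": 1.1062,
}


def compare(before, after):
    """Return (name, before, after, delta, ratio) for timers present in both runs."""
    rows = []
    for name, t_before in before.items():
        if name in after:
            t_after = after[name]
            rows.append((name, t_before, t_after, t_after - t_before, t_after / t_before))
    return rows


for name, b, a, delta, ratio in compare(with_ee_table, without_ee_table):
    print(f"{name:30s} {b:9.4f} {a:9.4f} {delta:+9.4f} {ratio:6.2f}x")
```

A negative delta means the timer took less time in the run without the e-e distance table. The call counts differ between the two runs for some timers, so a per-call comparison would need the last column of each table rather than the inclusive totals used here.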