Delayed update on CPU #1170

Merged
merged 91 commits into from
Nov 28, 2018
Changes from 84 commits
Commits
3a763cc
Implement delayed update.
ye-luo Oct 30, 2017
dc376ea
Use FP for matrix inversion + a bit cleaning.
ye-luo Oct 30, 2017
2b435f3
Engage timer in restore for a reject move.
ye-luo Oct 30, 2017
1912b88
Merge branch 'remove-ParticleBase' into delayed-update
ye-luo Oct 31, 2017
573d513
Merge branch 'remove-ParticleBase' into delayed-update
ye-luo Nov 2, 2017
8eaa60f
add delay switch in acceptMove for Tmove.
ye-luo Nov 3, 2017
2c52eb3
Add completeUpdates to deplete delayed updates.
ye-luo Nov 3, 2017
ebc4df7
Merge branch 'remove-ParticleBase' into delayed-update
ye-luo Nov 7, 2017
896dad4
Merge branch 'rebuild-master' into delayed-update
ye-luo Nov 11, 2017
b5b9da4
Complete adding completeUpdates.
ye-luo Nov 11, 2017
ce052c0
Old way is better for recomputing.
ye-luo Nov 11, 2017
93ca05f
Merge branch 'rebuild-master' into delayed-update
ye-luo Nov 15, 2017
13ee23d
Merge branch 'rebuild-master' into delayed-update
ye-luo Nov 15, 2017
fb2265e
Timer not needed for reject.
ye-luo Nov 16, 2017
fcdf409
Need to call completeUpdates at every substep.
ye-luo Nov 17, 2017
dc8e596
Merge branch 'rebuild-master' into delayed-update
ye-luo Nov 19, 2017
44fb110
Merge branch 'rebuild-master' into delayed-update
ye-luo Nov 19, 2017
a4288bd
Change acceptMove calls and add PS's name.
jngkim Nov 21, 2017
c388e6e
Revert adding bool to acceptMove.
ye-luo Nov 21, 2017
79aed0b
Minor changes.
ye-luo Nov 21, 2017
78bbb9f
curRatio cannot be used affter acceptMove.
ye-luo Nov 21, 2017
605f4c4
Make one electron update brief.
ye-luo Nov 21, 2017
a5869b5
Add a unit test for delayed update.
ye-luo Nov 21, 2017
af40516
Fix cplx MP build.
ye-luo Nov 26, 2017
1ee1047
Add an miniapp to test delayed update using DiracMatrix.
jngkim Dec 1, 2017
1e99df2
Correct typo.
ye-luo Dec 1, 2017
b244754
Add the last accept call.
ye-luo Dec 1, 2017
1aa3dba
Printout more for post processing.
jngkim Dec 1, 2017
e4c5666
Add timer on the last accept call.
ye-luo Dec 1, 2017
3233af2
Use psiM0 for debugging.
jngkim Dec 1, 2017
7e5bc21
Merge branch 'delayed-update' of xgitlab.cels.anl.gov:QMCPACK/qmcpack…
ye-luo Dec 1, 2017
8692986
Remove potential side effect.
ye-luo Dec 2, 2017
b33ae84
Revert to build det-delay with the given precision at the build time.
jngkim Dec 4, 2017
741983e
Add complex in delayed_update miniapp.
ye-luo Dec 15, 2017
7e7ca37
Move delay_rank tag to slaterdeterminant.
ye-luo Dec 22, 2017
1fecebe
Merge branch 'rebuild-master' into delayed-update-merge
ye-luo Jan 9, 2018
624383e
Fix Tmove v1 with delayed update.
ye-luo Jan 9, 2018
7b80fbe
Change the layout of tempMat for better perf.
ye-luo Jan 23, 2018
fc8525b
Merge branch 'rebuild-master' into delayed-update-merge
ye-luo Jan 25, 2018
838741c
Merge branch 'rebuild-master' into delayed-update-merge
ye-luo Feb 17, 2018
3c05a6f
Merge branch 'rebuild-master' into delayed-update-merge
ye-luo Apr 14, 2018
dacb4d9
Fix build and a bit cleaning.
ye-luo Apr 14, 2018
c63d5fc
Merge branch 'rebuild-master' into delayed-update-merge
ye-luo Aug 2, 2018
309c05a
Fix compilation error.
ye-luo Aug 9, 2018
4b558dc
Replace an unnecessary gemm with gemv.
ye-luo Aug 9, 2018
e2fd17e
Merge branch 'rebuild-master' into delayed-update-merge
ye-luo Aug 14, 2018
faffc5d
More compact expression.
ye-luo Sep 13, 2018
be94f9e
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Oct 10, 2018
11ff99a
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Oct 12, 2018
e7e17be
Fix Sandbox build without MPI.
ye-luo Oct 12, 2018
c9c75e3
Rename variables to be consistent with paper.
ye-luo Oct 25, 2018
f50abb7
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Oct 30, 2018
e8da159
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Nov 2, 2018
f5e55ab
new_AinvRow reuses existing memory.
ye-luo Nov 4, 2018
aa6686e
Remove unused dirac_computeGL.h
ye-luo Nov 4, 2018
cc7b680
Remove ifdef MIXED_PRECISION in DiracDeterminant.
ye-luo Nov 5, 2018
b807953
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Nov 5, 2018
05bbafb
Merge branch 'delayed-update-merge' of https://xgitlab.cels.anl.gov/Q…
ye-luo Nov 5, 2018
1a0b00e
Remove unused work vector in DiracDeterminant.h
ye-luo Nov 5, 2018
5795700
Concentrate determinant update routines.
ye-luo Nov 7, 2018
5d01301
Fix sandbox miniapp.
ye-luo Nov 7, 2018
7374f81
Put back ifdef MIXED_PRECISION
ye-luo Nov 7, 2018
8b5b38a
Rename delayedEng to udpateEng
ye-luo Nov 7, 2018
3a98f01
A small tweak
ye-luo Nov 7, 2018
46b78b7
Add delayed update integration test.
ye-luo Nov 7, 2018
398836c
Minor change.
ye-luo Nov 8, 2018
292775c
Remove commented code.
ye-luo Nov 8, 2018
1720515
Separate DelayedUpdate.h from DiracMatrix.h
ye-luo Nov 8, 2018
537e235
Remove VLA
ye-luo Nov 10, 2018
7ff4551
Replace direct inversion with a recursive one.
ye-luo Nov 10, 2018
106de18
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Nov 10, 2018
6aaff04
Replace CONSTEXPR with const in DelayedUpdate.h
ye-luo Nov 10, 2018
a3c62b3
Minor tweak.
ye-luo Nov 12, 2018
1445dec
Remove SM-1 codepath, now part of DU.
ye-luo Nov 15, 2018
bb304da
Improve safe-guard
ye-luo Nov 15, 2018
3f86cf8
Make ratioGrad more safe.
ye-luo Nov 15, 2018
b1d6ce7
Minor change.
ye-luo Nov 15, 2018
b7087e1
Add performance tests with delayed updates.
ye-luo Nov 15, 2018
489cbb2
Correct cmake printing.
ye-luo Nov 15, 2018
04f00d5
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Nov 20, 2018
27963ac
Remove unnecessary copy.
ye-luo Nov 21, 2018
d82229f
Add manual.
ye-luo Nov 25, 2018
e318bc4
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Nov 25, 2018
a1f1760
Remove unused argument in completeUpdates.
ye-luo Nov 25, 2018
19289dd
Change struct to class.
ye-luo Nov 26, 2018
2415dac
Add references and tweak words in manual.
ye-luo Nov 26, 2018
2ecd856
More elaborate error message.
ye-luo Nov 26, 2018
7dc2cc6
Update documentation in DelayedUpdate and its test.
ye-luo Nov 27, 2018
7b8c6cc
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Nov 27, 2018
3b5b236
Update comment.
ye-luo Nov 27, 2018
95ec0ee
Merge remote-tracking branch 'github/develop' into delayed-update-merge
ye-luo Nov 27, 2018
1 change: 1 addition & 0 deletions manual/qmcpack_manual.tex
Expand Up @@ -178,6 +178,7 @@ \chapter{Specifying the system to be simulated}

\chapter{Trial wavefunction specification}
\input{intro_wavefunction}
\input{singledeterminant}
\input{spo}
\input{jastrow}
\input{multideterminants}
Expand Down
52 changes: 52 additions & 0 deletions manual/singledeterminant.tex
@@ -0,0 +1,52 @@
\section{Single determinant wavefunctions}
\label{sec:singledeterminant}
Placing a single determinant for each spin is the most widely used ansatz for the antisymmetric part of a trial wavefunction.
The input xml block for \texttt{slaterdeterminant} is given in Listing~\ref{listing:singledet}. A list of options is given in
Table~\ref{table:singledet}.

\begin{table}[h]
\begin{center}
\begin{tabularx}{\textwidth}{l l l l l l }
\hline
\multicolumn{6}{l}{\texttt{slaterdeterminant} element} \\
\hline
\multicolumn{2}{l}{parent elements:} & \multicolumn{4}{l}{\texttt{determinantset}}\\
\multicolumn{2}{l}{child elements:} & \multicolumn{4}{l}{\texttt{determinant}}\\
\multicolumn{2}{l}{attribute :} & \multicolumn{4}{l}{}\\
& \bfseries name & \bfseries datatype & \bfseries values & \bfseries default & \bfseries description \\
& \texttt{delay\_rank} & integer & $>0$ & 1 & The number of delayed updates. \\
& \texttt{optimize} & text & yes/no & yes & Enable orbital optimization. \\
\hline
\end{tabularx}
\end{center}
\caption{Options for the \texttt{slaterdeterminant} xml-block.}
\label{table:singledet}
\end{table}

\begin{lstlisting}[caption=slaterdeterminant set XML element.\label{listing:singledet}]
<slaterdeterminant delay_rank="32">
<determinant id="updet" size="208">
<occupation mode="ground" spindataset="0">
</occupation>
</determinant>
<determinant id="downdet" size="208">
<occupation mode="ground" spindataset="0">
</occupation>
</determinant>
</slaterdeterminant>
\end{lstlisting}

Additional information:
\begin{itemize}
\item \texttt{delay\_rank}. This option enables delayed updates of the Slater matrix inverse when particle-by-particle moves are used.
By default, \texttt{delay\_rank=1} uses Fahy's variant of the Sherman-Morrison rank-1 update, which mostly relies on memory-bandwidth-bound BLAS-2 calls.
With \texttt{delay\_rank>1}, the delayed update algorithm turns most of the computation into compute-bound BLAS-3 calls.
Tuning this parameter is highly recommended to obtain the best performance on medium to large problem sizes ($>200$ electrons).
We have seen up to an order of magnitude speed-up on large problem sizes.
When studying the performance of QMCPACK, a scan of this parameter is required, and we recommend starting from 32.
The \texttt{delay\_rank} giving the maximal speed-up depends on the problem size.
Usually, a larger problem size calls for a larger \texttt{delay\_rank}.
On CPUs, \texttt{delay\_rank} must be chosen as a multiple of the SIMD vector length. The best \texttt{delay\_rank} also depends on the processor microarchitecture.
GPU support is currently under development.
\end{itemize}
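
To make the `delay_rank=1` case in the manual text above concrete, here is a minimal sketch of a textbook Sherman-Morrison row replacement on a small dense matrix. It is not the QMCPACK implementation and not Fahy's variant: the names `Mat`, `rowReplaceRatio`, and `rowReplaceUpdate` are hypothetical, the inverse is stored untransposed, and plain loops stand in for the BLAS-2 calls.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix type for illustration only (row-major, square).
using Mat = std::vector<std::vector<double>>;

// Ratio det(A') / det(A) when row k of A is replaced by v, given Ainv = A^{-1}.
// Only column k of the inverse is needed: ratio = sum_j v[j] * Ainv[j][k].
double rowReplaceRatio(const Mat& Ainv, const std::vector<double>& v, std::size_t k)
{
  double ratio = 0.0;
  for (std::size_t j = 0; j < v.size(); ++j)
    ratio += v[j] * Ainv[j][k];
  return ratio;
}

// In-place rank-1 (Sherman-Morrison) update of Ainv after the replacement
// of row k by v has been accepted.
void rowReplaceUpdate(Mat& Ainv, const std::vector<double>& v, std::size_t k)
{
  const std::size_t n = v.size();
  const double ratio  = rowReplaceRatio(Ainv, v, k);
  assert(std::fabs(ratio) > 0.0); // a zero ratio means A' is singular

  // w = v^T Ainv - e_k^T
  std::vector<double> w(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      w[j] += v[i] * Ainv[i][j];
  w[k] -= 1.0;

  // Copy column k first because the loop below overwrites it.
  std::vector<double> col(n);
  for (std::size_t i = 0; i < n; ++i)
    col[i] = Ainv[i][k];

  // Ainv <- Ainv - col * w^T / ratio
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      Ainv[i][j] -= col[i] * w[j] / ratio;
}
```

In this sketch every accepted move touches all n*n entries of the inverse once, which is the memory-bound pattern that the manual text attributes to the rank-1 path.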

2 changes: 2 additions & 0 deletions src/QMCDrivers/CorrelatedSampling/CSVMCUpdatePbyP.cpp
Expand Up @@ -106,6 +106,8 @@ void CSVMCUpdatePbyP::advanceWalker(Walker_t& thisWalker, bool recompute)
{
++nAllRejected;
}
for(int ipsi=0; ipsi<nPsi; ipsi++)
Psi1[ipsi]->completeUpdates();
}
// myTimers[1]->stop();
// myTimers[2]->start();
Expand Down
2 changes: 1 addition & 1 deletion src/QMCDrivers/DMC/DMCUpdatePbyPFast.cpp
Expand Up @@ -133,7 +133,7 @@ void DMCUpdatePbyPWithRejectionFast::advanceWalker(Walker_t& thisWalker, bool re
}
}
}

Psi.completeUpdates();
W.donePbyP();
myTimers[DMC_movePbyP]->stop();

Expand Down
2 changes: 2 additions & 0 deletions src/QMCDrivers/RMC/RMCUpdatePbyP.cpp
Expand Up @@ -211,6 +211,7 @@ namespace qmcplusplus
}
}
myTimers[1]->stop ();
Psi.completeUpdates();
W.donePbyP();

if (nAcceptTemp > 0)
Expand Down Expand Up @@ -344,6 +345,7 @@ namespace qmcplusplus
}
}
myTimers[1]->stop ();
Psi.completeUpdates();
W.donePbyP();
// In the rare case that all proposed moves fail, we bounce.
if (nAcceptTemp == 0)
Expand Down
1 change: 1 addition & 0 deletions src/QMCDrivers/VMC/VMCUpdatePbyP.cpp
Expand Up @@ -117,6 +117,7 @@ void VMCUpdatePbyP::advanceWalker(Walker_t& thisWalker, bool recompute)
}
}
}
Psi.completeUpdates();
}
W.donePbyP();
myTimers[1]->stop();
Expand Down
4 changes: 4 additions & 0 deletions src/QMCHamiltonians/NonLocalECPotential.cpp
Expand Up @@ -313,6 +313,10 @@ NonLocalECPotential::makeNonLocalMovesPbyP(ParticleSet& P)
}
}
}

if(NonLocalMoveAccepted>0)
Psi.completeUpdates();

return NonLocalMoveAccepted;
}

Expand Down
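
The driver hunks above all add a `completeUpdates()` call once the particle-by-particle loop is finished. A rough sketch of why that call is needed follows. It models only the bookkeeping, with no linear algebra, and `ToyDelayedDeterminant` and `sweep` are hypothetical names that do not exist in QMCPACK. The assumption, taken from `acceptRow` in `DelayedUpdate.h`, is that accepted moves are queued and only applied automatically when the queue reaches the delay rank.

```cpp
#include <cassert>

// Hypothetical stand-in for a determinant that delays its inverse updates.
class ToyDelayedDeterminant
{
public:
  explicit ToyDelayedDeterminant(int delay_rank) : delay_rank_(delay_rank)
  {
    assert(delay_rank > 0);
  }

  // Queue one accepted move; apply the queue automatically once it is full.
  void acceptMove()
  {
    ++pending_;
    if (pending_ == delay_rank_)
      flush();
  }

  // Apply whatever is still queued so the stored inverse is up to date.
  void completeUpdates()
  {
    if (pending_ > 0)
      flush();
  }

  int pending() const { return pending_; }
  int flushes() const { return flushes_; }

private:
  // Stands in for the blocked update of the full inverse matrix.
  void flush()
  {
    ++flushes_;
    pending_ = 0;
  }

  int delay_rank_;
  int pending_ = 0;
  int flushes_ = 0;
};

// One particle-by-particle sweep: accept some moves, then complete the updates.
inline void sweep(ToyDelayedDeterminant& det, int accepted_moves)
{
  for (int i = 0; i < accepted_moves; ++i)
    det.acceptMove();
  det.completeUpdates();
}
```

With a delay rank of 4 and 10 accepted moves, two moves would still be queued at the end of the loop; without the final call, any code that reads the full inverse afterwards would see stale data.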
158 changes: 158 additions & 0 deletions src/QMCWaveFunctions/Fermion/DelayedUpdate.h
@@ -0,0 +1,158 @@
//////////////////////////////////////////////////////////////////////////////////////
// This file is distributed under the University of Illinois/NCSA Open Source License.
// See LICENSE file in top directory for details.
//
// Copyright (c) 2017 QMCPACK developers.
//
// File developed by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory
//
// File created by: Ye Luo, yeluo@anl.gov, Argonne National Laboratory
//////////////////////////////////////////////////////////////////////////////////////

#ifndef QMCPLUSPLUS_DELAYED_UPDATE_H
#define QMCPLUSPLUS_DELAYED_UPDATE_H

#include "Numerics/Blasf.h"
#include <OhmmsPETE/OhmmsVector.h>
#include <OhmmsPETE/OhmmsMatrix.h>
#include <simd/simd.hpp>

namespace qmcplusplus {

template<typename T>
struct DelayedUpdate
{
Matrix<T> U, V, Binv, tempMat;
// temporary scratch space used by the rank-1 Sherman-Morrison (SM-1) update
Vector<T> temp;
// auxiliary arrays for B
Vector<T> p;
std::vector<int> delay_list;
int delay_count;

const T* Ainv_row_ptr;
// electron id of the up-to-date Ainv_row
int Ainv_row_ind;
T curRatio;

DelayedUpdate(): delay_count(0), Ainv_row_ptr(nullptr), Ainv_row_ind(-1) {}

///resize the internal storage, 0<delay<=norb
inline void resize(int norb, int delay)
{
V.resize(delay, norb);
U.resize(delay, norb);
p.resize(delay);
temp.resize(norb);
tempMat.resize(norb, delay);
Binv.resize(delay, delay);
delay_list.resize(delay);
}

inline void getInvRow(const Matrix<T>& Ainv, int rowchanged)
{
Ainv_row_ind = rowchanged;
if ( delay_count == 0 )
{
Ainv_row_ptr = Ainv[rowchanged];
return;
}
const T cone(1);
const T czero(0);
const T* AinvRow = Ainv[rowchanged];
const int norb = Ainv.rows();
const int lda_Binv = Binv.cols();
// save AinvRow to V[delay_count], which will hold the up-to-date row
simd::copy_n(AinvRow, norb, V[delay_count]);
// multiply V (NxK) Binv (KxK) U (KxN) AinvRow from right to left
BLAS::gemv('T', norb, delay_count, cone, U.data(), norb, AinvRow, 1, czero, p.data(), 1);
BLAS::gemv('N', delay_count, delay_count, cone, Binv.data(), lda_Binv, p.data(), 1, czero, Binv[delay_count], 1);
BLAS::gemv('N', norb, delay_count, -cone, V.data(), norb, Binv[delay_count], 1, cone, V[delay_count], 1);
Ainv_row_ptr = V[delay_count];
}

template<typename VVT>
inline T ratio(const Matrix<T>& Ainv, int rowchanged, const VVT& psiV)
{
getInvRow(Ainv, rowchanged);
return curRatio = simd::dot(Ainv_row_ptr,psiV.data(),Ainv.cols());
}

template<typename GT>
inline GT evalGrad(const Matrix<T>& Ainv, int rowchanged, const GT* dpsiV)
{
getInvRow(Ainv, rowchanged);
return simd::dot(Ainv_row_ptr,dpsiV,Ainv.cols());
}

template<typename VVT, typename GGT, typename GT>
inline T ratioGrad(const Matrix<T>& Ainv, int rowchanged, const VVT& psiV, const GGT& dpsiV, GT& g)
{
if(Ainv_row_ind != rowchanged)
getInvRow(Ainv, rowchanged);
g = simd::dot(Ainv_row_ptr,dpsiV.data(),Ainv.cols());
return curRatio = simd::dot(Ainv_row_ptr,psiV.data(),Ainv.cols());
}

// accept with the update delayed
template<typename VVT>
inline void acceptRow(Matrix<T>& Ainv, int rowchanged, const VVT& psiV)
{
// safeguard: invalidate the cached row index
Ainv_row_ind = -1;

const T cminusone(-1);
const T czero(0);
const int norb = Ainv.rows();
const int lda_Binv = Binv.cols();
simd::copy_n(Ainv[rowchanged], norb, V[delay_count]);
simd::copy_n(psiV.data(), norb, U[delay_count]);
delay_list[delay_count] = rowchanged;
// the new Binv is [[X Y] [Z x]]
BLAS::gemv('T', norb, delay_count+1, cminusone, V.data(), norb, psiV.data(), 1, czero, p.data(), 1);
// x
T y = -p[delay_count];
for(int i=0; i<delay_count; i++)
y += Binv[delay_count][i] * p[i];
Binv[delay_count][delay_count] = y = T(1) / y;
// Y
BLAS::gemv('T', delay_count, delay_count, y, Binv.data(), lda_Binv, p.data(), 1, czero, Binv.data()+delay_count, lda_Binv);
// X
BLAS::ger(delay_count, delay_count, cminusone, Binv[delay_count], 1, Binv.data()+delay_count, lda_Binv, Binv.data(), lda_Binv);
// Z
for(int i=0; i<delay_count; i++)
Binv[delay_count][i] *= -y;
delay_count++;
if(delay_count==lda_Binv) updateInvMat(Ainv);
}

inline void updateInvMat(Matrix<T>& Ainv)
{
if(delay_count==0) return;
// update the inverse matrix
const T cone(1);
const T czero(0);
const int norb=Ainv.rows();
if(delay_count==1)
{
// rank-1 case: temp serves as a temporary array of norb elements
BLAS::gemv('T', norb, norb, cone, Ainv.data(), norb, U[0], 1, czero, temp.data(), 1);
temp[delay_list[0]] -= cone;
BLAS::ger(norb,norb,-Binv[0][0],V[0],1,temp.data(),1,Ainv.data(),norb);
}
else
{
const int lda_Binv=Binv.cols();
BLAS::gemm('T', 'N', delay_count, norb, norb, cone, U.data(), norb, Ainv.data(), norb, czero, tempMat.data(), lda_Binv);
for(int i=0; i<delay_count; i++) tempMat(delay_list[i], i) -= cone;
BLAS::gemm('N', 'N', norb, delay_count, delay_count, cone, V.data(), norb, Binv.data(), lda_Binv, czero, U.data(), norb);
BLAS::gemm('N', 'N', norb, norb, delay_count, -cone, U.data(), norb, tempMat.data(), lda_Binv, cone, Ainv.data(), norb);
}
delay_count = 0;
Ainv_row_ind = -1;
}
};
}

#endif // QMCPLUSPLUS_DELAYED_UPDATE_H
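
The comment "the new Binv is [[X Y] [Z x]]" in `acceptRow` refers to growing the small inverse by one row and one column per accepted move instead of re-inverting it from scratch. The sketch below shows the textbook bordered-inverse (Schur complement) formula that this kind of recursion rests on. It is an illustration under assumptions: `Mat` and `borderedInverse` are hypothetical names, plain loops replace the BLAS calls, and the sign and storage conventions of the PR code may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix type for illustration only (row-major, square).
using Mat = std::vector<std::vector<double>>;

// Given Binv = B^{-1} (k x k), return the inverse of the (k+1) x (k+1) matrix
//   [[B,   c],
//    [r^T, d]]
// as [[X, Y], [Z, x]] with x = 1 / (d - r^T Binv c).
Mat borderedInverse(const Mat& Binv, const std::vector<double>& c, const std::vector<double>& r, double d)
{
  const std::size_t k = Binv.size();
  assert(c.size() == k && r.size() == k);

  // Bc = Binv * c and rB = r^T * Binv
  std::vector<double> Bc(k, 0.0), rB(k, 0.0);
  for (std::size_t i = 0; i < k; ++i)
    for (std::size_t j = 0; j < k; ++j)
    {
      Bc[i] += Binv[i][j] * c[j];
      rB[j] += r[i] * Binv[i][j];
    }

  // Schur complement of B
  double s = d;
  for (std::size_t i = 0; i < k; ++i)
    s -= r[i] * Bc[i];
  assert(std::fabs(s) > 0.0); // a zero Schur complement means the bordered matrix is singular
  const double x = 1.0 / s;

  Mat out(k + 1, std::vector<double>(k + 1, 0.0));
  for (std::size_t i = 0; i < k; ++i)
  {
    for (std::size_t j = 0; j < k; ++j)
      out[i][j] = Binv[i][j] + x * Bc[i] * rB[j]; // X
    out[i][k] = -x * Bc[i];                       // Y
    out[k][i] = -x * rB[i];                       // Z
  }
  out[k][k] = x;
  return out;
}
```

Each step costs O(k^2) work for a k x k block, so building the inverse up over a full delay window stays cheap next to the final blocked update of the large matrix.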

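
As a companion to `updateInvMat`, the following sketch applies several queued row replacements to an inverse in one blocked step using the Woodbury identity. It is a simplified model, not a transcription of the header above: `Mat`, `invertSmall`, and `DelayedRowUpdates` are hypothetical names, the inverse is stored untransposed, the small matrix is inverted directly by Gauss-Jordan elimination rather than recursively, and the queued rows are assumed to be distinct.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Mat = std::vector<std::vector<double>>; // hypothetical row-major matrix

// Gauss-Jordan inverse with partial pivoting, meant for the small k x k block.
Mat invertSmall(Mat S)
{
  const std::size_t k = S.size();
  Mat inv(k, std::vector<double>(k, 0.0));
  for (std::size_t i = 0; i < k; ++i)
    inv[i][i] = 1.0;
  for (std::size_t c = 0; c < k; ++c)
  {
    std::size_t piv = c;
    for (std::size_t r = c + 1; r < k; ++r)
      if (std::fabs(S[r][c]) > std::fabs(S[piv][c]))
        piv = r;
    assert(std::fabs(S[piv][c]) > 0.0); // singular block
    std::swap(S[c], S[piv]);
    std::swap(inv[c], inv[piv]);
    const double d = S[c][c];
    for (std::size_t j = 0; j < k; ++j)
    {
      S[c][j] /= d;
      inv[c][j] /= d;
    }
    for (std::size_t r = 0; r < k; ++r)
    {
      if (r == c)
        continue;
      const double f = S[r][c];
      for (std::size_t j = 0; j < k; ++j)
      {
        S[r][j] -= f * S[c][j];
        inv[r][j] -= f * inv[c][j];
      }
    }
  }
  return inv;
}

// Queue of accepted row replacements, applied to Ainv in a single blocked step.
struct DelayedRowUpdates
{
  std::vector<std::size_t> rows; // replaced row indices (assumed distinct)
  Mat newRows;                   // the replacement rows

  void accept(std::size_t k, const std::vector<double>& v)
  {
    rows.push_back(k);
    newRows.push_back(v);
  }

  // Ainv <- Ainv - Ainv[:, rows] * S^{-1} * (V Ainv - E^T), with S = (V Ainv)[:, rows]
  void flush(Mat& Ainv)
  {
    const std::size_t k = rows.size(), n = Ainv.size();
    if (k == 0)
      return;
    Mat W(k, std::vector<double>(n, 0.0)); // W = V * Ainv
    for (std::size_t a = 0; a < k; ++a)
      for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
          W[a][j] += newRows[a][i] * Ainv[i][j];
    Mat S(k, std::vector<double>(k)), C(n, std::vector<double>(k));
    for (std::size_t a = 0; a < k; ++a)
      for (std::size_t b = 0; b < k; ++b)
        S[a][b] = W[a][rows[b]];
    const Mat Sinv = invertSmall(S);
    for (std::size_t a = 0; a < k; ++a)
      W[a][rows[a]] -= 1.0; // W <- V Ainv - E^T
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t b = 0; b < k; ++b)
        C[i][b] = Ainv[i][rows[b]]; // copy columns before Ainv is modified
    Mat T(k, std::vector<double>(n, 0.0)); // T = Sinv * W
    for (std::size_t a = 0; a < k; ++a)
      for (std::size_t b = 0; b < k; ++b)
        for (std::size_t j = 0; j < n; ++j)
          T[a][j] += Sinv[a][b] * W[b][j];
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t a = 0; a < k; ++a)
        for (std::size_t j = 0; j < n; ++j)
          Ainv[i][j] -= C[i][a] * T[a][j];
    rows.clear();
    newRows.clear();
  }
};
```

The three matrix products in `flush` are the kind of work that maps onto BLAS-3 `gemm` calls in a real implementation. The sketch omits computing ratios while updates are still pending, which the header handles in `getInvRow`.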