Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delayed update on CPU #1170

Merged
merged 91 commits into from
Nov 28, 2018
Merged

Delayed update on CPU #1170

merged 91 commits into from
Nov 28, 2018

Conversation

ye-luo
Copy link
Contributor

@ye-luo ye-luo commented Nov 25, 2018

Closes #1152
This feature can be enabled with all the QMC methods implemented in QMCPACK
via <slaterdeterminant delay_rank="32">

Manual, unit test and integration tests are added. The manual needs further update to add references.

Performance tests are set with delay_rank="32". Currently the input files are generated on the fly.
If they needed to be stored in the repo, I will add them after the review in order to reduce the amount of files.

ye-luo and others added 30 commits October 30, 2017 00:08
Fix VMC and DMC
TODO fix RMC and CorrelatedSampling.
Options: -d -n num_els -k delay -i iteration
@ghost ghost assigned ye-luo Nov 25, 2018
@ghost ghost added the in progress label Nov 25, 2018
Copy link
Contributor

@prckent prckent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

Some quick and incomplete feedback:

  1. Structs with functionality should be classes
  2. Code disabled behind preprocessor macros should be either unified behind an DELAYED_DEBUG setting or removed. ( We should have a note in the coding standards about this. )
  3. Tests should include delays that are not a factor of the electron count. EDIT: I see this is done. Perhaps a delay of 3 so that there are 2 updates and one partial in the diamond 2x2x1 case?
  4. The documentation should reference the delayed update and Fahy papers, where mentioned. There is also a statement that the delay should be a multiple of the SIMD length; I think this is only a performance recommendation.
  5. I am not keen on increasing the number of codes in Sandbox. These are undocumented and a bit of a wasteland. We might discuss a better location. Is this a utility that could be run to assess the delay?

@ye-luo
Copy link
Contributor Author

ye-luo commented Nov 26, 2018

  1. Will do. I will be more strict on public/private.
  2. Could you comment in-source? I'm confused about which macro you refer to.
  3. The diamond test has 8 electrons in each spin and the delay was made 5.
  4. Will do.
  5. Yes. It can be used to access the delay. However, it is not recommended for users because it cannot accurately reflect the code performance due to the lack of SPO evaluation. When it is moved to miniQMC, it will be a useful tool. Its setting is consistent with the other code in the Sandbox. I prefer leaving it in Sandbox.

@ye-luo ye-luo requested review from atillack and PDoakORNL November 26, 2018 14:49
adet->set(firstIndex,lastIndex-firstIndex);
if( delay_rank<=0 || delay_rank>lastIndex-firstIndex )
{
APP_ABORT("SlaterDetBuilder::putDeterminant delay_rank must be positive and no larger than the electron count within a determinant!\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message should print the values of delay_rank and lastIndex-firstIndex.

@PDoakORNL
Copy link
Contributor

PDoakORNL commented Nov 27, 2018

I do not want approve this until your delayed update work is hidden by an abstraction that preserves the previous functionality. Am I missing that, at first read I just see Dirac determinant that now works differently it doesn't look like I can retrieve row update behavior. I can not A/B the two different methods for the same commit. A feature completely replacing a previous feature should demonstrate identical or better test compliance on the same commit. The old feature can then be removed if necessary.

Work like this should be behind a feature switch preserving exisiting implementation and merged several times a week, this will allow other work to be done and force the new feature to sync with other work done in the codebase instead of motivating developers to block other work.

@ye-luo
Copy link
Contributor Author

ye-luo commented Nov 27, 2018

The previous rank-1 is within this implementation. it is IDENTICAL to the old code if rank_delay=1. There is no need to have two implementation if one can do both work. All the existing unit tests are still valid and rank_delay>1 unit test is also added.

@prckent
Copy link
Contributor

prckent commented Nov 27, 2018

Comments focus on making DelayedUpdate more accessible to our future selves:

  • Doxygen comments needed for all functions in DelayedUpdate.h
  • Describe all variables in the state machine in DelayedUpdate.h
  • Describe how the state machine works in DelayedUpdate.h (in the description of one of the functions would be fine)
  • Is there a tested reason for the large functions with inline hints in DelayedUpdate.h? The moment these are non-trivial in size or have function calls such as BLAS, they should probably be in a .cpp. We do this a lot across the application.
  • Add comments about the logic in DelayedUpdate.h. e.g. What does delay_count=0 mean/do?
  • Please can you comment the unit test. How are delays checked? (The obvious thing to do here is update one set of matrices rank 1 and another with a delay, and check the ratios, gradients etc.)
  • set should be called something more descriptive. e.g. resetFirstParticleAndDelay
  • This looks to be direct translation of Tyler's algorithm+implementation (great). Any sneaky extra tricks for efficiency not in that paper should be highlighted.

More general Q:

  • Do we ever resize a determinant in the entire application? (Not relevant for this PR, but it is extra complexity that might not be used.)

@ye-luo
Copy link
Contributor Author

ye-luo commented Nov 27, 2018

Hopefully I put sufficient documentation.

  1. inline is only a hint to the compiler, it is the compiler's decision to inline a function or not. inline a function with some BLAS call in it doesn't mean inline the BLAS functions but only put the line of BLAS call in the caller routine.
  2. DelayedUpdate is a template function. Putting some of it in .cpp is messy. All the functions here are quite light weight.
  3. What is the set() function you refer to? The one in the DiracDeterminant class? That is a virtual function and I'm expecting to change the class spaghetti in a separate PR.
  4. The math may look familiar to you but this is NOT Tyler's algorithm+implementation at all. I started with the Fahy's variant and solve all the drawback in Tyler's algorithm+implementation. There is no look-ahead needed which is impractical to use in DMC. There is no factorization of any matrix to compute the determinant. There is no fake acceptance needed. My algorithm has zero lines of code dealing with rejecting move. You will see more detail in the upcoming paper. This is first time having the full algorithm working for VMC/DMC/RMC. Although I also start originally from SMW identity like Tyler did, 'direct translation' is a wrong quote.

@prckent
Copy link
Contributor

prckent commented Nov 27, 2018

Thanks Ye. Yes set() is part of a larger mess.

I think short-diamondC_2x1x1_pp-delayed_update-vmc_sdj-1-16* needs test labels added. They report as "quality unknown". Happily this version already runs a smidge faster than the non-delayed version, at least on oxygen.

@prckent
Copy link
Contributor

prckent commented Nov 27, 2018

What are the rules around calling completeUpdates ?

@ye-luo
Copy link
Contributor Author

ye-luo commented Nov 27, 2018

  1. We should not reference performance on small tests unless it is 2x or more. Once the PR is merged, the performance tests will reflect the performance gain.
  2. tags are not controlled by me and I think we will visit later when categorizing more tests.
  3. I added comment in WaveFunctionComponent.h about completeUpdates. It must be called after each substep or step during pbyp move. For each step, we will evaluate hamiltonian and the inverse matrix must be updated. The algorithm allows moving electron in any order but doesn't allow delay the same electron twice. For this reason, it must be called also after each substep.
  4. To the previous question: we never resize a determinant in the entire application.

@ye-luo
Copy link
Contributor Author

ye-luo commented Nov 28, 2018

Can we merge this? The change in this PR is quite local and clean.

@prckent
Copy link
Contributor

prckent commented Nov 28, 2018

Will discuss with @PDoakORNL this afternoon. Comment to follow.

@PDoakORNL PDoakORNL merged commit 90b9f83 into QMCPACK:develop Nov 28, 2018
@ghost ghost removed the in progress label Nov 28, 2018
@ye-luo ye-luo deleted the delayed-update-merge branch November 29, 2018 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants