Simplifying fallback kernels #303
Conversation
@lgarrison Didn't realise I hadn't requested a review -- oops!
Ohh, forgot to mention that I ran the INTEGRATION_TESTS for this change and the exhaustive tests passed.
All looks fine to me! Did you do any tests to figure out where the speedup is coming from?
npairs[ibin]++;
if(need_rpavg) {
    rpavg[ibin]+=rp;
src_npairs[ibin]++;
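For context, here is a minimal sketch of the pattern being simplified (the names, signature, and bin count are placeholders, not the actual Corrfunc kernel code): the old fallback kernel accumulated into stack-local histograms and copied them into the caller's buffers at the end, whereas the new version increments the passed buffers directly.

#include <stdint.h>

#define NBINS 20  /* placeholder bin count */

/* Old pattern: stack-local buffers plus a copy-back loop at the end. */
static void kernel_old(int npts, const double *rp, const int *ibin_of,
                       uint64_t *src_npairs, double *src_rpavg, int need_rpavg)
{
    uint64_t npairs[NBINS] = {0};
    double rpavg[NBINS] = {0.0};
    for(int i = 0; i < npts; i++) {
        const int ibin = ibin_of[i];
        npairs[ibin]++;
        if(need_rpavg) {
            rpavg[ibin] += rp[i];
        }
    }
    for(int ibin = 0; ibin < NBINS; ibin++) {
        src_npairs[ibin] += npairs[ibin];
        if(need_rpavg) src_rpavg[ibin] += rpavg[ibin];
    }
}

/* New pattern: update the passed (already per-thread) buffers in place. */
static void kernel_new(int npts, const double *rp, const int *ibin_of,
                       uint64_t *src_npairs, double *src_rpavg, int need_rpavg)
{
    for(int i = 0; i < npts; i++) {
        const int ibin = ibin_of[i];
        src_npairs[ibin]++;
        if(need_rpavg) {
            src_rpavg[ibin] += rp[i];
        }
    }
}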
I guess the idea here is that there's no point in making stack buffers, since the passed buffers are already local to the current thread?
Yup. I am also considering whether it would be worthwhile to move to a malloc'ed buffer rather than the stack (but there may be side-effects of false sharing under OpenMP with such a malloc'ed src_npairs[nthreads][nbins] kind of matrix).
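A sketch of what such a heap-allocated src_npairs[nthreads][nbins] layout might look like, with each thread's row padded out to a whole number of cache lines to mitigate false sharing; the 64-byte line size, the names, and the final reduction step are assumptions for illustration, not the Corrfunc implementation.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define CACHE_LINE_BYTES 64  /* assumed cache-line size */

int main(void)
{
    const int nbins = 20;                        /* placeholder bin count */
    const int nthreads = omp_get_max_threads();

    /* Round each thread's row up to a whole number of cache lines so that
       no two threads ever write to the same cache line. */
    const size_t per_line = CACHE_LINE_BYTES / sizeof(uint64_t);
    const size_t stride = ((size_t) nbins + per_line - 1) / per_line * per_line;

    /* Cache-line-aligned nthreads x stride matrix of pair counts. */
    uint64_t *src_npairs = aligned_alloc(CACHE_LINE_BYTES,
                                         (size_t) nthreads * stride * sizeof(uint64_t));
    if(src_npairs == NULL) return EXIT_FAILURE;
    memset(src_npairs, 0, (size_t) nthreads * stride * sizeof(uint64_t));

#pragma omp parallel
    {
        uint64_t *my_npairs = src_npairs + (size_t) omp_get_thread_num() * stride;
        /* ... the kernel would accumulate into my_npairs[ibin] here ... */
        my_npairs[0]++;
    }

    /* Reduce the per-thread rows into the first row after the parallel region. */
    for(int t = 1; t < nthreads; t++)
        for(int i = 0; i < nbins; i++)
            src_npairs[i] += src_npairs[(size_t) t * stride + i];

    free(src_npairs);
    return EXIT_SUCCESS;
}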
Didn't actually do a line-by-line timing comparison. Will attempt to do that on my laptop; I will also check that the runtime is not adversely affected on our local Linux supercomputer (Skylake and AMD EPYC).
Timed the tests on the ...
Totally forgot to merge this PR!
Commenting to add the link to the original #296 that spurred this work.
Reduced the amount of code in the fallback kernels. At least on my M2 laptop, it runs faster: slightly faster (5-10%) for DD-type (i.e. small number-density) calculations and significantly (~20-25%) faster for RR-type calculations.