Parallel gridlink #239
Conversation
I'll note that I think an even faster version would be possible if the threads were doing sequential writes and random reads, instead of random writes and sequential reads. But the sequential-write version requires a fast, parallel argsort of the cell indices. I couldn't find one readily available in C, and I think the current version is fast enough that gridlink is unlikely to be a bottleneck. Also, astropy bot appears to be afk...?
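For concreteness, the sequential-write variant would reduce to a gather loop once an argsort of the cell indices is in hand. A minimal sketch, assuming a precomputed order array; gather_particles and xout/yout/zout are illustrative names, not code from this PR:

#include <stdint.h>

/* Sketch: gather formulation of the particle binning. Given `order`, a
   precomputed argsort of the cell indices, each thread writes a contiguous
   chunk of the output (sequential writes) while reading particles at
   scattered positions (random reads). */
static void gather_particles(const double *x, const double *y, const double *z,
                             const int64_t *order, int64_t N,
                             double *xout, double *yout, double *zout)
{
#if defined(_OPENMP)
    #pragma omp parallel for
#endif
    for(int64_t i = 0; i < N; i++){
        const int64_t src = order[i];
        xout[i] = x[src];
        yout[i] = y[src];
        zout[i] = z[src];
    }
}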
@lgarrison Think we are almost there to a parallel counting sort - that might be a cleaner + faster solution. I will take a look regardless - changing the serial-ness of gridlink would be great. (I have long had goals for creating an OpenMP version of

And yup - think astropy-bot is deprecated/non-functional - at least based on the (now archived)
Thanks! This approach indeed might just be a hack until we have time to plan a better version. Does a counting sort actually work for this problem? My understanding was that counting sort worked by counting integer occurrences, sort of "compressing" the information into a histogram, and therefore losing the original indices. I guess a radix argsort might work since that preserves the original keys in the radix buckets, but I couldn't find a good parallel radix argsort in C. Definitely worth exploring though; we could write one ourselves.
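For reference, a counting sort can be made to preserve the original indices by prefix-summing the histogram and scattering the indices, i.e. a counting-sort-based argsort. A minimal serial sketch; counting_argsort, cellindex, and ncells are illustrative names rather than Corrfunc code, and a parallel version would still need per-thread histograms:

#include <stdint.h>
#include <stdlib.h>

/* Sketch: argsort of integer cell indices via a counting sort. cellindex[i]
   is the cell of particle i, with 0 <= cellindex[i] < ncells. Returns `order`
   such that cellindex[order[0]] <= cellindex[order[1]] <= ..., with ties kept
   in their original order (stable). Returns NULL on allocation failure. */
static int64_t *counting_argsort(const int64_t *cellindex, int64_t N, int64_t ncells)
{
    int64_t *count = calloc(ncells + 1, sizeof(*count));
    int64_t *order = malloc(N * sizeof(*order));
    if(count == NULL || order == NULL){
        free(count); free(order);
        return NULL;
    }

    /* histogram of cell occupations, shifted by one for the prefix sum */
    for(int64_t i = 0; i < N; i++) count[cellindex[i] + 1]++;

    /* exclusive prefix sum -> starting offset of each cell */
    for(int64_t c = 0; c < ncells; c++) count[c + 1] += count[c];

    /* scatter the original particle indices into their sorted positions */
    for(int64_t i = 0; i < N; i++) order[count[cellindex[i]]++] = i;

    free(count);
    return order;
}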
It seems that astropy bot may technically still be alive, but is in limbo porting between Heroku and GitHub Actions, based on astropy/astropy-bot#122 and astropy/astropy-bot#125
@lgarrison Can you check what happens if you change the sort to a heap-sort and enable parallelism through tasks (by editing the
Just tried it; doesn't seem to sort correctly, even single threaded, unless I put a

Single threaded (without any OpenMP), the heap sort runs at about 200 ms for 1e6 particles (not counting the time to compute the indices). Given that the implementation in this PR runs at 35 to 150 ms with 1 thread on 1e6 particles (depending on ppc), I think a sorting-based gridlink will have to use a really, really fast sort. Maybe a radix sort will be fast enough. It would be interesting to test the speed of a radix argsort versus a radix sort that swaps particles as it goes. My guess is that the argsort version will be faster, just because the data movement is smaller during the sort, and the "heavy" movement is saved until the end.
@lgarrison Took a quick look through the updates - this is a great effort! Think I mostly understand what's going on and left some comments where I was not sure. I can totally see that the overall data movement is reduced and this will be faster for most cases. I am assuming that you have checked that the

Ohh and did you check that there are no memory leaks/memory access violations with the
I've applied the changes from above and added clarifying comments in places. I double-checked

Ready for merge? I'll add a changelog entry.
(sorry, the merge commits got a little messy at the end there, might want to squash and merge)
@lgarrison There are a bunch of things that require my immediate attention. I am planning to come back to this later in the week and give it one final checkup before merging. I did have a few related ideas:
Thanks again!
Thanks, no worries!
Not sure! I can give it a try. It may be hard to measure a difference though; that part of the code is really fast.
Yes, I agree, it's a very unusual way to go about parallelization! It was not at all obvious to me that it would work as well as it seems to. I think it's just a hack until we can write a sorting-based method, which would be much more elegant. Since that's the case, I'm inclined not to optimize it further.
It looks like
No problem! Was fun to write.
I still have not understood the parallel sorting :(. I will try again tomorrow
utils/gridlink_impl.c.src (outdated)
for(int64_t c = 0; c < totncells; c++){
    for(int t = 1; t < max_nthreads; t++){
        cell_occupation[0][c] += cell_occupation[t][c];
    }
}
Isn't this also a parallel prefix sum?
Almost, it's a reduction into the "thread 0" element for each cell.
Ohh right, of course! Instead of writing into the memory address directly, how about this (hopefully) equivalent formulation:

for(int64_t c = 0; c < totncells; c++) {
    int64_t tsum = 0;
#if defined(_OPENMP)
    #pragma omp simd reduction(+:tsum)
#endif
    for(int t = 0; t < max_nthreads; t++){
        tsum += cell_occupation[t][c];
    }
#if defined(_OPENMP)
    #pragma omp single
#endif
    cell_occupation[0][c] = tsum;
}

Very obviously untested and potentially buggy. Might need the correct setting for nested parallelism.
I don't think much of the runtime is going here, so I'm inclined to leave the simpler version. Especially since we hope all this will be nuked by the sorting version!
utils/utils.c (outdated)
int64_t min_N_per_thread = 10000; // heuristic to avoid overheads
if(N/min_N_per_thread < nthreads){
    nthreads = N/min_N_per_thread;
}
if(nthreads < 1){
    nthreads = 1;
}
Looks like you are changing the number of threads used for the computation, upper-limited by max_threads. But max_threads could be quite large (say 32) while the requested nthreads could be 4. Is it possible for the code to use a larger number of threads than requested?
OMP_NUM_THREADS and omp_set_num_threads() control omp_get_max_threads(), so this should be safe.
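A standalone check of that behavior (not part of the PR):

#include <stdio.h>
#include <omp.h>

/* omp_get_max_threads() reflects OMP_NUM_THREADS and omp_set_num_threads(),
   so capping nthreads by it cannot exceed what the caller requested. */
int main(void)
{
    omp_set_num_threads(4);                  /* caller requests 4 threads    */
    printf("%d\n", omp_get_max_threads());   /* prints 4, not the core count */
    return 0;
}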
int64_t offset = 0;
for(int t = 0; t < tid; t++){
    offset += cumsum[N*(t+1)/nthreads-1];
}
Could this be annotated with a #pragma omp simd reduction(+:offset)?
Although this is probably an immeasurable amount of time :D
Yeah, seems hard to gain much here. I bet that would be safe with the reduction clause, but I also kind of doubt it would do a SIMD transform, given that the memory accesses are non-contiguous.
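For the record, the annotation being discussed would look roughly like this; an untested sketch of the same loop as above, and as noted, the strided reads probably defeat an actual SIMD transform:

int64_t offset = 0;
#if defined(_OPENMP)
    #pragma omp simd reduction(+:offset)
#endif
for(int t = 0; t < tid; t++){
    offset += cumsum[N*(t+1)/nthreads-1];
}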
@lgarrison Alright I have taken a final look and all seems to be fine. Left a couple of comments about calloc/malloc function calls and parameters etc.
The one thing that I am not certain about is in the parallel_cumsum function - will the code still work if N is smaller than the number of threads?
And will you please run the tests to check for invalid memory usage with the following:

make distclean
export CFLAGS="-fsanitize=leak -fsanitize=undefined -fsanitize=bounds -fsanitize=address -fsanitize-address-use-after-scope -fsanitize-undefined-trap-on-error -fstack-protector-all"
export CLINK="-fsanitize=leak -fsanitize=undefined -fsanitize=bounds -fsanitize=address -fsanitize-address-use-after-scope -fsanitize-undefined-trap-on-error -fstack-protector-all"
make tests

If you are feeling particularly cautious then feel free to add
Just tested; the core theory and mock tests look clean under the sanitizers. The Python bindings are proving harder to test with sanitizers. They complain about
They all look like Python-internal things; I don't see any traces related to Corrfunc. I tested the master branch, and it shows the same errors. So I think the errors are unrelated to this PR.
Ohh, I have never run the memcheck on the Python API. It's not great that there are memory leaks (but obviously not relevant to this PR). @lgarrison All good to squash-merge the PR?
Yes, ready!
Re: astropy-bot -- Hello! Please open issues at https://github.com/astropy/astropy-bot for how the bot isn't working for you. I do notice that it is flaky in
@pllim Astropy bot did eventually kick in; I see the checks as passing on this PR. I don't see the usual comment from the bot in this thread, but maybe that's because it happened to be down when we opened the PR. We'll keep an eye out on the next PR!
@lgarrison The gridlink-parallel branch can now be deleted, right?
Yep! I'll go ahead and do that.
Hey @manodeep, here's a stab at a parallel implementation of gridlink! It was motivated by a recent example that popped up in my research where the actual pair counting was very fast (due to low sample density and low rmax), but gridlink took about the same amount of time as the counting, due to the large Nmesh and thus the large number of mallocs and single-threaded particle writes. This version does one malloc for the whole grid (in detail, one malloc for each X/Y/Z/W column of particle data), calculates cell assignments in parallel, then uses that information to do parallel writes of the particle data into the appropriate cells.
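In outline, the approach looks roughly like the sketch below. This is a simplified stand-in, not the actual Corrfunc implementation: parallel_bin and cell_index are illustrative names, atomics stand in for the per-thread occupation histograms used in the real code, and the W column and periodic wrapping are omitted.

#include <stdint.h>
#include <stdlib.h>

/* Illustrative cell index for a cubic box of side `boxsize` split into
   nmesh^3 cells (not the actual Corrfunc formula). */
static inline int64_t cell_index(double px, double py, double pz,
                                 double boxsize, int nmesh)
{
    int ix = (int)(px/boxsize*nmesh); if(ix == nmesh) ix--;
    int iy = (int)(py/boxsize*nmesh); if(iy == nmesh) iy--;
    int iz = (int)(pz/boxsize*nmesh); if(iz == nmesh) iz--;
    return ((int64_t) ix*nmesh + iy)*nmesh + iz;
}

/* Two-pass sketch: (1) compute every particle's cell in parallel and histogram
   the occupations; (2) prefix-sum the histogram into per-cell offsets, then
   scatter the particles into one big pre-allocated buffer in parallel.
   Returns 0 on success, -1 on allocation failure. */
static int parallel_bin(const double *x, const double *y, const double *z,
                        int64_t N, double boxsize, int nmesh,
                        double *xout, double *yout, double *zout,
                        int64_t *cell_start /* size nmesh^3 + 1 */)
{
    const int64_t totncells = (int64_t) nmesh*nmesh*nmesh;
    int64_t *cellindex = malloc(N*sizeof(*cellindex));
    int64_t *next = malloc(totncells*sizeof(*next));
    if(cellindex == NULL || next == NULL){
        free(cellindex); free(next);
        return -1;
    }
    for(int64_t c = 0; c <= totncells; c++) cell_start[c] = 0;

    /* Pass 1: cell assignment and occupation counts */
#if defined(_OPENMP)
    #pragma omp parallel for
#endif
    for(int64_t i = 0; i < N; i++){
        cellindex[i] = cell_index(x[i], y[i], z[i], boxsize, nmesh);
#if defined(_OPENMP)
        #pragma omp atomic
#endif
        cell_start[cellindex[i] + 1]++;
    }

    /* Serial prefix sum -> starting offset of each cell in the output buffer */
    for(int64_t c = 0; c < totncells; c++) cell_start[c + 1] += cell_start[c];
    for(int64_t c = 0; c < totncells; c++) next[c] = cell_start[c];

    /* Pass 2: scatter particles into their cells (parallel writes) */
#if defined(_OPENMP)
    #pragma omp parallel for
#endif
    for(int64_t i = 0; i < N; i++){
        int64_t dst;
#if defined(_OPENMP)
        #pragma omp atomic capture
#endif
        dst = next[cellindex[i]]++;
        xout[dst] = x[i]; yout[dst] = y[i]; zout[dst] = z[i];
    }

    free(cellindex); free(next);
    return 0;
}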
Because this parallelization requires two reads of the particle data instead of one, I expected that the new version running with one thread might be slower than the old single-threaded code, especially at high particles-per-cell where the malloc overhead is lower. However, that does not appear to be the case in benchmarks. The reads are probably cheap compared to the writes, and it's possible that there are vectorization opportunities, e.g. in the computation of the cell index, that are present in this version that were not present in the old version.
[Benchmark timing plots: Old, New, and New (parallel)]
These are direct benchmarks of gridlink_double() (i.e. calling directly with C code; mean of 10 runs) on a 12 core, 2 socket Intel machine. No sort_on_z, since that part is unchanged between the old and new versions. The thread scaling is far from perfect, but it's a nice little speedup. I expect this will be most helpful in latency-sensitive applications like HOD evaluations. The benchmark data is uniform random, but I've spot-checked with clustered data, which looks similar (and may be an even bigger win, because no more reallocs).

The tests all pass, and I've checked with and without OpenMP and copy_particles. Could you take a look and let me know what you think?