Parallel gridlink #239

Merged: 13 commits into master from gridlink-parallel on Feb 11, 2021

Conversation

lgarrison
Collaborator

Hey @manodeep, here's a stab at a parallel implementation of gridlink! It was motivated by a recent example that popped up in my research where the actual pair counting was very fast (due to low sample density and low rmax), but gridlink took about the same amount of time as the counting, due to the large Nmesh, and thus large number of mallocs and single-threaded particle writes. This version does one malloc for the whole grid (in detail, one malloc for each X/Y/Z/W column of particle data), calculates cell assignments in parallel, then uses that information to do parallel writes of the particle data into the appropriate cells.

Because this parallelization requires two reads of the particle data instead of one, I expected that the new version running with one thread might be slower than the old single-threaded code, especially at high particles-per-cell where the malloc overhead is lower. However, that does not appear to be the case in benchmarks. The reads are probably cheap compared to the writes, and it's possible that there are vectorization opportunities, e.g. in the computation of the cell index, that are present in this version that were not present in the old version.
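
For concreteness, here is a rough sketch of the two-pass scheme (illustrative names only; the actual PR keeps per-thread occupation histograms and per-column X/Y/Z/W buffers rather than the atomics used below):

```c
#include <stdint.h>
#include <stdlib.h>

/* Rough sketch of the two-pass gridlink scheme (illustrative only):
   pass 1 assigns each particle a cell index and counts cell occupations,
   a prefix sum converts the counts into write offsets into one big
   allocation, and pass 2 scatters the particles into their cells. */
static void gridlink_sketch(const double *x, int64_t np, int64_t ncells,
                            int64_t (*cell_of)(double), double *x_sorted)
{
    int64_t *cellindex = malloc(np * sizeof(*cellindex));
    int64_t *counts = calloc(ncells, sizeof(*counts));
    int64_t *offsets = malloc(ncells * sizeof(*offsets));

    /* Pass 1: cell assignment in parallel (occupation counts shown with
       atomics here; the PR accumulates per-thread histograms instead). */
    #pragma omp parallel for
    for (int64_t i = 0; i < np; i++) {
        cellindex[i] = cell_of(x[i]);
        #pragma omp atomic
        counts[cellindex[i]]++;
    }

    /* Exclusive prefix sum: each cell's starting offset in the single
       pre-allocated particle buffer. */
    int64_t running = 0;
    for (int64_t c = 0; c < ncells; c++) {
        offsets[c] = running;
        running += counts[c];
    }

    /* Pass 2: parallel scatter of the particle data into the right cells. */
    #pragma omp parallel for
    for (int64_t i = 0; i < np; i++) {
        int64_t slot;
        #pragma omp atomic capture
        slot = offsets[cellindex[i]]++;
        x_sorted[slot] = x[i];
    }

    free(cellindex); free(counts); free(offsets);
}
```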

Old:

   6.0 ms (Ngrid   10, N     100000, ppc    100,  1 threads)
  54.4 ms (Ngrid   10, N    1000000, ppc   1000,  1 threads)
 556.5 ms (Ngrid   10, N   10000000, ppc  10000,  1 threads)
5166.1 ms (Ngrid   10, N  100000000, ppc 100000,  1 threads)
  68.9 ms (Ngrid   20, N    1000000, ppc    125,  1 threads)
 137.3 ms (Ngrid   40, N    1000000, ppc     15,  1 threads)
 313.0 ms (Ngrid  100, N    1000000, ppc      1,  1 threads)

New:

   4.4 ms (Ngrid   10, N     100000, ppc    100,  1 threads)
  34.5 ms (Ngrid   10, N    1000000, ppc   1000,  1 threads)
 338.9 ms (Ngrid   10, N   10000000, ppc  10000,  1 threads)
3472.7 ms (Ngrid   10, N  100000000, ppc 100000,  1 threads)
  46.6 ms (Ngrid   20, N    1000000, ppc    125,  1 threads)
  75.2 ms (Ngrid   40, N    1000000, ppc     15,  1 threads)
 150.1 ms (Ngrid  100, N    1000000, ppc      1,  1 threads)

New, parallel:

   2.3 ms (Ngrid   10, N     100000, ppc    100, 12 threads)
   8.5 ms (Ngrid   10, N    1000000, ppc   1000, 12 threads)
  70.5 ms (Ngrid   10, N   10000000, ppc  10000, 12 threads)
 721.8 ms (Ngrid   10, N  100000000, ppc 100000, 12 threads)
   9.9 ms (Ngrid   20, N    1000000, ppc    125, 12 threads)
  11.0 ms (Ngrid   40, N    1000000, ppc     15, 12 threads)
  32.8 ms (Ngrid  100, N    1000000, ppc      1, 12 threads)

These are direct benchmarks of gridlink_double() (i.e. calling directly with C code; mean of 10 runs) on a 12 core, 2 socket Intel machine. No sort_on_z, since that part is unchanged between the old and new versions. The thread scaling is far from perfect, but it's a nice little speedup. I expect this will be most helpful in latency-sensitive applications like HOD evaluations. The benchmark data is uniform random, but I've spot-checked with clustered data, which looks similar (and may be an even bigger win, because no more reallocs).
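
For reference, the harness is roughly of this form (hypothetical sketch; run_gridlink() stands in for the actual gridlink_double() invocation and its arguments):

```c
#include <stdio.h>
#include <time.h>

#define NREP 10   /* mean of 10 runs, as in the numbers above */

extern void run_gridlink(void);   /* placeholder for the gridlink_double() call under test */

int main(void)
{
    double total_ms = 0.0;
    for (int r = 0; r < NREP; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_gridlink();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_ms += (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }
    printf("%7.1f ms (mean of %d runs)\n", total_ms / NREP, NREP);
    return 0;
}
```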

The tests all pass, and I've checked with and without OpenMP and copy_particles. Could you take a look and let me know what you think?

@lgarrison
Collaborator Author

I'll note that I think an even faster version would be possible if the threads were doing sequential writes and random reads, instead of random writes and sequential reads. But the sequential write version requires a fast, parallel argsort of the cell indices. I couldn't find one readily available in C, and I think the current version is fast enough that gridlink is unlikely to be a bottleneck.
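
For the record, that variant would look roughly like this (hypothetical sketch; parallel_argsort_int64() is a placeholder for the fast parallel argsort that isn't readily available):

```c
#include <stdint.h>
#include <stdlib.h>

/* Placeholder: the fast, parallel argsort of the cell indices that this
   variant would require (not implemented here). */
void parallel_argsort_int64(const int64_t *keys, int64_t n, int64_t *order);

/* Hypothetical sketch of the sequential-write variant: argsort the cell
   indices, then gather the particles in cell order, so each thread does
   contiguous writes and scattered reads instead of the reverse. */
static void gridlink_gather_sketch(const double *x, const int64_t *cellindex,
                                   int64_t np, double *x_sorted)
{
    int64_t *order = malloc(np * sizeof(*order));
    parallel_argsort_int64(cellindex, np, order);

    #pragma omp parallel for
    for (int64_t i = 0; i < np; i++)
        x_sorted[i] = x[order[i]];   /* sequential write, random read */

    free(order);
}
```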

Also, astropy bot appears to be afk...?

@manodeep
Owner

manodeep commented Jan 5, 2021

@lgarrison I think we are almost there with a parallel counting sort - that might be a cleaner and faster solution. I will take a look regardless - making gridlink no longer serial would be great. (I have long wanted to create an OpenMP version of sglib.h, but the amount of macro gymnastics required to support both OpenMP and non-OpenMP has been too big of a hurdle!)

And yup - I think astropy-bot is deprecated/non-functional - at least based on the (now archived) astropy-bot channel on the astropy slack. We might have to set up a different solution through GH Actions.

@lgarrison
Collaborator Author

Thanks! This approach indeed might just be a hack until we have time to plan a better version.

Does a counting sort actually work for this problem? My understanding was that counting sort worked by counting integer occurrences, sort of "compressing" the information into a histogram, and therefore losing the original indices. I guess a radix argsort might work since that preserves the original keys in the radix buckets, but I couldn't find a good parallel radix argsort in C. Definitely worth exploring though; we could write one ourselves.
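
For concreteness, the kind of index-preserving bucket sort being discussed could look like the serial sketch below (illustrative only; whether a fast parallel version exists in C is the open question). It is one pass of a radix-style argsort, or equivalently a counting argsort over the full key range:

```c
#include <stdint.h>
#include <stdlib.h>

/* Serial, illustrative sketch of an index-preserving counting/radix-style
   argsort pass: histogram the keys, prefix-sum the counts into bucket
   offsets, then stably scatter the *original indices* into the output.
   An LSD radix argsort applies the same idea per digit (nkeys = 256). */
static void counting_argsort(const int64_t *keys, int64_t n, int64_t nkeys,
                             int64_t *order)
{
    int64_t *offsets = calloc(nkeys + 1, sizeof(*offsets));

    for (int64_t i = 0; i < n; i++)        /* histogram of key occurrences */
        offsets[keys[i] + 1]++;
    for (int64_t k = 0; k < nkeys; k++)    /* exclusive prefix sum */
        offsets[k + 1] += offsets[k];
    for (int64_t i = 0; i < n; i++)        /* stable scatter of indices */
        order[offsets[keys[i]]++] = i;

    free(offsets);
}
```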

@lgarrison
Collaborator Author

It seems that astropy bot may technically still be alive, but is in limbo porting between Heroku and GitHub Actions, based on astropy/astropy-bot#122 and astropy/astropy-bot#125

@manodeep
Owner

manodeep commented Jan 8, 2021

@lgarrison Can you check what happens if you change the sort to a heap-sort and enable parallelism through tasks (by editing sglib.h)? I think if you add a #pragma omp task here and here, a #pragma omp parallel and a #pragma omp single at the top of the first invocation of the heapsort, and a #pragma omp taskwait here, then it might all work out.

@lgarrison
Collaborator Author

Just tried it; doesn't seem to sort correctly, even single threaded, unless I put a taskwait inside both loops. Looking at the HEAP_DOWN source, I think the different tasks/loop iterations are racing with each other (e.g. the _l_ variable in some iterations will be equal to the _i_ variable in others).

Single threaded (without any OpenMP), the heap sort runs at about 200 ms for 1e6 particles (not counting the time to compute the indices). Given that the implementation in this PR runs at 35 to 150 ms with 1 thread on 1e6 particles (depending on ppc), I think a sorting-based gridlink will have to use a really, really fast sort. Maybe a radix sort will be fast enough.

It would be interesting to test the speed of a radix argsort versus a radix sort that swaps particles as it goes. My guess is that the argsort version will be faster, just because the data movement is smaller during the sort, and the "heavy" movement is saved until the end.

@manodeep
Owner

manodeep commented Jan 9, 2021

@lgarrison Took a quick look through the updates - this is a great effort! Think I mostly understand what's going on and left some comments where I was not sure. I can totally see that the overall data movement is reduced and this will be faster for most cases. I am assuming that you have checked that the parallel_cumsum implementation works for edge-cases of combinations of Nthreads > totncells, Nthreads < totncells, and Nthreads == totncells.

Ohh, and did you check that there are no memory leaks/memory access violations with the -fsanitize family of flags?

@lgarrison
Collaborator Author

I've applied the changes from above and added clarifying comments in places. I double-checked parallel_cumsum for all the cases you mentioned, and it looks good (in detail, the function limits nthreads < ncell so that the algorithm never has to worry about empty threads). I checked -fsanitize=address -fsanitize=leak -fsanitize=undefined, and it passes with no errors.
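
For anyone following along, here is a minimal sketch of the kind of two-pass parallel cumulative sum being described (not the exact utils.c code; the thread-count cap mirrors the min-work-per-thread heuristic quoted in the review below):

```c
#include <stdint.h>
#include <omp.h>

/* Minimal sketch (not the exact utils.c implementation) of a two-pass
   parallel inclusive prefix sum: each thread scans its own chunk, then
   every thread shifts its chunk by the total of all earlier chunks. */
static void parallel_cumsum_sketch(const int64_t *in, int64_t N, int64_t *out)
{
    int nthreads = omp_get_max_threads();
    const int64_t min_N_per_thread = 10000;      /* work-per-thread heuristic */
    if (N / min_N_per_thread < nthreads) nthreads = N / min_N_per_thread;
    if (nthreads < 1) nthreads = 1;

    #pragma omp parallel num_threads(nthreads)
    {
        const int nt = omp_get_num_threads();    /* actual team size */
        const int tid = omp_get_thread_num();
        const int64_t lo = N * tid / nt, hi = N * (tid + 1) / nt;

        /* Pass 1: inclusive scan within this thread's chunk. */
        int64_t running = 0;
        for (int64_t i = lo; i < hi; i++) {
            running += in[i];
            out[i] = running;
        }

        #pragma omp barrier

        /* Each thread's offset is the sum of the totals of all earlier
           chunks, read from the pass-1 results... */
        int64_t offset = 0;
        for (int t = 0; t < tid; t++)
            offset += out[N * (t + 1) / nt - 1];

        /* ...and only after every thread has read them do we update. */
        #pragma omp barrier

        /* Pass 2: shift this chunk by the offset. */
        for (int64_t i = lo; i < hi; i++)
            out[i] += offset;
    }
}
```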

Ready for merge? I'll add a changelog entry.

@lgarrison added this to the v2.4.0 milestone on Jan 11, 2021
@lgarrison
Collaborator Author

(sorry, the merge commits got a little messy at the end there, might want to squash and merge)

@manodeep
Owner

@lgarrison There are a bunch of things that require my immediate attention. I am planning to come back to this later in the week and give it one final checkup before merging. I did have a few related ideas:

  1. Does the #pragma omp simd for work correctly for the loop in parallel_cumsum? I have never used that pragma, but it seems like it might be useful.
  2. That loop over the entire particle list by all threads is bothering me - do you think there is a way to create thread-local lower and upper bounds for the particles within the thread-specific range of cell-indices during the first pass over all particles?
  3. Is it feasible to extend the parallel gridlink to DDtheta? (Think all other pair-counters use the vanilla gridlink, but I could be wrong.)

Thanks again!

@lgarrison
Collaborator Author

> @lgarrison There are a bunch of things that require my immediate attention. I am planning to come back to this later in the week and give it one final checkup before merging. I did have a few related ideas:

Thanks, no worries!

> 1. Does the `#pragma omp simd for` work correctly for the loop in `parallel_cumsum`? I have never used that pragma, but it seems like it might be useful.

Not sure! I can give it a try. It may be hard to measure a difference though; that part of the code is really fast.

> 2. That loop over the entire particle list by all threads is bothering me - do you think there is a way to create thread-local lower and upper bounds for the particles within the thread-specific range of cell-indices during the first pass over all particles?

Yes, I agree, it's a very unusual way to go about parallelization! It was not at all obvious to me that it would work as well as it seems to. I think it's just a hack until we can write a sorting-based method, which would be much more elegant. Since that's the case, I'm inclined not to optimize it further.

> 3. Is it feasible to extend the parallel gridlink to `DDtheta`? (Think all other pair-counters use the vanilla gridlink, but I could be wrong.)

It looks like gridlink_mocks_DOUBLE is used by mocks/DDsmu and mocks/DDrppi, and gridlink_mocks_theta_ra_dec_DOUBLE is used by mocks/DDtheta. I think I am out of time right now to port this over to the other gridlink implementations; we may be happier waiting until we come up with a sorting-based method in any case. I will open an issue to track this.

Thanks again!

No problem! Was fun to write.

Owner

@manodeep left a comment

I still have not understood the parallel sorting :(. I will try again tomorrow

Comment on lines 188 to 192
for(int64_t c = 0; c < totncells; c++){
    for(int t = 1; t < max_nthreads; t++){
        cell_occupation[0][c] += cell_occupation[t][c];
    }
}
Owner

Isn't this also a parallel prefix sum?

Collaborator Author

Almost, it's a reduction into the "thread 0" element for each cell.

Owner

@manodeep commented Jan 21, 2021

Ohh right, of course! Instead of writing into the memory address directly, how about this (hopefully) equivalent formulation:

for(int64_t c = 0; c < totncells; c++) {
    int64_t tsum = 0;
#if defined(_OPENMP)
#pragma omp simd reduction(+:tsum)
#endif
    for(int t = 0; t < max_nthreads; t++){
        tsum += cell_occupation[t][c];
    }
    cell_occupation[0][c] = tsum;
}

Very obviously untested and potentially buggy. Might need the correct settings for nested parallelism.

Collaborator Author

I don't think much of the runtime is going here, so I'm inclined to leave the simpler version. Especially since we hope all this will be nuked by the sorting version!

utils/utils.c Outdated
Comment on lines 868 to 874
int64_t min_N_per_thread = 10000; // heuristic to avoid overheads
if(N/min_N_per_thread < nthreads){
    nthreads = N/min_N_per_thread;
}
if(nthreads < 1){
    nthreads = 1;
}
Owner

Looks like you are changing the number of threads used for the computation, upper-limited by max_threads. But max_threads could be quite large (say 32) while requested nthreads could be 4. Is it possible for the code to use a larger number of threads than requested?

Collaborator Author

OMP_NUM_THREADS and omp_set_num_threads() control omp_get_max_threads(), so this should be safe.
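
A quick illustration of that standard OpenMP behavior (an illustrative snippet, not Corrfunc code):

```c
#include <stdio.h>
#include <omp.h>

/* omp_get_max_threads() reflects the user's request (OMP_NUM_THREADS or
   omp_set_num_threads()), not the physical core count, so a worker count
   capped by it cannot exceed what was asked for. */
int main(void)
{
    omp_set_num_threads(4);                 /* user requests 4 threads */
    printf("%d\n", omp_get_max_threads());  /* prints 4, even on a 32-core machine */
    return 0;
}
```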

Comment on lines +897 to +900
int64_t offset = 0;
for(int t = 0; t < tid; t++){
    offset += cumsum[N*(t+1)/nthreads-1];
}
Owner

Could this be annotated with a #pragma omp simd for reduction(+:offset)?

Owner

Although this is probably an immeasurable amount of time :D

Collaborator Author

Yeah, seems hard to gain much here. I bet that would be safe with the reduction clause, but I also kind of doubt it would do a SIMD transform, given that the memory accesses are non-contiguous.

Owner

@manodeep left a comment

@lgarrison Alright I have taken a final look and all seems to be fine. Left a couple of comments about calloc/malloc function calls and parameters etc.

The one thing that I am not certain about is in the parallel_cumsum function - will the code still work if N is smaller than the number of threads?

@manodeep
Owner

manodeep commented Feb 8, 2021

And will you please run the tests to check for invalid memory usage with the following:

make distclean
export CFLAGS="-fsanitize=leak -fsanitize=undefined -fsanitize=bounds -fsanitize=address -fsanitize-address-use-after-scope -fsanitize-undefined-trap-on-error -fstack-protector-all"
export CLINK="-fsanitize=leak -fsanitize=undefined -fsanitize=bounds -fsanitize=address -fsanitize-address-use-after-scope -fsanitize-undefined-trap-on-error -fstack-protector-all"
make tests

If you are feeling particularly cautious, then feel free to add -DINTEGRATION_TESTS to CFLAGS and CLINK (heads-up: such tests take 2+ days to complete on a single node of our local supercomputer)

@lgarrison
Collaborator Author

Just tested; the core theory and mock tests look clean under the sanitizers.

The Python bindings are proving harder to test with sanitizers. They complain about libasan not being loaded first, and when I fix that with LD_PRELOAD, I end up with a bunch of Python-level errors like:

Direct leak of 425383 byte(s) in 292 object(s) allocated from:
    #0 0x155554a2655f in __interceptor_malloc /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libsanitizer/asan/asan_malloc_linux.cc:144
    #1 0x555555679551 in _PyObject_Malloc /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/obmalloc.c:1628
    #2 0x555555679551 in PyObject_Malloc /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/obmalloc.c:685

They all look like Python-internal things; I don't see any traces related to Corrfunc. I tested the master branch, and it shows the same errors. So I think the errors are unrelated to this PR.

@manodeep
Owner

Ohh, I have never run the memcheck on the Python API. It's not great that there are memory leaks (though they are obviously not relevant to this PR).

@lgarrison All good to squash-merge the PR?

@lgarrison
Collaborator Author

Yes, ready!

@pllim

pllim commented Feb 11, 2021

Re: astropy-bot -- Hello! Please open issues at https://github.com/astropy/astropy-bot for how the bot isn't working for you. I do notice that it is flaky in astropy but it still works most of the time...

@lgarrison
Collaborator Author

@pllim Astropy bot did eventually kick in; I see the checks as passing on this PR. I don't see the usual comment from the bot in this thread, but maybe that's because it happened to be down when we opened the PR. We'll keep an eye out on the next PR!

@lgarrison merged commit 95245a8 into master on Feb 11, 2021
@manodeep
Owner

@lgarrison The gridlink-parallel branch can now be deleted right?

@lgarrison
Collaborator Author

Yep! I'll go ahead and do that.

@lgarrison deleted the gridlink-parallel branch on February 24, 2021