Parallel gridlink #239

Merged: 13 commits into master from gridlink-parallel on Feb 11, 2021

Conversation

lgarrison
Collaborator

Hey @manodeep, here's a stab at a parallel implementation of gridlink! It was motivated by a recent example that popped up in my research where the actual pair counting was very fast (due to low sample density and low rmax), but gridlink took about the same amount of time as the counting, due to the large Nmesh, and thus large number of mallocs and single-threaded particle writes. This version does one malloc for the whole grid (in detail, one malloc for each X/Y/Z/W column of particle data), calculates cell assignments in parallel, then uses that information to do parallel writes of the particle data into the appropriate cells.

Because this parallelization requires two reads of the particle data instead of one, I expected that the new version running with one thread might be slower than the old single-threaded code, especially at high particles-per-cell where the malloc overhead is lower. However, that does not appear to be the case in benchmarks. The reads are probably cheap compared to the writes, and it's possible that there are vectorization opportunities, e.g. in the computation of the cell index, that are present in this version that were not present in the old version.
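
For concreteness, here is a rough sketch of the two-pass scheme (illustrative names only; the actual PR keeps per-thread occupation histograms and per-column X/Y/Z/W buffers rather than the atomics used below):

```c
#include <stdint.h>
#include <stdlib.h>

/* Rough sketch of the two-pass gridlink scheme (illustrative only):
   pass 1 assigns each particle a cell index and counts cell occupations,
   a prefix sum converts the counts into write offsets into one big
   allocation, and pass 2 scatters the particles into their cells. */
static void gridlink_sketch(const double *x, int64_t np, int64_t ncells,
                            int64_t (*cell_of)(double), double *x_sorted)
{
    int64_t *cellindex = malloc(np * sizeof(*cellindex));
    int64_t *counts = calloc(ncells, sizeof(*counts));
    int64_t *offsets = malloc(ncells * sizeof(*offsets));

    /* Pass 1: cell assignment in parallel (occupation counts shown with
       atomics here; the PR accumulates per-thread histograms instead). */
    #pragma omp parallel for
    for (int64_t i = 0; i < np; i++) {
        cellindex[i] = cell_of(x[i]);
        #pragma omp atomic
        counts[cellindex[i]]++;
    }

    /* Exclusive prefix sum: each cell's starting offset in the single
       pre-allocated particle buffer. */
    int64_t running = 0;
    for (int64_t c = 0; c < ncells; c++) {
        offsets[c] = running;
        running += counts[c];
    }

    /* Pass 2: parallel scatter of the particle data into the right cells. */
    #pragma omp parallel for
    for (int64_t i = 0; i < np; i++) {
        int64_t slot;
        #pragma omp atomic capture
        slot = offsets[cellindex[i]]++;
        x_sorted[slot] = x[i];
    }

    free(cellindex); free(counts); free(offsets);
}
```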

Old:

   6.0 ms (Ngrid   10, N     100000, ppc    100,  1 threads)
  54.4 ms (Ngrid   10, N    1000000, ppc   1000,  1 threads)
 556.5 ms (Ngrid   10, N   10000000, ppc  10000,  1 threads)
5166.1 ms (Ngrid   10, N  100000000, ppc 100000,  1 threads)
  68.9 ms (Ngrid   20, N    1000000, ppc    125,  1 threads)
 137.3 ms (Ngrid   40, N    1000000, ppc     15,  1 threads)
 313.0 ms (Ngrid  100, N    1000000, ppc      1,  1 threads)

New:

   4.4 ms (Ngrid   10, N     100000, ppc    100,  1 threads)
  34.5 ms (Ngrid   10, N    1000000, ppc   1000,  1 threads)
 338.9 ms (Ngrid   10, N   10000000, ppc  10000,  1 threads)
3472.7 ms (Ngrid   10, N  100000000, ppc 100000,  1 threads)
  46.6 ms (Ngrid   20, N    1000000, ppc    125,  1 threads)
  75.2 ms (Ngrid   40, N    1000000, ppc     15,  1 threads)
 150.1 ms (Ngrid  100, N    1000000, ppc      1,  1 threads)

New, parallel:

   2.3 ms (Ngrid   10, N     100000, ppc    100, 12 threads)
   8.5 ms (Ngrid   10, N    1000000, ppc   1000, 12 threads)
  70.5 ms (Ngrid   10, N   10000000, ppc  10000, 12 threads)
 721.8 ms (Ngrid   10, N  100000000, ppc 100000, 12 threads)
   9.9 ms (Ngrid   20, N    1000000, ppc    125, 12 threads)
  11.0 ms (Ngrid   40, N    1000000, ppc     15, 12 threads)
  32.8 ms (Ngrid  100, N    1000000, ppc      1, 12 threads)

These are direct benchmarks of gridlink_double() (i.e. calling directly with C code; mean of 10 runs) on a 12 core, 2 socket Intel machine. No sort_on_z, since that part is unchanged between the old and new versions. The thread scaling is far from perfect, but it's a nice little speedup. I expect this will be most helpful in latency-sensitive applications like HOD evaluations. The benchmark data is uniform random, but I've spot-checked with clustered data, which looks similar (and may be an even bigger win, because no more reallocs).
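
For reference, the harness is roughly of this form (hypothetical sketch; run_gridlink() stands in for the actual gridlink_double() invocation and its arguments):

```c
#include <stdio.h>
#include <time.h>

#define NREP 10   /* mean of 10 runs, as in the numbers above */

extern void run_gridlink(void);   /* placeholder for the gridlink_double() call under test */

int main(void)
{
    double total_ms = 0.0;
    for (int r = 0; r < NREP; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_gridlink();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_ms += (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }
    printf("%7.1f ms (mean of %d runs)\n", total_ms / NREP, NREP);
    return 0;
}
```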

The tests all pass, and I've checked with and without OpenMP and copy_particles. Could you take a look and let me know what you think?

@lgarrison
Collaborator Author

I'll note that I think an even faster version would be possible if the threads were doing sequential writes and random reads, instead of random writes and sequential reads. But the sequential write version requires a fast, parallel argsort of the cell indices. I couldn't find one readily available in C, and I think the current version is fast enough that gridlink is unlikely to be a bottleneck.
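
For the record, that variant would look roughly like this (hypothetical sketch; parallel_argsort_int64() is a placeholder for the fast parallel argsort that isn't readily available):

```c
#include <stdint.h>
#include <stdlib.h>

/* Placeholder: the fast, parallel argsort of the cell indices that this
   variant would require (not implemented here). */
void parallel_argsort_int64(const int64_t *keys, int64_t n, int64_t *order);

/* Hypothetical sketch of the sequential-write variant: argsort the cell
   indices, then gather the particles in cell order, so each thread does
   contiguous writes and scattered reads instead of the reverse. */
static void gridlink_gather_sketch(const double *x, const int64_t *cellindex,
                                   int64_t np, double *x_sorted)
{
    int64_t *order = malloc(np * sizeof(*order));
    parallel_argsort_int64(cellindex, np, order);

    #pragma omp parallel for
    for (int64_t i = 0; i < np; i++)
        x_sorted[i] = x[order[i]];   /* sequential write, random read */

    free(order);
}
```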

Also, astropy bot appears to be afk...?

@manodeep
Owner

manodeep commented Jan 5, 2021

@lgarrison I think we are almost there with a parallel counting sort - that might be a cleaner and faster solution. I will take a look regardless - making gridlink no longer serial would be great. (I have long wanted to create an OpenMP version of sglib.h, but the amount of macro gymnastics required to support both OpenMP and non-OpenMP has been too big of a hurdle!)

And yup - I think astropy-bot is deprecated/non-functional - at least based on the (now archived) astropy-bot channel on the astropy slack. We might have to set up a different solution through GH Actions.

@lgarrison
Collaborator Author

Thanks! This approach indeed might just be a hack until we have time to plan a better version.

Does a counting sort actually work for this problem? My understanding was that counting sort worked by counting integer occurrences, sort of "compressing" the information into a histogram, and therefore losing the original indices. I guess a radix argsort might work since that preserves the original keys in the radix buckets, but I couldn't find a good parallel radix argsort in C. Definitely worth exploring though; we could write one ourselves.
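
For concreteness, the kind of index-preserving bucket sort being discussed could look like the serial sketch below (illustrative only; whether a fast parallel version exists in C is the open question). It is one pass of a radix-style argsort, or equivalently a counting argsort over the full key range:

```c
#include <stdint.h>
#include <stdlib.h>

/* Serial, illustrative sketch of an index-preserving counting/radix-style
   argsort pass: histogram the keys, prefix-sum the counts into bucket
   offsets, then stably scatter the *original indices* into the output.
   An LSD radix argsort applies the same idea per digit (nkeys = 256). */
static void counting_argsort(const int64_t *keys, int64_t n, int64_t nkeys,
                             int64_t *order)
{
    int64_t *offsets = calloc(nkeys + 1, sizeof(*offsets));

    for (int64_t i = 0; i < n; i++)        /* histogram of key occurrences */
        offsets[keys[i] + 1]++;
    for (int64_t k = 0; k < nkeys; k++)    /* exclusive prefix sum */
        offsets[k + 1] += offsets[k];
    for (int64_t i = 0; i < n; i++)        /* stable scatter of indices */
        order[offsets[keys[i]]++] = i;

    free(offsets);
}
```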

@lgarrison
Collaborator Author

It seems that astropy bot may technically still be alive, but is in limbo porting between Heroku and GitHub Actions, based on astropy/astropy-bot#122 and astropy/astropy-bot#125

@manodeep
Owner

manodeep commented Jan 8, 2021

@lgarrison Can you check what happens if you change the sort to a heap-sort and enable parallelism through tasks (by editing sglib.h)? I think if you add a #pragma omp task here and here, a #pragma omp parallel and a #pragma omp single at the top of the first invocation of the heapsort, and a #pragma omp taskwait here, then it might all work out.

@lgarrison
Collaborator Author

Just tried it; doesn't seem to sort correctly, even single threaded, unless I put a taskwait inside both loops. Looking at the HEAP_DOWN source, I think the different tasks/loop iterations are racing with each other (e.g. the _l_ variable in some iterations will be equal to the _i_ variable in others).

Single threaded (without any OpenMP), the heap sort runs at about 200 ms for 1e6 particles (not counting the time to compute the indices). Given that the implementation in this PR runs at 35 to 150 ms with 1 thread on 1e6 particles (depending on ppc), I think a sorting-based gridlink will have to use a really, really fast sort. Maybe a radix sort will be fast enough.

It would be interesting to test the speed of a radix argsort versus a radix sort that swaps particles as it goes. My guess is that the argsort version will be faster, just because the data movement is smaller during the sort, and the "heavy" movement is saved until the end.

@manodeep
Owner

manodeep commented Jan 9, 2021

@lgarrison Took a quick look through the updates - this is a great effort! Think I mostly understand what's going on and left some comments where I was not sure. I can totally see that the overall data movement is reduced and this will be faster for most cases. I am assuming that you have checked that the parallel_cumsum implementation works for edge-cases of combinations of Nthreads > totncells, Nthreads < totncells, and Nthreads == totncells.

Ohh, and did you check that there are no memory leaks/memory access violations with the -fsanitize family of flags?

@lgarrison
Collaborator Author

I've applied the changes from above and added clarifying comments in places. I double-checked parallel_cumsum for all the cases you mentioned, and it looks good (in detail, the function limits nthreads < ncell so that the algorithm never has to worry about empty threads). I checked -fsanitize=address -fsanitize=leak -fsanitize=undefined, and it passes with no errors.
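
For anyone following along, here is a minimal sketch of the kind of two-pass parallel cumulative sum being described (not the exact utils.c code; the thread-count cap mirrors the min-work-per-thread heuristic quoted in the review below):

```c
#include <stdint.h>
#include <omp.h>

/* Minimal sketch (not the exact utils.c implementation) of a two-pass
   parallel inclusive prefix sum: each thread scans its own chunk, then
   every thread shifts its chunk by the total of all earlier chunks. */
static void parallel_cumsum_sketch(const int64_t *in, int64_t N, int64_t *out)
{
    int nthreads = omp_get_max_threads();
    const int64_t min_N_per_thread = 10000;      /* work-per-thread heuristic */
    if (N / min_N_per_thread < nthreads) nthreads = N / min_N_per_thread;
    if (nthreads < 1) nthreads = 1;

    #pragma omp parallel num_threads(nthreads)
    {
        const int nt = omp_get_num_threads();    /* actual team size */
        const int tid = omp_get_thread_num();
        const int64_t lo = N * tid / nt, hi = N * (tid + 1) / nt;

        /* Pass 1: inclusive scan within this thread's chunk. */
        int64_t running = 0;
        for (int64_t i = lo; i < hi; i++) {
            running += in[i];
            out[i] = running;
        }

        #pragma omp barrier

        /* Each thread's offset is the sum of the totals of all earlier
           chunks, read from the pass-1 results... */
        int64_t offset = 0;
        for (int t = 0; t < tid; t++)
            offset += out[N * (t + 1) / nt - 1];

        /* ...and only after every thread has read them do we update. */
        #pragma omp barrier

        /* Pass 2: shift this chunk by the offset. */
        for (int64_t i = lo; i < hi; i++)
            out[i] += offset;
    }
}
```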

Ready for merge? I'll add a changelog entry.

@lgarrison added this to the v2.4.0 milestone on Jan 11, 2021
@lgarrison
Collaborator Author

(sorry, the merge commits got a little messy at the end there, might want to squash and merge)

@manodeep
Owner

@lgarrison There are a bunch of things that require my immediate attention. I am planning to come back to this later in the week and give it one final checkup before merging. I did have a few related ideas:

  1. Does the #pragma omp simd for work correctly for the loop in parallel_cumsum? I have never used that pragma, but it seems like it might be useful.
  2. That loop over the entire particle list by all threads is bothering me - do you think there is a way to create thread-local lower and upper bounds for the particles within the thread-specific range of cell-indices during the first pass over all particles?
  3. Is it feasible to extend the parallel gridlink to DDtheta? (Think all other pair-counters use the vanilla gridlink, but I could be wrong.)

Thanks again!

@lgarrison
Collaborator Author

> @lgarrison There are a bunch of things that require my immediate attention. I am planning to come back to this later in the week and give it one final checkup before merging. I did have a few related ideas:

Thanks, no worries!

> 1. Does the `#pragma omp simd for` work correctly for the loop in `parallel_cumsum`? I have never used that pragma, but it seems like it might be useful.

Not sure! I can give it a try. It may be hard to measure a difference though; that part of the code is really fast.

> 2. That loop over the entire particle list by all threads is bothering me - do you think there is a way to create thread-local lower and upper bounds for the particles within the thread-specific range of cell-indices during the first pass over all particles?

Yes, I agree, it's a very unusual way to go about parallelization! It was not at all obvious to me that it would work as well as it seems to. I think it's just a hack until we can write a sorting-based method, which would be much more elegant. Since that's the case, I'm inclined not to optimize it further.

> 3. Is it feasible to extend the parallel gridlink to `DDtheta`? (Think all other pair-counters use the vanilla gridlink, but I could be wrong.)

It looks like gridlink_mocks_DOUBLE is used by mocks/DDsmu and mocks/DDrppi, and gridlink_mocks_theta_ra_dec_DOUBLE is used by mocks/DDtheta. I think I am out of time right now to port this over to the other gridlink implementations; we may be happier waiting until we come up with a sorting-based method in any case. I will open an issue to track this.

Thanks again!

No problem! Was fun to write.

Owner

@manodeep left a comment

I still have not understood the parallel sorting :(. I will try again tomorrow

Comment on lines 188 to 192
for(int64_t c = 0; c < totncells; c++){
    for(int t = 1; t < max_nthreads; t++){
        cell_occupation[0][c] += cell_occupation[t][c];
    }
}
Owner

Isn't this also a parallel prefix sum?

Collaborator Author

Almost, it's a reduction into the "thread 0" element for each cell.

Owner

@manodeep commented Jan 21, 2021

Ohh right, of course! Instead of writing into the memory address directly, how about this (hopefully) equivalent formulation:

for(int64_t c = 0; c < totncells; c++) {
    int64_t tsum = 0;
#if defined(_OPENMP)
#pragma omp simd reduction(+:tsum)
#endif
    for(int t = 0; t < max_nthreads; t++){
        tsum += cell_occupation[t][c];
    }
    cell_occupation[0][c] = tsum;
}

Very obviously untested and potentially buggy. Might need the correct settings for nested parallelism.

Collaborator Author

I don't think much of the runtime is going here, so I'm inclined to leave the simpler version. Especially since we hope all this will be nuked by the sorting version!

utils/utils.c Outdated
Comment on lines 868 to 874
int64_t min_N_per_thread = 10000; // heuristic to avoid overheads
if(N/min_N_per_thread < nthreads){
    nthreads = N/min_N_per_thread;
}
if(nthreads < 1){
    nthreads = 1;
}
Owner

Looks like you are changing the number of threads used for the computation, upper-limited by max_threads. But max_threads could be quite large (say 32) while requested nthreads could be 4. Is it possible for the code to use a larger number of threads than requested?

Collaborator Author

OMP_NUM_THREADS and omp_set_num_threads() control omp_get_max_threads(), so this should be safe.
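
A quick illustration of that standard OpenMP behavior (an illustrative snippet, not Corrfunc code):

```c
#include <stdio.h>
#include <omp.h>

/* omp_get_max_threads() reflects the user's request (OMP_NUM_THREADS or
   omp_set_num_threads()), not the physical core count, so a worker count
   capped by it cannot exceed what was asked for. */
int main(void)
{
    omp_set_num_threads(4);                 /* user requests 4 threads */
    printf("%d\n", omp_get_max_threads());  /* prints 4, even on a 32-core machine */
    return 0;
}
```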

Comment on lines +897 to +900
int64_t offset = 0;
for(int t = 0; t < tid; t++){
    offset += cumsum[N*(t+1)/nthreads-1];
}
Owner

Could this be annotated with a #pragma omp simd for reduction(+:offset)?

Owner

Although this is probably an immeasurable amount of time :D

Collaborator Author

Yeah, seems hard to gain much here. I bet that would be safe with the reduction clause, but I also kind of doubt it would do a SIMD transform, given that the memory accesses are non-contiguous.

Owner

@manodeep left a comment

@lgarrison Alright I have taken a final look and all seems to be fine. Left a couple of comments about calloc/malloc function calls and parameters etc.

The one thing that I am not certain about is in the parallel_cumsum function - will the code still work if N is smaller than the number of threads?

@manodeep
Owner

manodeep commented Feb 8, 2021

And will you please run the tests to check for invalid memory usage with the following:

make distclean
export CFLAGS="-fsanitize=leak -fsanitize=undefined -fsanitize=bounds -fsanitize=address -fsanitize-address-use-after-scope -fsanitize-undefined-trap-on-error -fstack-protector-all"
export CLINK="-fsanitize=leak -fsanitize=undefined -fsanitize=bounds -fsanitize=address -fsanitize-address-use-after-scope -fsanitize-undefined-trap-on-error -fstack-protector-all"
make tests

If you are feeling particularly cautious, then feel free to add -DINTEGRATION_TESTS to CFLAGS and CLINK (heads-up: such tests take 2+ days to complete on a single node of our local supercomputer)

@lgarrison
Collaborator Author

Just tested; the core theory and mock tests look clean under the sanitizers.

The Python bindings are proving harder to test with sanitizers. They complain about libasan not being loaded first, and when I fix that with LD_PRELOAD, I end up with a bunch of Python-level errors like:

Direct leak of 425383 byte(s) in 292 object(s) allocated from:
    #0 0x155554a2655f in __interceptor_malloc /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libsanitizer/asan/asan_malloc_linux.cc:144
    #1 0x555555679551 in _PyObject_Malloc /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/obmalloc.c:1628
    #2 0x555555679551 in PyObject_Malloc /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/obmalloc.c:685

They all look like Python-internal things; I don't see any traces related to Corrfunc. I tested the master branch, and it shows the same errors. So I think the errors are unrelated to this PR.

@manodeep
Owner

Ohh, I have never run the memcheck on the Python API. It's not great that there are memory leaks (though they are obviously not relevant to this PR).

@lgarrison All good to squash-merge the PR?

@lgarrison
Collaborator Author

Yes, ready!

@pllim

pllim commented Feb 11, 2021

Re: astropy-bot -- Hello! Please open issues at https://github.com/astropy/astropy-bot for how the bot isn't working for you. I do notice that it is flaky in astropy but it still works most of the time...

@lgarrison
Collaborator Author

@pllim Astropy bot did eventually kick in; I see the checks as passing on this PR. I don't see the usual comment from the bot in this thread, but maybe that's because it happened to be down when we opened the PR. We'll keep an eye out on the next PR!

@lgarrison merged commit 95245a8 into master on Feb 11, 2021
@manodeep
Owner

@lgarrison The gridlink-parallel branch can now be deleted right?

@lgarrison
Collaborator Author

Yep! I'll go ahead and do that.

@lgarrison deleted the gridlink-parallel branch on February 24, 2021