Consolidate buffer packing functions with less atomics #1199

alexrlongne · 2024-10-25T21:31:24Z

PR Summary

Profiling with Nsight Systems revealed that the LoadBuffer_ function in swarm_comms.cpp took 20% of the GPU time in a version of the particles example that used many more particles (1e6 per block). Nsight Compute showed that the expensive kernel in LoadBuffer_ had many stalls; we speculated that these were due to waiting on the atomic_fetch_add call in LoadBuffer_. This PR reworks LoadBuffer_ and CountParticlesToSend_ to remove the atomics. Instead, it sorts the particles by buffer ID and uses their sorted order to determine each buffer's size and each particle's index into its buffer.
Removing the atomics reduces the runtime by about 20% in the million-particle example and by more in a 10-million-particle version. A case where the particles are photons may involve even more cell block crossings, as c*delta_t could be large compared to a block size (in the particles example the particle's path length is equal to a block size).
The atomic was needed in both functions because the loops ran over particles: each iteration determined the particle's buffer and then atomically updated a buffer-specific field.
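The sort-and-discontinuity idea described above can be sketched on the host as follows. This is a minimal illustration, not the code in swarm_comms.cpp: the names Key, Packing, and PackBySorting are hypothetical, a serial std::stable_sort stands in for whatever device sort the PR uses, and every buffer ID is assumed to lie in [0, nbuffers).

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sort key: a particle index paired with its destination buffer ID.
struct Key {
  int particle;  // index of the particle in the swarm
  int sort_key;  // ID of the buffer the particle is sent to
};

struct Packing {
  std::vector<Key> sorted;        // keys ordered by buffer ID
  std::vector<int> buffer_start;  // first sorted position per buffer, -1 if empty
  std::vector<int> buffer_count;  // number of particles per buffer
};

// Assumes every entry of buffer_of_particle is in [0, nbuffers).
Packing PackBySorting(const std::vector<int> &buffer_of_particle, int nbuffers) {
  Packing p;
  const int n = static_cast<int>(buffer_of_particle.size());
  p.sorted.resize(n);
  for (int i = 0; i < n; ++i) p.sorted[i] = Key{i, buffer_of_particle[i]};
  std::stable_sort(p.sorted.begin(), p.sorted.end(),
                   [](const Key &a, const Key &b) { return a.sort_key < b.sort_key; });

  p.buffer_start.assign(nbuffers, -1);
  p.buffer_count.assign(nbuffers, 0);

  // Pass 1: a buffer starts wherever the key differs from its left neighbor.
  // Exactly one position per buffer satisfies this, so no atomic is needed.
  for (int s = 0; s < n; ++s) {
    const int b = p.sorted[s].sort_key;
    if (s == 0 || p.sorted[s - 1].sort_key != b) p.buffer_start[b] = s;
  }
  // Pass 2: a buffer ends wherever the key differs from its right neighbor;
  // its size follows from the end position and the start found in pass 1.
  for (int s = 0; s < n; ++s) {
    const int b = p.sorted[s].sort_key;
    if (s == n - 1 || p.sorted[s + 1].sort_key != b)
      p.buffer_count[b] = s + 1 - p.buffer_start[b];
  }
  return p;
}
```

In this sketch, the particle at sorted position s with buffer ID b would be written to slot s - buffer_start[b] of buffer b. Each write in the two passes targets a distinct element, which is what would let the loops run in parallel without atomic_fetch_add.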
As a rider, this PR also updates some sampling functions in the particles example to use fewer transcendentals, which has shown an improvement in a downstream Parthenon code. I can separate this into another PR if desired.
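The PR text does not say which transcendental calls were removed, so the following is only an illustration of the general technique, assuming an isotropic direction sample. The function name SampleIsotropicDirection and its inputs u and v (uniform random numbers in [0, 1]) are hypothetical. A common formulation computes theta = acos(mu) and then sin(theta); the sketch instead gets sin(theta) from a square root, leaving only one sine and one cosine.

```cpp
#include <array>
#include <cmath>

// Hypothetical helper: maps two uniform samples u, v in [0, 1] to a unit
// direction vector distributed isotropically on the sphere.
std::array<double, 3> SampleIsotropicDirection(double u, double v) {
  constexpr double pi = 3.14159265358979323846;
  const double mu = 2.0 * u - 1.0;  // cos(theta), uniform in [-1, 1]
  // sin(theta) from the identity sin^2 + cos^2 = 1, avoiding acos and sin.
  const double sin_theta = std::sqrt(1.0 - mu * mu);
  const double phi = 2.0 * pi * v;  // azimuthal angle
  return {sin_theta * std::cos(phi), sin_theta * std::sin(phi), mu};
}
```

The square root is valid here because theta lies in [0, pi], where sin(theta) is non-negative.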

Changes

Fold buffer sizing into the load buffer function
Use a sort and a discontinuity check instead of an atomic_fetch_add inside of LoadBuffer_ to get each particle's index into its buffer
Reduce transcendental function calls in particle sourcing in the particles example

PR Checklist

+ Fold in buffer sizing to the load buffer function
+ Use a sort and a discontinuity check instead of an atomic_fetch_add inside of LoadBuffer_ to get particle index into buffer
+ Reduce transcendental functions call in particle sourcing in particle example

src/interface/swarm_comms.cpp

brryan

@alexrlongne This looks great. I think with just formatting, cleanup of the SwarmKey cell_idx_1d name, and removing the dynamic memory allocs, this is ready to go.

+ Call the member variable in SwarmKey the sort_key
+ Remove CountParticlesInBuffer function
+ Add buffer_start and buffer_sorted as swarm member variables

example/particles/particles.cpp

src/interface/swarm_comms.cpp

brryan

Looks great! Just one more query about r^2 vs r^-2

Co-authored-by: Ben Ryan <bryan10@illinois.edu>

…-lab#1199)

* Consolidate buffer packing functions with less atomics
  + Fold in buffer sizing to the load buffer function
  + Use a sort and a discontinuity check instead of an atomic_fetch_add inside of LoadBuffer_ to get particle index into buffer
  + Reduce transcendental functions call in particle sourcing in particle example
* Address PR comments
  + Call the member variable in SwarmKey the sort_key
  + Remove CountParticlesInBuffer function
  + Add buffer_start and buffer_sorted as swarm member variables
* Update example/particles/particles.cpp
  Co-authored-by: Ben Ryan <bryan10@illinois.edu>
* Update src/interface/swarm_comms.cpp
  Co-authored-by: Ben Ryan <bryan10@illinois.edu>

---------

Co-authored-by: Ben Ryan <bryan10@illinois.edu>
Co-authored-by: Ben Ryan <brryan@lanl.gov>

alexrlongne commented Oct 25, 2024

brryan reviewed Oct 28, 2024

brryan requested review from pdmullen, Yurlungur and pgrete October 28, 2024 15:18

Address PR comments

d1e3e37

+ Call the member variable in SwarmKey the sort_key
+ Remove CountParticlesInBuffer function
+ Add buffer_start and buffer_sorted as swarm member variables

alexrlongne force-pushed the along/try_less_atomic branch from f061df6 to d1e3e37 Compare October 29, 2024 17:05

brryan reviewed Oct 29, 2024

brryan approved these changes Oct 29, 2024

alexrlongne and others added 2 commits October 29, 2024 14:37

Update example/particles/particles.cpp

feafba6

Co-authored-by: Ben Ryan <bryan10@illinois.edu>

Update src/interface/swarm_comms.cpp

f029fbd

Co-authored-by: Ben Ryan <bryan10@illinois.edu>

pdmullen approved these changes Oct 30, 2024

Merge branch 'develop' into along/try_less_atomic

69c0a6c

brryan enabled auto-merge (squash) November 1, 2024 02:17

brryan merged commit b5364b7 into develop Nov 1, 2024
50 of 53 checks passed
