Consolidate buffer packing functions with less atomics #1199

Merged Nov 1, 2024 (5 commits)

Conversation

alexrlongne
Collaborator

PR Summary

  • Profiling with Nsight Systems showed that the LoadBuffer_ function in swarm_comms.cpp took 20% of the GPU time in a version of the particles example that used many more particles (1e6 per block). Nsight Compute showed that the expensive kernel in LoadBuffer_ had many stalls, which we speculated were due to waiting on the atomic_fetch_add call in LoadBuffer_. This PR reworks LoadBuffer_ and CountParticlesToSend_ to remove the atomics: it sorts the particles by buffer ID and uses their sorted order to determine each buffer's size and each particle's index into its buffer.

  • Removing the atomics reduces the runtime by about 20% in the million-particle example and by more in a 10-million-particle version. A case where the particles are photons may involve even more cell block crossings, since c*delta_t could be large compared to a block size (in the particles example the particle's path length is equal to a block size).

  • Both functions needed an atomic because they looped over particles: each iteration determined the particle's destination buffer and then used it to atomically update a buffer-specific field.

  • As a rider, this PR also updates some sampling functions in the particle example to use fewer transcendental function calls, a change that has shown improvement in a downstream Parthenon code. I can split this into a separate PR if desired.
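
The sort-based indexing described in the summary can be sketched in plain, serial C++ as follows. This is a minimal illustration under stated assumptions, not the code in swarm_comms.cpp: the names `SortKey`, `BufferLayout`, `SortAndLayout`, and `IndexInBuffer` are hypothetical, the real implementation runs on device rather than with `std::sort`, and every particle here is assumed to have a valid destination buffer. Each loop iteration writes only to a slot owned by one run of equal buffer IDs, so no atomic is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical key: destination buffer plus the particle's index in the swarm.
struct SortKey {
  int buffer_id;     // assumed to lie in [0, nbuffers)
  int particle_idx;  // index of the particle in the swarm
};

// Hypothetical per-buffer layout derived from the sorted keys.
struct BufferLayout {
  std::vector<int> start;  // start[b]: first sorted position belonging to buffer b
  std::vector<int> count;  // count[b]: number of particles bound for buffer b
};

// Sort keys by buffer ID, then find run boundaries with a discontinuity check.
inline BufferLayout SortAndLayout(std::vector<SortKey> &keys, int nbuffers) {
  std::sort(keys.begin(), keys.end(), [](const SortKey &a, const SortKey &b) {
    if (a.buffer_id != b.buffer_id) return a.buffer_id < b.buffer_id;
    return a.particle_idx < b.particle_idx;  // tie-break for a deterministic order
  });

  BufferLayout layout{std::vector<int>(nbuffers, 0), std::vector<int>(nbuffers, 0)};
  const int n = static_cast<int>(keys.size());

  // Pass 1: a position starts a run if its left neighbor has a different buffer ID.
  // Only one position per buffer satisfies this, so each start[b] has one writer.
  for (int i = 0; i < n; ++i) {
    assert(keys[i].buffer_id >= 0 && keys[i].buffer_id < nbuffers);
    if (i == 0 || keys[i].buffer_id != keys[i - 1].buffer_id) {
      layout.start[keys[i].buffer_id] = i;
    }
  }

  // Pass 2: a position ends a run if its right neighbor has a different buffer ID.
  // The run length is the buffer size; again one writer per buffer.
  for (int i = 0; i < n; ++i) {
    if (i == n - 1 || keys[i].buffer_id != keys[i + 1].buffer_id) {
      layout.count[keys[i].buffer_id] = i + 1 - layout.start[keys[i].buffer_id];
    }
  }
  return layout;
}

// A particle's slot within its buffer is its offset from the start of its run.
inline int IndexInBuffer(const std::vector<SortKey> &keys, const BufferLayout &layout,
                         int i) {
  return i - layout.start[keys[i].buffer_id];
}
```

The two passes are kept separate so that the sketch would still be correct if each loop were run in parallel: the second pass reads `start` only after every run start has been written.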

Changes

  • Fold buffer sizing into the load buffer function
  • Use a sort and a discontinuity check instead of an atomic_fetch_add inside of LoadBuffer_ to get each particle's index into its buffer
  • Reduce transcendental function calls in particle sourcing in the particle example
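
As an illustration of the kind of saving the last item refers to, here is a hedged sketch of sampling an isotropic direction two ways. It is an assumption about the general technique, not a copy of the change made in particles.cpp, and the names `Vec3`, `IsotropicDirectionBaseline`, and `IsotropicDirectionReduced` are hypothetical. The baseline samples the polar angle with `acos` and then takes its sine and cosine; the reduced version samples the cosine directly and recovers the sine with a square root.

```cpp
#include <cmath>

constexpr double kPi = 3.14159265358979323846;

struct Vec3 {
  double x, y, z;
};

// Baseline: acos to get theta, then sin/cos of theta and sin/cos of phi.
// u1 and u2 are uniform random numbers in [0, 1].
inline Vec3 IsotropicDirectionBaseline(double u1, double u2) {
  const double theta = std::acos(2.0 * u1 - 1.0);
  const double phi = 2.0 * kPi * u2;
  const double sin_theta = std::sin(theta);
  return {sin_theta * std::cos(phi), sin_theta * std::sin(phi), std::cos(theta)};
}

// Reduced: mu = cos(theta) is sampled directly, so acos, sin(theta), and
// cos(theta) are replaced by one sqrt. Only sin(phi) and cos(phi) remain.
inline Vec3 IsotropicDirectionReduced(double u1, double u2) {
  const double mu = 2.0 * u1 - 1.0;
  const double sin_theta = std::sqrt(1.0 - mu * mu);  // theta in [0, pi], so sin >= 0
  const double phi = 2.0 * kPi * u2;
  return {sin_theta * std::cos(phi), sin_theta * std::sin(phi), mu};
}
```

Both functions return the same unit vector for the same pair of random numbers, up to floating-point rounding.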

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

+ Fold in buffer sizing to the load buffer function
+ Use a sort and a discontinuity check instead of an
atomic_fetch_add inside of LoadBuffer_ to get particle
index into buffer
+ Reduce transcendental functions call in particle sourcing in
particle example
Collaborator

@brryan brryan left a comment

@alexrlongne This looks great. I think with just formatting, cleanup of the SwarmKey cell_idx_1d name, and removing the dynamic memory allocs, this is ready to go.

+ Call the member variable in SwarmKey the sort_key
+ Remove CountParticlesInBuffer function
+ Add buffer_start and buffer_sorted as swarm member variables
@alexrlongne alexrlongne force-pushed the along/try_less_atomic branch from f061df6 to d1e3e37 Compare October 29, 2024 17:05
Collaborator

@brryan brryan left a comment

Looks great! Just one more query about r^2 vs r^-2

alexrlongne and others added 2 commits October 29, 2024 14:37
Co-authored-by: Ben Ryan <bryan10@illinois.edu>
Co-authored-by: Ben Ryan <bryan10@illinois.edu>
@brryan brryan enabled auto-merge (squash) November 1, 2024 02:17
@brryan brryan merged commit b5364b7 into develop Nov 1, 2024
50 of 53 checks passed
acreyes pushed a commit to acreyes/parthenon that referenced this pull request Nov 14, 2024
…-lab#1199)

* Consolidate buffer packing functions with less atomics

+ Fold in buffer sizing to the load buffer function
+ Use a sort and a discontinuity check instead of an
atomic_fetch_add inside of LoadBuffer_ to get particle
index into buffer
+ Reduce transcendental functions call in particle sourcing in
particle example

* Address PR comments

+ Call the member variable in SwarmKey the sort_key
+ Remove CountParticlesInBuffer function
+ Add buffer_start and buffer_sorted as swarm member variables

* Update example/particles/particles.cpp

Co-authored-by: Ben Ryan <bryan10@illinois.edu>

* Update src/interface/swarm_comms.cpp

Co-authored-by: Ben Ryan <bryan10@illinois.edu>

---------

Co-authored-by: Ben Ryan <bryan10@illinois.edu>
Co-authored-by: Ben Ryan <brryan@lanl.gov>
pgrete added a commit that referenced this pull request Feb 10, 2025
* meshdata version of refinement tagging

* got initialization and vector indices working

* fix tensor indices

* added CheckRefinementMesh to fine-advection example

* cleaning up var names

* cleanup

* adding scatter view utilities

* scatterview version of refinement

* burgers-benchmark uses Tag<MeshData>

* add hierarchial par

* cleanup includes

* missed one

* Update CHANGELOG.md

* remove default level tag

* default CheckRefineMesh to true

* respect amr_criteria max_level

* renaming delta_levels->amr_tags, mc->md

* fix refinement/bc order

* docs for CheckRefinementMesh

* move amr_tags array to mesh

* adding comments for ScatterMax view

* it compiles at least

* add easy machinery to register reflecting BCs

* changelog

* swarm bcs differetn from mesh bcs

* use new input block

* typo

* add error checking for swarm/mesh BC consistency

* typo

* phdf diff

* Register reflecting BCs for advection examples

* typo

* silly backwards compatibility thing to make it so you don't have to specify swarm BCs if you're not using swarms

* working

* changelog

* Make everything work

* format

* maybe fix doc issue?

* Address CUDA MPI/ICP issue with Kokkos <=4.4.1 (#1189)

* Jonah's fix for this CI issue

* CHANGELOG

* Remove else from if constexpr when there are returns

* Consolidate buffer packing functions with less atomics (#1199)

* Consolidate buffer packing functions with less atomics

+ Fold in buffer sizing to the load buffer function
+ Use a sort and a discontinuity check instead of an
atomic_fetch_add inside of LoadBuffer_ to get particle
index into buffer
+ Reduce transcendental functions call in particle sourcing in
particle example

* Address PR comments

+ Call the member variable in SwarmKey the sort_key
+ Remove CountParticlesInBuffer function
+ Add buffer_start and buffer_sorted as swarm member variables

* Update example/particles/particles.cpp

Co-authored-by: Ben Ryan <bryan10@illinois.edu>

* Update src/interface/swarm_comms.cpp

Co-authored-by: Ben Ryan <bryan10@illinois.edu>

---------

Co-authored-by: Ben Ryan <bryan10@illinois.edu>
Co-authored-by: Ben Ryan <brryan@lanl.gov>

* [Trivial] Fix type used for array init (#1170)

* Fix type used for array init

* init array with constexpr expression

* CC

---------

Co-authored-by: Jonah Miller <jonahm@lanl.gov>

* Leapfrog fix (#1206)

* Missing send size init

* cleanup, CHANGELOG

* verbose CI

* further CI debugging

* This should be working...

* This should be fixed... but I get a segfault on GPU

* Is it my AMD GPU thats wrong?

* Missing a return statement

* retest

* Oops missing statement

* Revert test

* revert workflow

* removing meshblock amr_criteria

* use pack.UpperBound(b) to check allocation

* remove meshblock first/second derivative from cpp

* linting

---------

Co-authored-by: Jonah Miller <jonah.maxwell.miller@gmail.com>
Co-authored-by: Jonah Miller <jonahm@lanl.gov>
Co-authored-by: Luke Roberts <lfroberts@lanl.gov>
Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>
Co-authored-by: Ben Ryan <brryan@lanl.gov>
Co-authored-by: Adam Dempsey <adempsey@lanl.gov>
Co-authored-by: Alex Long <along@lanl.gov>
Co-authored-by: Ben Ryan <bryan10@illinois.edu>