Sort particles by cell #619
Conversation
#ifdef KOKKOS_ENABLE_CUDA
  thrust::device_ptr<Key> first_d = thrust::device_pointer_cast(data.data()) + min_idx;
  thrust::device_ptr<Key> last_d = thrust::device_pointer_cast(data.data()) + max_idx + 1;
  thrust::sort(first_d, last_d, comparator);
As far as I know we don't use `thrust` anywhere else in Parthenon, but my understanding is that it should always be available when CUDA is available. Still, calling this out explicitly in case anyone has concerns.
What happens on Intel and AMD GPUs? The `std::sort` fallback probably won't work there without additional magic, will it?
What's the issue with `Kokkos::sort`? You mentioned performance, but my quick search in the Kokkos wiki didn't bring up any helpful information.
I think @brryan was scared off by this issue: kokkos/kokkos#2914
For Intel and AMD GPUs I suggest we fall back to `Kokkos::sort`. But I like using `thrust::sort` and `std::sort` on the appropriate hardware for now, until `Kokkos::sort` is more performant.
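The per-hardware dispatch discussed above could be sketched roughly as follows. This is illustrative, not Parthenon's actual code: the `SortByKey` helper name and the `std::vector` container are hypothetical, the macro names are Kokkos', and the `Kokkos::sort` branch is only a placeholder since a real call would need a `Kokkos::View`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical dispatch: thrust on CUDA (it ships with the CUDA toolkit),
// Kokkos::sort as the fallback on other GPU backends, std::sort on
// host-only builds. Sorts data[min_idx..max_idx] inclusive.
template <typename Key, typename Comparator>
void SortByKey(std::vector<Key> &data, std::size_t min_idx, std::size_t max_idx,
               Comparator comparator) {
#if defined(KOKKOS_ENABLE_CUDA)
  thrust::device_ptr<Key> first_d = thrust::device_pointer_cast(data.data()) + min_idx;
  thrust::device_ptr<Key> last_d = thrust::device_pointer_cast(data.data()) + max_idx + 1;
  thrust::sort(first_d, last_d, comparator);
#elif defined(KOKKOS_ENABLE_HIP) || defined(KOKKOS_ENABLE_SYCL)
  // Placeholder: the real fallback would call Kokkos::sort on a View.
#else
  // Host-only build: plain std::sort over the same index range.
  std::sort(data.begin() + min_idx, data.begin() + max_idx + 1, comparator);
#endif
}
```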
How about asking the Kokkos team about the current status and doing some (limited) performance tests covering the intended (immediate) use case ourselves?
My main motivation is to keep the number of different code paths as low as possible (especially given that the other architectures will become more and more available next year).
As I just replied to Jonah, it sounds like the Kokkos team is looking to rewrite their sorting methods, which should both allow for key-value pair sorting like I'm using here and hopefully improve performance. OK if I leave this PR as-is and open an issue to plan the move to `Kokkos::sort` exclusively once it is ready?
To be clear, this PR currently will not work on Intel or AMD GPUs as-is; but is that a broad requirement now?
Following the discussion on the `Kokkos::sort` status, I'm fine with keeping thrust for CUDA for now.
From a functional point of view it'd still be great to call `Kokkos::sort()` if neither `std::sort` nor `thrust` is available, or to throw a meaningful error message (at compile time).
AMD GPUs may be here sooner rather than later, and I'm currently fixing a couple of things to make those work in #629, as AMD GPUs will become very relevant for us this year.
Good point, yes, I should include an error message (done). This might still cause a compiler warning unless the `if (std::is_same<...>)` is a `constexpr if`, but I assume it will be if compiling in C++17 mode?
Searching real quick, it seems to me that AMD GPUs support their Bolt library, which has its own sort function. However, I can't immediately tell whether that is guaranteed to be included in an AMD compilation environment the way thrust is with CUDA. If we need a performant sort on AMD in the near term, that is one possibility.
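For what it's worth, C++17 does not promote a plain `if` on a `std::is_same` condition to `constexpr if` automatically; the keyword has to be spelled out, otherwise both branches are still compiled (and the dead one can warn). A minimal illustration with a hypothetical `SortPathId` function:

```cpp
#include <type_traits>

// With "if constexpr", the untaken branch is discarded at compile time,
// so it is never instantiated for a mismatching T. A plain if() would
// compile both branches for every T.
template <typename T>
int SortPathId(T /*unused*/) {
  if constexpr (std::is_same_v<T, int>) {
    return 1;  // e.g., take one sort path for ints
  } else {
    return 2;  // e.g., take the other path for everything else
  }
}
```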
@brryan thanks for this PR. The overall functionality will definitely be useful downstream.
I left a couple of (high level) comments with respect to the approach(es) chosen.
example/particles/particles.cpp
Outdated
  auto deposit_particles =
      tl.AddTask(destroy_some_particles, DepositParticles, pmb.get());
  auto sort_particles =
      tl.AddTask(destroy_some_particles, SortParticlesIfNecessary, pmb.get());
What defines "is necessary" here? I'm not sure I follow the intention of this naming (in the context of the example).
Sorry, should've responded before: this sort is necessary if we're using the pre-sorted per-cell lists for the non-atomic branch of the subsequent particle deposition call (which assumes those sorted lists exist). `SortParticlesIfNecessaryForDeposition`?
Changed to `SortParticlesIfUsingPerCellDeposition`.
  SwarmKey(const int cell_idx_1d, const int swarm_idx_1d)
      : cell_idx_1d_(cell_idx_1d), swarm_idx_(swarm_idx_1d) {}

  int cell_idx_1d_;
Thinking long run here, I suggest either adding a safety check that nx*ny*nz < max int, or increasing to int64_t.
Yeah, this will probably be a problem eventually; I should've been using `int64_t` from the beginning 😬
Actually, this particular one may always be fine, since at most this should be nx*ny*nz of a meshblock times the number of meshblocks on a processor. We'd need about 1290^3 cells on a single compute processor to cause an issue here. I'll still add the debug check now just for future-proofing.
I'm actually adding this check where `cell_idx_1d` is evaluated, line 520 of src/interface/swarm.cpp, since it will already arrive there as an integer.
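The overflow estimate is easy to check with a small helper (hypothetical name; roughly 1290^3 is where a 32-bit signed index wraps, since 1290^3 ≈ 2.147e9 just fits under INT_MAX while 1291^3 does not):

```cpp
#include <cstdint>
#include <limits>

// Returns true if nx*ny*nz fits in a 32-bit signed int. The product is
// formed in 64 bits, so the check itself cannot overflow.
bool CellCountFitsInt(std::int64_t nx, std::int64_t ny, std::int64_t nz) {
  return nx * ny * nz <= std::numeric_limits<std::int32_t>::max();
}
```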
How about moving the `PARTHENON_DEBUG_REQUIRE` that currently lives inside a parallel region to somewhere outside it (e.g., the constructor) and making it a check that is also triggered without debug mode?
My reasoning is that the MeshBlock size won't change during a simulation, so we only need the check once, and that it's rather unlikely anyone will run a problem at scale (where the issue will show up first) in debug mode.
PS: I don't think that 1k^3 is that far off. We can already fit multi-stage hydrodynamics at 512^3 on a single A100 with 40G. So if I take hydro out and just do pure n-body (i.e., reduce the number of cell-centered variables needed) and take the next-gen GPU with >=128G memory, we may get there sooner rather than later.
Good point about the lack of debug mode at scale. I just moved the `PARTHENON_REQUIRE` up out of the `KOKKOS_LAMBDA` but kept it in the same function, so it's clear where the problem is; I don't think that should hurt performance (and the constructor itself doesn't have a meshblock pointer, so the check would have to be in `SetBlockPointer` and then might not be propagated correctly if that part of the code is refactored).
PS: The future is arriving sooner than I expected :)
@brryan what's the status of this?
@Yurlungur Thanks for following up. I responded to the queries above, and if you, @pgrete, and the CI are happy, I think this is ready.
I'm happy. But I'll wait for @pgrete to approve as well.
I left two minor comments as replies to previous threads and have one additional question that I missed in the first review:
Is the new deposition method tested somewhere, i.e., that it results in the same distribution as the original method?
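A test along the lines being asked for could compare the two deposition outputs elementwise with a relative tolerance, since the per-cell path reorders floating-point accumulation. This is a hedged sketch; `DepositionsAgree` is a hypothetical helper, not part of Parthenon:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check: the per-cell (sorted) deposition should reproduce
// the atomic deposition up to floating-point reordering of the sums.
bool DepositionsAgree(const std::vector<double> &atomic_dep,
                      const std::vector<double> &per_cell_dep, double rel_tol) {
  if (atomic_dep.size() != per_cell_dep.size()) return false;
  for (std::size_t i = 0; i < atomic_dep.size(); ++i) {
    // Scale the tolerance by the magnitude of the value (floor of 1.0
    // so near-zero cells use an absolute tolerance).
    const double scale = std::max(std::abs(atomic_dep[i]), 1.0);
    if (std::abs(atomic_dep[i] - per_cell_dep[i]) > rel_tol * scale) return false;
  }
  return true;
}
```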
@pgrete How do you feel about approving this? Looking through the comments just now, I think I closed the loop on everything, but I didn't mark anything resolved in case I'm mistaken.
Thanks for the updates. Looks good. I will merge now.
PR Summary
We often want to be able to quickly loop through all the particles in a particular cell. By default, particles live in per-meshblock lists, and these will not in general be ordered by zone. This PR introduces a method `SwarmContainer::SortParticlesByCell` that uses a 1D sort operation to populate a map from cell index `i,j,k` to a list of per-meshblock indices of all particles in that cell. The `sort` routine is provided separately and uses separate implementations for CPU and GPU. For the GPU, it relies on `thrust` (Kokkos provides a `sort` routine, but from some open issues on that repo it appears to be slow, especially on the GPU).
PR Checklist
Add sorting to `Swarm`, and demonstrate usage in the `particles` example.
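The mechanism the summary describes — sort (cell, particle) key pairs once, then read off contiguous per-cell runs — can be sketched in plain host C++. The `Key` struct mirrors the `SwarmKey` shown in the diff, but the `ParticlesByCell` helper and its `std::map` return type are illustrative only; the real implementation runs on device inside `Swarm`:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// SwarmKey-like pair: flattened 1D cell index plus the particle's
// per-meshblock index.
struct Key {
  int cell_idx_1d;
  int swarm_idx;
};

// Sort keys by cell, then group: after the sort, all particles in a
// given cell sit contiguously, so a single pass builds the map from
// cell index to the list of particle indices in that cell.
std::map<int, std::vector<int>> ParticlesByCell(std::vector<Key> keys) {
  std::sort(keys.begin(), keys.end(),
            [](const Key &a, const Key &b) { return a.cell_idx_1d < b.cell_idx_1d; });
  std::map<int, std::vector<int>> by_cell;
  for (const Key &k : keys) by_cell[k.cell_idx_1d].push_back(k.swarm_idx);
  return by_cell;
}
```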