Sort particles by cell #619
Conversation
#ifdef KOKKOS_ENABLE_CUDA
  thrust::device_ptr<Key> first_d = thrust::device_pointer_cast(data.data()) + min_idx;
  thrust::device_ptr<Key> last_d = thrust::device_pointer_cast(data.data()) + max_idx + 1;
  thrust::sort(first_d, last_d, comparator);
As far as I know we don't use `thrust` anywhere else in Parthenon, but my understanding is that it should always be available when CUDA is available. Still, calling this out explicitly in case anyone has concerns.
What happens on Intel and AMD GPUs? The `std::sort` fallback probably won't work there without additional magic, will it?
What's the issue with `Kokkos::sort`? You mentioned performance, but my quick search in the Kokkos wiki didn't bring up any helpful information.
I think @brryan was scared off by this issue: kokkos/kokkos#2914
For Intel and AMD GPUs I suggest we fall back to `Kokkos::sort`. But I like using `thrust::sort` and `std::sort` on the appropriate hardware for now, until `Kokkos::sort` is more performant.
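The per-hardware dispatch discussed above could be sketched roughly as follows. This is illustrative, not Parthenon's actual code: the `SortByKey` helper name and the `std::vector` container are hypothetical, the macro names are Kokkos', and the `Kokkos::sort` branch is only a placeholder since a real call would need a `Kokkos::View`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical dispatch: thrust on CUDA (it ships with the CUDA toolkit),
// Kokkos::sort as the fallback on other GPU backends, std::sort on
// host-only builds. Sorts data[min_idx..max_idx] inclusive.
template <typename Key, typename Comparator>
void SortByKey(std::vector<Key> &data, std::size_t min_idx, std::size_t max_idx,
               Comparator comparator) {
#if defined(KOKKOS_ENABLE_CUDA)
  thrust::device_ptr<Key> first_d = thrust::device_pointer_cast(data.data()) + min_idx;
  thrust::device_ptr<Key> last_d = thrust::device_pointer_cast(data.data()) + max_idx + 1;
  thrust::sort(first_d, last_d, comparator);
#elif defined(KOKKOS_ENABLE_HIP) || defined(KOKKOS_ENABLE_SYCL)
  // Placeholder: the real fallback would call Kokkos::sort on a View.
#else
  // Host-only build: plain std::sort over the same index range.
  std::sort(data.begin() + min_idx, data.begin() + max_idx + 1, comparator);
#endif
}
```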
How about asking the Kokkos team about the current status and doing some (limited) performance tests covering the intended (immediate) use case ourselves?
My main motivation is to keep the number of different code paths as low as possible (especially given that the other architectures will become more and more available next year).
As I just replied to Jonah, it sounds like the Kokkos team is looking to rewrite their sorting methods, which should both allow for key-value pair sorting like I'm using here and hopefully improve performance. OK if I leave this PR as-is and open an issue to plan the move to `Kokkos::sort` exclusively once it is ready?
To be clear, this PR currently will not work on Intel or AMD GPUs as-is; but is that a broad requirement now?
Following the discussion on the `Kokkos::sort` status, I'm fine with keeping thrust for CUDA for now.
From a functional point of view it'd still be great to call `Kokkos::sort()` if neither `std::sort` nor `thrust` is available, or to throw a meaningful error message (at compile time).
AMD GPUs may be here sooner rather than later, and I'm currently fixing a couple of things to make those work in #629, as AMD GPUs will become very relevant for us this year.
Good point, yes, I should include an error message (done). This might still cause a compiler warning unless the `if (std::is_same<...>)` is a `constexpr if`, but I assume it will be if compiling in C++17 mode?
Searching real quick, it seems to me that AMD GPUs support their Bolt library, which has its own sort function. However, I can't immediately tell whether that is guaranteed to be included in an AMD compilation environment the way thrust is with CUDA. If we need a performant sort on AMD in the near term, that is one possibility.
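For what it's worth, C++17 does not promote a plain `if` on a `std::is_same` condition to `constexpr if` automatically; the keyword has to be spelled out, otherwise both branches are still compiled (and the dead one can warn). A minimal illustration with a hypothetical `SortPathId` function:

```cpp
#include <type_traits>

// With "if constexpr", the untaken branch is discarded at compile time,
// so it is never instantiated for a mismatching T. A plain if() would
// compile both branches for every T.
template <typename T>
int SortPathId(T /*unused*/) {
  if constexpr (std::is_same_v<T, int>) {
    return 1;  // e.g., take one sort path for ints
  } else {
    return 2;  // e.g., take the other path for everything else
  }
}
```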
@brryan thanks for this PR. The overall functionality will definitely be useful downstream.
I left a couple of (high level) comments with respect to the approach(es) chosen.
example/particles/particles.cpp
Outdated
  auto deposit_particles =
      tl.AddTask(destroy_some_particles, DepositParticles, pmb.get());
  auto sort_particles =
      tl.AddTask(destroy_some_particles, SortParticlesIfNecessary, pmb.get());
What defines "is necessary" here? I'm not sure I follow the intention of this naming (in the context of the example).
Sorry, should've responded before: this sort is necessary if we're using the pre-sorted per-cell lists for the non-atomic branch of the subsequent particle deposition call (which assumes those sorted lists exist). `SortParticlesIfNecessaryForDeposition`?
Changed to `SortParticlesIfUsingPerCellDeposition`.
  SwarmKey(const int cell_idx_1d, const int swarm_idx_1d)
      : cell_idx_1d_(cell_idx_1d), swarm_idx_(swarm_idx_1d) {}

  int cell_idx_1d_;
Thinking long run here, I suggest either adding a safety check that nx*ny*nz < max int, or increasing to int64_t.
Yeah, this will probably be a problem eventually; I should've been using `int64_t` from the beginning 😬
Actually, this particular one may always be fine, since at most this should be nx*ny*nz of a meshblock times the number of meshblocks on a processor. We'd need about 1290^3 cells on a single compute processor to cause an issue here. I'll still add the debug check now just for future-proofing.
I'm actually adding this check where `cell_idx_1d` is evaluated, line 520 of src/interface/swarm.cpp, since it will already arrive there as an integer.
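The overflow estimate is easy to check with a small helper (hypothetical name; roughly 1290^3 is where a 32-bit signed index wraps, since 1290^3 ≈ 2.147e9 just fits under INT_MAX while 1291^3 does not):

```cpp
#include <cstdint>
#include <limits>

// Returns true if nx*ny*nz fits in a 32-bit signed int. The product is
// formed in 64 bits, so the check itself cannot overflow.
bool CellCountFitsInt(std::int64_t nx, std::int64_t ny, std::int64_t nz) {
  return nx * ny * nz <= std::numeric_limits<std::int32_t>::max();
}
```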
How about moving the `PARTHENON_DEBUG_REQUIRE` that currently lives inside a parallel region to somewhere outside it (e.g., the constructor) and making it a check that is also triggered without debug mode?
My reasoning is that the MeshBlock size won't change during a simulation, so we only need the check once, and that it's rather unlikely anyone will run a problem at scale (where the issue will show up first) in debug mode.
PS: I don't think that 1k^3 is that far off. We can already fit multi-stage hydrodynamics at 512^3 on a single A100 with 40G. So if I take hydro out and just do pure n-body (i.e., reduce the number of cell-centered variables needed) and take the next-gen GPU with >=128G memory, we may get there sooner rather than later.
Good point about the lack of debug mode at scale. I just moved the `PARTHENON_REQUIRE` up out of the `KOKKOS_LAMBDA` but kept it in the same function, so it's clear where the problem is; I don't think that should hurt performance (and the constructor itself doesn't have a meshblock pointer, so the check would have to be in `SetBlockPointer` and then might not be propagated correctly if that part of the code is refactored).
PS: The future is arriving sooner than I expected :)
@brryan what's the status of this?
@Yurlungur Thanks for following up. I responded to the queries above, and if you, @pgrete, and the CI are happy, I think this is ready.
I'm happy. But I'll wait for @pgrete to approve as well.
I left two minor comments as replies to previous threads and have one additional question that I missed in the first review:
Is the new deposition method tested somewhere, i.e., that it results in the same distribution as the original method?
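A test along the lines being asked for could compare the two deposition outputs elementwise with a relative tolerance, since the per-cell path reorders floating-point accumulation. This is a hedged sketch; `DepositionsAgree` is a hypothetical helper, not part of Parthenon:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check: the per-cell (sorted) deposition should reproduce
// the atomic deposition up to floating-point reordering of the sums.
bool DepositionsAgree(const std::vector<double> &atomic_dep,
                      const std::vector<double> &per_cell_dep, double rel_tol) {
  if (atomic_dep.size() != per_cell_dep.size()) return false;
  for (std::size_t i = 0; i < atomic_dep.size(); ++i) {
    // Scale the tolerance by the magnitude of the value (floor of 1.0
    // so near-zero cells use an absolute tolerance).
    const double scale = std::max(std::abs(atomic_dep[i]), 1.0);
    if (std::abs(atomic_dep[i] - per_cell_dep[i]) > rel_tol * scale) return false;
  }
  return true;
}
```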
@pgrete How do you feel about approving this? Looking through the comments just now, I think I closed the loop on everything, but I didn't mark anything resolved in case I'm mistaken.
Thanks for the updates. Looks good. I will merge now.
PR Summary
We often want to be able to quickly loop through all the particles in a particular cell. By default, particles live in per-meshblock lists, and these will not in general be ordered by zone. This PR introduces a method `SwarmContainer::SortParticlesByCell` that uses a 1D sort operation to populate a map from cell index `i,j,k` to a list of per-meshblock indices of all particles in that cell. The `sort` routine is provided separately and uses separate implementations for CPU and GPU. For the GPU, it relies on `thrust` (Kokkos provides a `sort` routine, but from some open issues on that repo it appears to be slow, especially on the GPU).
PR Checklist
Add sorting to `Swarm`, and demonstrate usage in the `particles` example.
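The mechanism the summary describes — sort (cell, particle) key pairs once, then read off contiguous per-cell runs — can be sketched in plain host C++. The `Key` struct mirrors the `SwarmKey` shown in the diff, but the `ParticlesByCell` helper and its `std::map` return type are illustrative only; the real implementation runs on device inside `Swarm`:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// SwarmKey-like pair: flattened 1D cell index plus the particle's
// per-meshblock index.
struct Key {
  int cell_idx_1d;
  int swarm_idx;
};

// Sort keys by cell, then group: after the sort, all particles in a
// given cell sit contiguously, so a single pass builds the map from
// cell index to the list of particle indices in that cell.
std::map<int, std::vector<int>> ParticlesByCell(std::vector<Key> keys) {
  std::sort(keys.begin(), keys.end(),
            [](const Key &a, const Key &b) { return a.cell_idx_1d < b.cell_idx_1d; });
  std::map<int, std::vector<int>> by_cell;
  for (const Key &k : keys) by_cell[k.cell_idx_1d].push_back(k.swarm_idx);
  return by_cell;
}
```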