Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort particles by cell #619

Merged
merged 17 commits into from
Jan 17, 2022
Merged

Sort particles by cell #619

merged 17 commits into from
Jan 17, 2022

Conversation

brryan
Copy link
Collaborator

@brryan brryan commented Dec 7, 2021

PR Summary

We often want to be able to quickly loop through all the particles in a particular cell. By default, particles live in per-meshblock lists and these will not in general be ordered by zone. This PR introduces a method SwarmContainer::SortParticlesByCell that uses a 1D sort operation to populate a map from cell index i,j,k to a list of per-meshblock indices of all particles in that cell.

The sort routine is provided separately and uses separate implementations for CPU and GPU. For the GPU, it relies on thrust (Kokkos provides a sort routine but from some open issues on that repo it appears to be slow, especially on the GPU).

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • (@lanl.gov employees) Update copyright on changed files
  • Add API for accessing this sorted list from outside Swarm, and demonstrate usage in the particles example.

@brryan brryan added the enhancement New feature or request label Dec 7, 2021
@brryan brryan self-assigned this Dec 7, 2021
src/interface/swarm.cpp Show resolved Hide resolved
#ifdef KOKKOS_ENABLE_CUDA
thrust::device_ptr<Key> first_d = thrust::device_pointer_cast(data.data()) + min_idx;
thrust::device_ptr<Key> last_d = thrust::device_pointer_cast(data.data()) + max_idx + 1;
thrust::sort(first_d, last_d, comparator);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know we don't use thrust anywhere else in parthenon, but my understanding is this should always be available when CUDA is available. Still, calling this out explicitly in case anyone has concerns.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens on Intel and AMD GPUs? The std::sort probably won't work without additional magic, does it?
What's the issue with Kokkos sort? You mentioned performance but my quick search in the Kokkos wiki didn't bring up any helpful information.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @brryan was scared off by this issue:
kokkos/kokkos#2914

For intel and AMD GPUs I suggest we fall back to Kokkos::sort. But I like using thrust::sort and std::sort on appropriate hardware for now, until Kokkos::sort is more performant.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about asking the Kokkos team about the current status and doing some (limited) performance tests covering the intended (immediate) use case ourselves?
My main motivation is to keep the amount of different code paths as low as possible (especially given that the other architecture will become more and more available next year).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I just replied to Jonah, it sounds like the Kokkos team is looking to rewrite their sorting methods, which should both allow for key-value pair sorting like I'm using here and hopefully improve performance. OK if I leave this PR as-is and open an issue to plan to move to Kokkos::sort exclusively when it is ready?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear this PR currently will not work for Intel or AMD GPUs as is, but is that a broad requirement now?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the discussion on the Kokkos::sort status, I'm fine with keeping thrust for Cuda for now.
From a functional point of view it'd still be great to call Kokkos::sort() if neither std::sort nor thrust is availabe, or throw a meaningful error message (at compile time).
AMD GPUs may be there sooner than later and I'm currently fixing a couple of things to make those work in #629 as AMD GPUs will become very relevant for us this year.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, yes, I should include an error message (done). This might still cause a compiler warning unless the if(std::is_same<... is a constexpr if, but I assume it will be if compiling in C++17 mode?

Searching real quick, it seems to me that AMD GPUs support their Bolt library, which has its own sort function? However, I can't immediately tell whether that is guaranteed to be included in an AMD compilation environment like thrust is with CUDA. However, if we need a performant sort on AMD in the near term, that is one possibility

@brryan brryan requested review from pgrete and Yurlungur December 7, 2021 18:26
@brryan brryan changed the title WIP: Sort particles by cell Sort particles by cell Dec 7, 2021
Copy link
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brryan thanks for this PR. The overall functionality will definitely be useful downstream.
I left a couple of (high level) comments with respect to the approach(es) chosen.

example/particles/particles.cpp Outdated Show resolved Hide resolved
example/particles/particles.cpp Show resolved Hide resolved
auto deposit_particles =
tl.AddTask(destroy_some_particles, DepositParticles, pmb.get());
auto sort_particles =
tl.AddTask(destroy_some_particles, SortParticlesIfNecessary, pmb.get());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What defines "is necessary" here? I'm not sure I follow the intention of this naming (in the context of the example).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, should've responded before, this sort is necessary if we're using the pre-sorted per-cell lists for the non-atomic branch of the subsequent particle deposition call (which assumes those sorted lists exist). SortParticlesIfNecessaryForDeposition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to SortParticlesIfUsingPerCellDeposition

SwarmKey(const int cell_idx_1d, const int swarm_idx_1d)
: cell_idx_1d_(cell_idx_1d), swarm_idx_(swarm_idx_1d) {}

int cell_idx_1d_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking long run here, I suggest to add either a safety count that nxnynz < max int, or increase to int64_t.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this will probably be a problem eventually, I should've been using int64_t from the beginning 😬

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this particular one may always be fine since at most this should be nx*ny*nz of a meshblock times the number of meshblocks on a processor. We'd need about 1200^3 cells on a single compute processor to cause an issue here. I'll still add the debug check now just for future-proofing.

I'm actually adding this check where cell_idx_1d is evaluated, line 520 of src/interface/swarm.cpp, since it will already arrive here as an integer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving the PARTHENON_DEBUG_REQUIRE that currently lives inside a parallel region to somewhere outside (e.g., the constructor) and make it a check that is also triggered without debug mode?
My reasoning is that the MeshBlock size won't change for a simulation, so we need the check only once, and that it's rather unlikely to run a problem at scale (where the issue will show first) in debug mode.

PS: I don't think that 1k^3 is that far off. We can already fit multi-stage hydrodynamics at 512^3 on a single A100 with 40G.
So if I take hydro out and just do pure n-body (i.e., reducing the number of cell centered variables needed) and take the next gen GPU with >=128G memory, we may get there sooner than later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about lack of debug mode at scale -- I just moved the PARTHENON_REQUIRE up out of the KOKKOS_LAMBDA but in the same function so it's clear where the problem is -- I don't think that should hurt performance (and the constructor itself doesn't have a meshblock pointer so the check would have to be in SetBlockPointer and then might not be propagated correctly if that part of the code is refactored).

PS The future is arriving sooner than I expected :)

src/interface/swarm_container.cpp Outdated Show resolved Hide resolved
#ifdef KOKKOS_ENABLE_CUDA
thrust::device_ptr<Key> first_d = thrust::device_pointer_cast(data.data()) + min_idx;
thrust::device_ptr<Key> last_d = thrust::device_pointer_cast(data.data()) + max_idx + 1;
thrust::sort(first_d, last_d, comparator);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens on Intel and AMD GPUs? The std::sort probably won't work without additional magic, does it?
What's the issue with Kokkos sort? You mentioned performance but my quick search in the Kokkos wiki didn't bring up any helpful information.

src/interface/swarm.cpp Show resolved Hide resolved
Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @brryan I'm pleased how simple this came out. For the most part, I second @pgrete 's suggestions but overall am happy with this as it is now.

example/particles/particles.cpp Outdated Show resolved Hide resolved
src/interface/swarm.cpp Show resolved Hide resolved
src/interface/swarm.cpp Show resolved Hide resolved
src/utils/sort.hpp Show resolved Hide resolved
src/utils/sort.hpp Show resolved Hide resolved
@Yurlungur
Copy link
Collaborator

@brryan what's the status of this?

@brryan
Copy link
Collaborator Author

brryan commented Jan 6, 2022

@Yurlungur Thanks for following up, I responded to the queries above and if you, @pgrete, and the CI are happy I think this is ready

@Yurlungur
Copy link
Collaborator

@Yurlungur Thanks for following up, I responded to the queries above and if you, @pgrete, and the CI are happy I think this is ready

I'm happy. But I'll wait for @pgrete to approve as well.

Copy link
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two minor comments as replies to previous threads and have one additional question that I missed in the first review:
Is the new deposition method tested somewhere,i.e, that it results in the same distribution as the original method?

SwarmKey(const int cell_idx_1d, const int swarm_idx_1d)
: cell_idx_1d_(cell_idx_1d), swarm_idx_(swarm_idx_1d) {}

int cell_idx_1d_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving the PARTHENON_DEBUG_REQUIRE that currently lives inside a parallel region to somewhere outside (e.g., the constructor) and make it a check that is also triggered without debug mode?
My reasoning is that the MeshBlock size won't change for a simulation, so we need the check only once, and that it's rather unlikely to run a problem at scale (where the issue will show first) in debug mode.

PS: I don't think that 1k^3 is that far off. We can already fit multi-stage hydrodynamics at 512^3 on a single A100 with 40G.
So if I take hydro out and just do pure n-body (i.e., reducing the number of cell centered variables needed) and take the next gen GPU with >=128G memory, we may get there sooner than later.

#ifdef KOKKOS_ENABLE_CUDA
thrust::device_ptr<Key> first_d = thrust::device_pointer_cast(data.data()) + min_idx;
thrust::device_ptr<Key> last_d = thrust::device_pointer_cast(data.data()) + max_idx + 1;
thrust::sort(first_d, last_d, comparator);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the discussion on the Kokkos::sort status, I'm fine with keeping thrust for Cuda for now.
From a functional point of view it'd still be great to call Kokkos::sort() if neither std::sort nor thrust is availabe, or throw a meaningful error message (at compile time).
AMD GPUs may be there sooner than later and I'm currently fixing a couple of things to make those work in #629 as AMD GPUs will become very relevant for us this year.

@brryan
Copy link
Collaborator Author

brryan commented Jan 17, 2022

@pgrete How do you feel about approving this? Looking through the comments just now I think I closed the loop on everything, but I didn't mark anything resolved in case I'm mistaken

Copy link
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. Looks good. I will merge now.

@pgrete pgrete merged commit c40e78a into develop Jan 17, 2022
@Yurlungur Yurlungur deleted the brryan/sort_particles branch December 15, 2022 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants