The Kokkos Lectures: Module 4 Q&A
Daniel Arndt edited this page Aug 12, 2020 · 2 revisions
- It's not allocating memory or initializing so it's rather cheap.
Is Kokkos::fence() equivalent to cudaDeviceSynchronize() or cudaStreamSynchronize() if the execution space is Cuda?
- Kokkos::fence() fences all spaces, so it's cudaDeviceSynchronize(). On the other hand, you can create instances of execution spaces and call fence() on these; that corresponds to cudaStreamSynchronize(). We will cover streams in module 5.
- For the moment, finding an optimal value requires trial and error, but autotuning for this is coming in release 3.3.
- There is currently no special value but you can use tile sizes equal to your iteration bounds if you don’t want any tiling.
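The question for this answer is missing from the page, so the following is only a sketch under the assumption that it refers to the tile sizes of a multidimensional range. It is plain C++, not Kokkos code, and the names `TileStats` and `tiled_traversal` are hypothetical. It shows why a tile equal to the iteration bounds amounts to no tiling: the outer tile loops run exactly once and the inner loops cover the whole range.

```cpp
#include <algorithm>

// Hypothetical result type for this sketch.
struct TileStats {
  int tiles;   // number of tiles visited
  int points;  // number of (i, j) points visited
};

// Serial emulation of a 2D tiled traversal over [0, n0) x [0, n1)
// with tile extents t0 x t1. Partial tiles at the edges are clipped.
inline TileStats tiled_traversal(int n0, int n1, int t0, int t1) {
  TileStats stats{0, 0};
  for (int i0 = 0; i0 < n0; i0 += t0) {
    for (int j0 = 0; j0 < n1; j0 += t1) {
      ++stats.tiles;
      const int i_end = std::min(i0 + t0, n0);
      const int j_end = std::min(j0 + t1, n1);
      for (int i = i0; i < i_end; ++i) {
        for (int j = j0; j < j_end; ++j) {
          ++stats.points;  // the loop body would go here
        }
      }
    }
  }
  return stats;
}
```

With bounds 8 x 6, `tiled_traversal(8, 6, 4, 3)` visits four tiles, while `tiled_traversal(8, 6, 8, 6)` visits a single tile; both touch all 48 points.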
- Yes, in general, it is.
- No, it doesn’t even have the same semantics as dynamic parallelism: parallel_for on TeamThreadRange is not a fork/join operation.
Are there multidimensional versions of TeamPolicy and TeamThreadRange? (e.g. if I have 5 nested loops and want a 2d-grid of teams doing 3-dimensional loops)
- No, not yet, but it's not too hard to do manually. For 5 nested loops, you need to tell us which level of the hierarchy corresponds to a shared scratch domain (TeamThreadRange) and which level corresponds to a vector-like instruction (ThreadVectorRange).
Would it be performance-portable to manually convert 1D-indices to 2D/3D in a row/column-major fashion? (i=idx/n, j=idx%n)
- Yes, for Teams that usually works well.
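The manual conversion from the two answers above can be sketched in plain C++. This is an illustration under assumptions, not Kokkos code: the helper names (`Index2D`, `Index3D`, `unflatten2d`, `unflatten3d`, `covers_every_point_once`) are hypothetical, and the parallel loops are emulated serially. A flat "league" index stands in for the team rank (2D grid of teams) and a flat per-team index stands in for the TeamThreadRange index (3D loop).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper types for this sketch.
struct Index2D { int i, j; };
struct Index3D { int i, j, k; };

// Row-major 2D: idx = i * n1 + j, so j varies fastest.
inline Index2D unflatten2d(int idx, int n1) {
  return {idx / n1, idx % n1};
}

// Row-major 3D: idx = (i * n1 + j) * n2 + k, so k varies fastest.
inline Index3D unflatten3d(int idx, int n1, int n2) {
  const int k = idx % n2;
  const int ij = idx / n2;
  return {ij / n1, ij % n1, k};
}

// Serial emulation of the 5-loop case: n0 x n1 teams, each running an
// m0 x m1 x m2 loop. Returns true if every 5D point is visited exactly once.
inline bool covers_every_point_once(int n0, int n1, int m0, int m1, int m2) {
  std::vector<int> hits(static_cast<std::size_t>(n0) * n1 * m0 * m1 * m2, 0);
  for (int league = 0; league < n0 * n1; ++league) {    // would be the team rank
    const Index2D t = unflatten2d(league, n1);
    for (int idx = 0; idx < m0 * m1 * m2; ++idx) {      // would be the team-level range
      const Index3D p = unflatten3d(idx, m1, m2);
      const int flat = (((t.i * n1 + t.j) * m0 + p.i) * m1 + p.j) * m2 + p.k;
      ++hits[static_cast<std::size_t>(flat)];
    }
  }
  for (int h : hits) {
    if (h != 1) return false;
  }
  return true;
}
```

For example, `unflatten2d(7, 3)` yields (2, 1) and `unflatten3d(23, 3, 4)` yields (1, 2, 3). Swapping the roles of `/` and `%` gives the column-major variant.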
What is the overhead of 3-level parallelism with respect to the GPU thread register pressure? The teamMember argument needs some registers, doesn't it?
- It's pretty low. Much of the team member argument can be constant-propagated and doesn't need many registers.
Is the "everybody gets it" what makes things different between single and just having member 0 do something?
- No, for single, just having member 0 do it is pretty similar. With single, we are allowed to do certain special things, like having the first thread that gets there do it instead.
If the outer loop is parallel_for, do you still need to protect the last step (now, just writing the reduction result, rather than adding it)?
- The reason you're protecting the last step is so that all of the threads don’t do it redundantly. Any time that a redundant thing would happen on each thread and lead to the wrong result is when you need single. On the other hand, redundant writes are often faster than using a single because you avoid a branch instruction and the hardware can often eliminate redundant writes if they're coalesced.
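The distinction between a harmless redundant write and a harmful redundant update can be illustrated with a serial emulation in plain C++. This is a sketch under assumptions, not Kokkos code: each loop iteration plays the role of one thread in a team, every "thread" is assumed to already hold the same team-wide reduction result `team_sum`, and the function names are hypothetical.

```cpp
// Every thread stores the same value: redundant, but the result is still correct.
inline double redundant_write(int team_size, double team_sum) {
  double out = 0.0;
  for (int t = 0; t < team_size; ++t) {
    out = team_sum;
  }
  return out;
}

// Every thread adds the value: the contribution is counted team_size times,
// which is the wrong result.
inline double redundant_add(int team_size, double team_sum, double initial) {
  double out = initial;
  for (int t = 0; t < team_size; ++t) {
    out += team_sum;
  }
  return out;
}

// Only one thread performs the update, which is the role a single plays.
inline double guarded_add(int team_size, double team_sum, double initial) {
  double out = initial;
  for (int t = 0; t < team_size; ++t) {
    if (t == 0) {
      out += team_sum;
    }
  }
  return out;
}
```

With a team of 4, a team sum of 10 and an initial value of 1, the redundant write gives 10, the redundant add gives 41, and the guarded add gives the intended 11.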
- It’s just memory, so you can cast it any way you want.
- It's in flat HBM mode.
- We've looked at cooperative groups but they don’t help with the locking problem, though the deadlock issue isn’t a problem on pretty much any GPU that supports cooperative groups anyway.