-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement independent_groups and independent_group_elements for Alpaka kernels #43624
Conversation
- `independent_groups(acc, groups)` returns a range than spans the group indices from 0 to `groups`, with one group per block; - `independent_group_elements(acc, elements)` returns a range that spans all the elements within the given group, from 0 to `elements`.
please test |
cms-bot internal usage |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43624/38271
|
A new Pull Request was created by @fwyzard (Andrea Bocci) for master. It involves the following packages:
@makortel, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81e25c/36618/summary.html Comparison SummarySummary:
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43624/38273
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43624/38276
|
please test |
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
enable gpu |
please test |
@antoniovilela @rappoccio, if @makortel does not have any comments, can you pick this for pre2 ? |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81e25c/36634/summary.html Comparison SummarySummary:
GPU Comparison SummarySummary:
|
+1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few remarks that can be addressed later.
I'd also suggest to provide a concise recipe-like documentation, e.g. in HeterogeneousCore/AlpakaInterface/README.md
, that would list the (most common) use cases, and what loop structure to use in each case (with link(s) to more details, e.g. the comments in this header).
* If the work division has more than 63 blocks, the first 63 will perform one iteration of the loop, while the other | ||
* blocks will exit immediately. | ||
* If the work division has less than 63 blocks, some of the blocks will perform more than one iteration, in order to | ||
* cover then whole problem space. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further above had
All threads in a block see the same loop iterations
I think it would be good to clarify what this means for the behavior of the elements of the last group (given how important this is for syncBlockThreads()
).
In this example I'd guess the threads 0-7 of the block executing the group 62 would run the loop body, and the threads 8-15 would not (i.e. in a way the threads 8-15 do not see the same loop iterations)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's as the comment says: all threads (0-15) will execute the loop body.
It's up to the inner loop (or the user, if they don't use an inner loop) to run only on a subset of the threads.
In fact,
for (auto group: uniform_groups(acc, 1000) { ... }
should be exactly the same as
for (auto group: independent_groups(acc, round_up_by(1000, alpaka::getWorkDiv<alpaka::Grid, alpaka::Elems>)) { ... }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, this loop is about the groups/blocks, and not about elements, thanks.
So if one would do
for (auto group : uniform_groups(acc, 1000) {
for (auto element : uniform_group_elements(acc, group, 1000) {
...
}
}
it's the inner loop (uniform_group_elements
) where one needs to be careful with alpaka::syncBlockThreads()
, right?
And there it would happen that the threads 0-7 of the block executing group 62 would run the inner loop, and there threads 8-15 would not, right?
If this is the case, how about adding a warning in the comment of uniform_group_elements
(perhaps also for elements_in_block
) that the alpaka::syncBlockThreads()
is not generally safe to be called within that loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the inner loop (
uniform_group_elements
) where one needs to be careful withalpaka::syncBlockThreads()
, right?
Yes.
In fact, it does not make sense to call alpaka::syncBlockThreads()
within the inner loop.
One should always split the inner loop in multiple loops, and synchronise between them:
for (auto group : uniform_groups(acc, 1000) {
for (auto element : uniform_group_elements(acc, group, 1000) {
// no synchronisations here
...
}
alpaka::syncBlockThreads();
for (auto element : uniform_group_elements(acc, group, 1000) {
// no synchronisations here
...
}
alpaka::syncBlockThreads();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And there it would happen that the threads 0-7 of the block executing group 62 would run the inner loop, and there threads 8-15 would not, right?
All threads would execute the for (auto element : uniform_group_elements(acc, group, 1000)
loop.
However, the number of iterations of the inner loop will be different for the different threads: 0-7 will execute 1 iteration, 8-15 will execute 0 iterations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the case, how about adding a warning in the comment of
uniform_group_elements
(perhaps also forelements_in_block
) that thealpaka::syncBlockThreads()
is not generally safe to be called within that loop?
Sure, see #43632 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the number of iterations of the inner loop will be different for the different threads: 0-7 will execute 1 iteration, 8-15 will execute 0 iterations.
Thanks, this is what I was after, but written more accurately.
stride_{alpaka::getWorkDiv<alpaka::Grid, alpaka::Blocks>(acc)[0u]}, | ||
extent_{groups} {} | ||
|
||
class iterator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle this class could be named as const_iterator
as in practice it provides only const
access. Such name would avoid wondering of why begin() const
returns a non-const iterator
, before realizing the iterator
itself gives only const access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind the type name, since I don't expect anybody will ever use the nested iterator
types explicitly.
For the same reason I don't mind renaming them - and I assume the same comment applies to the other range-loop types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #43632 .
* | ||
* Iterating over the range yields the local element index, between `0` and `elements - 1`. The threads in the block | ||
* will perform one or more iterations, depending on the number of elements per thread, and on the number of threads | ||
* per block, ocmpared with the total number of elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* per block, ocmpared with the total number of elements. | |
* per block, compared with the total number of elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in #43632 .
stride_{alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc)[0u] * elements_}, | ||
extent_{extent} {} | ||
|
||
class iterator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here I'd find const_iterator
clearer name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed in #43632 .
PR description:
Implement
independent_groups(acc, groups)
andindependent_group_elements(acc, elements)
for Alpaka kernels:independent_groups(acc, groups)
returns a range than spans the group indices from 0 togroups
, with one group per block;independent_group_elements(acc, elements)
returns a range that spans all the elements within the given group, from 0 toelements
.Add
uniform_groups
anduniform_group_elements
type aliases.Extend
elements_with_stride
with an optional starting value.Add a test for
independent_groups
,independent_group_elements
andelements_with_stride
with a starting value.PR validation:
The new integration test passes on CPU and on GPU.