groupreduction and subgroupreduction #421
Conversation
removed config include
removed using GPUArrays versions
Thank you! As you found, the notion of a subgroup is not necessarily consistent across backends. So I am wondering if we need to expose that as a feature, or if backends could use it to implement reduction. So, as an example, could we "just" add …? What is …? Do we need a broadcast/shuffle operation? I don't have the answers to those questions. I think adding …
I will fix this; meanwhile we can think about the warps, whether we will introduce them and in what way.
Yeah, requiring a neutral value sounds good. Couldn't each backend choose to use warp-level ops, if available, instead of having the user make this decision?
The problem with that approach is that warp intrinsics don't work for every type. We would then need something like this:
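A minimal sketch of such a type gate (the helper names here are illustrative placeholders, not this PR's API):

```julia
# Illustrative sketch: gate the warp/subgroup path on the element type and
# fall back to the generic shared-memory path otherwise. The helper names
# are placeholders, not part of KernelAbstractions.
__can_subgroup_reduce(::Type) = false
__can_subgroup_reduce(::Type{<:Union{Float32, Float64, Int32, UInt32}}) = true

__subgroupreduce_impl(op, val, neutral) = op(neutral, val)   # stands in for warp intrinsics
__sharedmem_reduce_impl(op, val, neutral) = op(neutral, val) # stands in for the @localmem path

function __groupreduce(op, val::T, neutral::T) where {T}
    if __can_subgroup_reduce(T)
        __subgroupreduce_impl(op, val, neutral)
    else
        __sharedmem_reduce_impl(op, val, neutral)
    end
end
```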
In this case you don't override the function. I am also not sure if this kind of type specialisation works, where it takes the first function when it is one of the types. You probably have more expertise on this. Would it be preferred that every thread returns the right rval? I do think in that case I will have to rethink the warp-reduction-based groupreduce.
No, it just needs to be documented :) and we might want to add a …

So it looks like CUDA.jl makes this decision based on the element type?
src/KernelAbstractions.jl
Outdated
""" | ||
@subgroupsize() | ||
|
||
returns the GPUs subgroupsize. | ||
""" | ||
macro subgroupsize() | ||
quote | ||
$__subgroupsize() | ||
end | ||
end |
""" | |
@subgroupsize() | |
returns the GPUs subgroupsize. | |
""" | |
macro subgroupsize() | |
quote | |
$__subgroupsize() | |
end | |
end |
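As a usage illustration, a kernel could query the value like this (a sketch only; `@subgroupsize` is the macro proposed in this PR and not released API):

```julia
using KernelAbstractions

# Each workitem records the subgroup size it observes.
@kernel function record_subgroupsize!(out)
    i = @index(Global)
    @inbounds out[i] = @subgroupsize()
end
```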
src/reduce.jl
Outdated
```julia
idx_in_group = @index(Local)
groupsize = @groupsize()[1]

localmem = @localmem(T, groupsize)
```
I see. If we can do a subgroupreduce, the memory we need here is much reduced.
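For a sense of the saving, with illustrative numbers (one shared-memory slot per subgroup instead of one per workitem):

```julia
# Back-of-the-envelope shared-memory slot count per workgroup (example values).
groupsize    = 256                                     # workitems per group
subgroupsize = 32                                      # e.g. a CUDA warp
slots_without_subgroups = groupsize                    # one slot per workitem: 256
slots_with_subgroups    = cld(groupsize, subgroupsize) # one slot per subgroup: 8
```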
src/reduce.jl
Outdated
- `op`: the operator of the reduction
- `val`: value that each thread contributes to the values that need to be reduced
- `neutral`: neutral value of the operator, so that `op(neutral, neutral) = neutral`
- `use_subgroups`: make use of subgroup reduction in the group reduction
I see the value of having a common implementation. So I would define:

```julia
@reduce(op, val, neutral)
__reduce(op, val, neutral)
```

And then maybe `__subgroupreduce` and `__subgroupsize` (no `@` version), plus:

```julia
__can_subgroup_reduce(T) = false
```

And then we could define:

```julia
__reduce(__ctx__, op, val, neutral, ::Val{true})
__reduce(__ctx__, op, val, neutral, ::Val{false})
```

as you have here, with:

```julia
function __reduce(op, val::T, neutral::T) where T
    __reduce(op, val, neutral, Val(__can_subgroup_reduce(T)))
end
```
Yeah, so CUDA.jl (and I think most GPU backends) makes the decision based on type. I was thinking whether something like this would be possible. For the implementation without warps:
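Roughly along these lines, an illustrative sketch that follows the `@localmem` / `while d < groupsize` pattern from the diff hunks above (not the PR's exact code):

```julia
using KernelAbstractions

# Generic per-workgroup tree reduction in shared memory, no subgroup intrinsics.
@kernel function groupreduce_kernel!(op, neutral, out, @Const(x))
    i   = @index(Global)
    lid = @index(Local)
    gid = @index(Group)
    groupsize = @groupsize()[1]

    localmem = @localmem(eltype(out), groupsize)
    localmem[lid] = i <= length(x) ? x[i] : neutral

    # tree reduction: double the stride until one value per group remains
    d = 1
    while d < groupsize
        @synchronize()
        index = 2 * d * (lid - 1) + 1
        if index + d <= groupsize
            localmem[index] = op(localmem[index], localmem[index + d])
        end
        d *= 2
    end

    # the first workitem writes the per-group result
    if lid == 1
        out[gid] = localmem[1]
    end
end
```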
And every backend, like CUDAKernels, just implements something like this:
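For instance, something in this spirit (an illustrative sketch assuming the `__can_subgroup_reduce` / `__subgroupreduce` hooks proposed above end up in KernelAbstractions; the shuffle is CUDA.jl's `shfl_down_sync`):

```julia
using CUDA
import KernelAbstractions

# Opt in only for element types the warp shuffle handles.
KernelAbstractions.__can_subgroup_reduce(::Type{<:Union{Float32, Int32, UInt32}}) = true

# Device-side reduction across the 32 lanes of a CUDA warp.
function KernelAbstractions.__subgroupreduce(op, val)
    offset = 16
    while offset > 0
        val = op(val, CUDA.shfl_down_sync(0xffffffff, val, offset))
        offset >>= 1
    end
    return val
end
```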
It would allow for one common implementation and a few specializations that are done by the backends themselves. But …
I recommend …
Only the group reduction without subgroups remains. It is ready to merge; once merged, I will add support for subgroups to CUDAKernels, MetalKernels, and oneAPI.
Okay, so this is API-neutral? A pure addition, and we don't need anything new?
```julia
# perform the reduction
d = 1
while d < groupsize
    @synchronize()
```
Workaround?
I am unsure why my previous PR was closed, but here are the changes.
It was my first time writing tests, and they passed. How are these tested on GPUs, and do I need more tests?