Create a CUDA context #406

Conversation
Thanks! What is driving the choice to use IRTools over Cassette? I would prefer the maintenance burden to rest with Cassette (e.g. me).
The choice was driven by the slightly finer control over the IR that IRTools gives. It's also conceptually simpler, so it should be easier to maintain, and it was fairly straightforward to express in less code, which keeps it readable. Mind you, I'm no Cassette pro, but it's definitely worth a discussion.
src/context.jl
Outdated

function get_cached(array_bank, arr::Array{T,N})::CuArray{T,N} where {T,N}
  haskey(array_bank, arr) ?
    array_bank[arr] :
    (array_bank[arr] = CuArray(arr))
We could probably write get_cached(cx, x) as cx[x].
Thoughts on using get!? I suppose the extra network transfer would be a problem in this case.
You could look at Base.@get!, which avoids evaluating the new value if it's not needed.
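For reference, a minimal sketch (not the PR's code) of the same cache lookup written with the do-block form of get!, which likewise only constructs the CuArray when the key is missing:

using CuArrays: CuArray

# Illustrative sketch: the do-block form of get! only evaluates the default
# (i.e. the CPU-to-GPU copy) when `arr` is not already cached.
function get_cached(array_bank, arr::Array{T,N})::CuArray{T,N} where {T,N}
  get!(array_bank, arr) do
    CuArray(arr)
  end
end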
src/context.jl
Outdated

  (array_bank[arr] = CuArray(arr))
end

function (c::CUDACtx)(::typeof(broadcasted), f, args...)
For broadcast I think we should leave the broadcast struct alone (so that CuArrays can't leak into the program), and instead do all conversion and computation in materialize.
Makes sense, I will dig into it
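For context, a rough sketch of what intercepting materialize instead of broadcasted might look like. This is illustrative only: it assumes get_cached(c, x) maps a CPU Array to its cached CuArray, and it ignores nested Broadcasted arguments for brevity.

using CuArrays: CuArray
using Base.Broadcast: Broadcasted, broadcasted, materialize

# Sketch only, not the PR's code: the lazy Broadcasted struct is left alone,
# arguments are moved to the GPU just before the computation happens, and a
# plain Array is handed back so no CuArray leaks out of the context.
function (c::CUDACtx)(::typeof(materialize), bc::Broadcasted)
  gpu_args = map(a -> a isa Array ? get_cached(c, a) : a, bc.args)
  result = materialize(broadcasted(bc.f, gpu_args...))
  return result isa CuArray ? Array(result) : result
end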
src/context.jl
Outdated

ir = IR(meta...)
ir == nothing && return

pr = Pipe(ir)
You could replace this code with IRTools.recurse!(ir) (there's some info in the docs if needed).
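For reference, the recurse!-based version would look roughly like the following, a minimal sketch built on the standard IRTools dynamo pattern and assuming CUDACtx is the PR's context type:

using IRTools: @dynamo, IR, recurse!

# recurse! rewrites every call f(args...) in the IR into c(f, args...),
# so nested calls also pass back through the context; this replaces the
# manual Pipe-based rewrite above.
@dynamo function (c::CUDACtx)(meta...)
  ir = IR(meta...)
  ir === nothing && return
  recurse!(ir)
  return ir
end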
src/context.jl
Outdated

@eval (c::CUDACtx)(::typeof($f), args...) = $f(args...)
end

noop_pass.((get_cached, NNlib.check_spdf,
Probably best if these are macros. It'd be nice to add an @cuda macro or similar for the purpose of overloading CUDACtx.
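A hypothetical sketch of what a macro-based version could look like. The @noop_pass name mirrors the later commits in this PR, but the implementation here is only illustrative:

# Defines a pass-through (noop) overload of CUDACtx for each given function,
# so calls to those functions run directly instead of being recursed into.
macro noop_pass(fs...)
  defs = map(fs) do f
    :( (c::CUDACtx)(::typeof($(esc(f))), args...) = $(esc(f))(args...) )
  end
  return Expr(:block, defs...)
end

# Usage, matching the snippet that appears later in the diff:
# @noop_pass get_cached NNlib.check_spdf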
src/context.jl
Outdated

noop_pass.((get_cached, NNlib.check_spdf,
))

for f in names(NNlib)
Better to do this explicitly per function.
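In other words, rather than looping over names(NNlib), list the functions of interest by hand. A sketch of that shape, where the function list is illustrative rather than the PR's final set:

import NNlib

# Explicit, per-function overloads instead of iterating names(NNlib).
for f in (:conv, :maxpool, :meanpool, :softmax)
  @eval (c::CUDACtx)(::typeof(NNlib.$f), args...) = NNlib.$f(args...)
end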
test/context.jl
Outdated

@testset "simple ops" begin
  W = rand(5, 5)
  b = rand(5)
  @test cuda(() -> W*b) ≈ W*b
Good to check types here as well, e.g. that the output is still an Array.
Would it be worthwhile to have a way to switch off emptying the context? I'd like to be able to tell whether Arrays were in fact also allocated on the GPU, and a crude way might be to check the types in the context dict after the fact.
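A rough sketch of the suggested type check, building on the test above (illustrative; it only checks that a plain Array comes back out, not that the work happened on the GPU):

using Test

@testset "simple ops" begin
  W = rand(5, 5)
  b = rand(5)
  out = cuda(() -> W*b)
  @test out ≈ W*b
  # the result should have been converted back to a CPU array on the way out
  @test out isa Array{Float64,1}
end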
@vchuravy there probably isn't much in it, so if the lead maintainers of this package strongly prefer Cassette then I imagine it'd be OK to port it over. Though, as Dhairya points out, there are a couple of potential advantages to fine-grained control of the IR pass; the main one is that it's easier to cut out classes of functions we're not interested in, e.g. intrinsics or certain modules in Base, avoiding some redundant recompilation.
@@ -4,7 +4,7 @@ using CUDAapi, CUDAdrv, CUDAnative

 using GPUArrays

-export CuArray, CuVector, CuMatrix, CuVecOrMat, cu
+export CuArray, CuVector, CuMatrix, CuVecOrMat, cu, cuda
Seems like a pretty generic function to export (both cu and cuda are bound to confuse users). Why not something that implies its action, e.g. on_cuda? Or @cuda, along the lines of @async?
I agree about not exporting this for now. In the longer term, if this is successful it should replace cu entirely (alongside all the other APIs, for most users), so a generic name seems appropriate.
I think cuda() do ... reads right, and provides an obvious space for options (cuda(device=2) do ...), but @cuda could work well too (especially in that it's a bit nicer for one-liners).
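To make the comparison concrete, roughly how the two spellings being discussed would read. Neither exists in the PR yet; the device keyword is only a hypothetical example of where options might go:

W, b = rand(5, 5), rand(5)

# do-block form: an obvious place to pass options
result = cuda(device=2) do
  W * b
end

# proposed macro form, a bit nicer for one-liners
result = @cuda W * b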
Very interesting! Looking forward to giving this a spin; it might open up some nice new ways of doing GPU computation. I guess we'll need some way to assert GPU execution to actually test this?
Yeah, for the tests I was thinking of just having a context that we can look into, to assert that the array is actually in there and corresponds to memory associated with the GPU.
Grrml, Gmail ate my reply: since CUDAnative will use Cassette and GPUifyLoops already does, I would strongly prefer having only one tool in the GPU ecosystem to do this. I would be interested in making the IRTools transforms/utility functions work with Cassette, which should be relatively straightforward.
end
end

@contextual :+ :- :* :/ sum similar materialize
I think we should set these things up to explicitly call whatever lower-level bindings we have; it should show what it would look like if we got rid of CuArray altogether.
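A rough sketch of that direction, still routed through CuArray here since this PR doesn't yet have the direct low-level bindings the comment envisions, and assuming get_cached(c, x) returns the cached CuArray:

import LinearAlgebra
using CuArrays

# Dispatch * under the context straight to an in-place GEMM on the device,
# rather than relying on the generic CuArray * method.
function (c::CUDACtx)(::typeof(*), A::Matrix{T}, B::Matrix{T}) where T
  dA, dB = get_cached(c, A), get_cached(c, B)
  dC = similar(dA, T, (size(A, 1), size(B, 2)))
  LinearAlgebra.mul!(dC, dA, dB)   # hits CUBLAS gemm! for CuArrays
  return Array(dC)
end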
end
end

@noop_pass get_cached NNlib.check_spdf
Why do we need a noop for get_cached? That shouldn't ever be called in code that we're transforming, right?
using IRTools: meta, Pipe, finish, Variable, self
using MacroTools: @forward

import Base.Broadcast.broadcasted
These imports are redundant now
function cache(cx, x::CuArray{T,N})::Array{T,N} where {T,N}
  cpu = Array{T,N}(undef, ntuple(_->0,N))
  cx[cpu] = x
  return cpu
In cases like BatchNorm, before any compute is hit, the dimension check sees the cpu placeholder, which has zero shape, and errors. Returning the data also seems wasteful. Thoughts?
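One possible tweak for the shape problem (illustrative only, not something settled in this thread): give the placeholder the real shape but leave its contents undefined, so size checks pass without copying data back from the GPU:

function cache(cx, x::CuArray{T,N})::Array{T,N} where {T,N}
  cpu = Array{T,N}(undef, size(x))  # correct shape, but uninitialised data
  cx[cpu] = x
  return cpu
end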
This does seem to make things work (including the backwards pass with Zygote on Flux models), but it's hitting some bad code paths currently.
Force-pushed from fc487fd to fced436
Thanks to IRTools.jl, we can do some nifty things with Julia IR, like using a dynamo to walk through the deep IR and offload sensible ops to the GPU.
Notice that the return type is a normal Array, meaning that without much fiddling, it is trivial to offload computation to the GPU and continue where you left off.
There are a couple of caveats: not all functions behave nicely yet and we need better test coverage, but I'm opening this now to get some review and direction on the way forward.
cc @MikeInnes
ref https://github.com/JuliaGPU/CuArrays.jl/issues/303