Skip to content
This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Create a CUDA context #406

Draft
wants to merge 34 commits into
base: master
Choose a base branch
from
Draft

Create a CUDA context #406

wants to merge 34 commits into from

Conversation

DhairyaLGandhi
Copy link
Member

@DhairyaLGandhi DhairyaLGandhi commented Aug 29, 2019

Thanks to IRTools.jl, we can do some nifty things with Julia IR. Like using a dynamo to walk through the deep IR and offload sensible ops to the GPU.

julia> c = Conv((3,3), 3 => 16, pad = (1,1), relu); # from Flux

julia> r = rand(Float32, 32, 32, 3, 100);

julia> cuda() do
           c(r)
       end # run on GPU

julia> a = rand(Float32, 5*10^4);

julia> b = rand(Float32, 5*10^4);

julia> cuda() do
           a + b
       end
50000-element Array{Float32,1}:
 0.9649581
 1.2122422
 0.423553
...

Notice the return type is a normal Array, meaning that without much fidgeting, it is trivial to offload computation to the GPU and continue where you left off.

There are a couple caveats, not all functions behave nicely yet and we need better test coverage, but opening it now to get some review and direction of the way forward

cc @MikeInnes

ref https://github.com/JuliaGPU/CuArrays.jl/issues/303

@vchuravy
Copy link
Member

Thanks! What is driving the choice to use IRTools over Cassette? I would prefer the maintenance burden to rest with Cassette (e.g. me)

@DhairyaLGandhi
Copy link
Member Author

The choice was made for the little nicer control over the IR with IRTools. Also, it's conceptually simpler so maintaining it should be easier also.

It was also fairly straightforward to define in lesser code making it more readable. Mind you I'm no cassette pro, but definitely worth a discussion.

src/context.jl Outdated
function get_cached(array_bank, arr::Array{T,N})::CuArray{T,N} where {T,N}
haskey(array_bank, arr) ?
array_bank[arr] :
(array_bank[arr] = CuArray(arr))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably write get_cached(cx, x) as cx[x].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on using get!? I suppose the extra network transfer would be a problem in this case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could look at Base.@get!, which avoids evaluating the new value if it's not needed.

src/context.jl Outdated
(array_bank[arr] = CuArray(arr))
end

function (c::CUDACtx)(::typeof(broadcasted), f, args...)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For broadcast I think we should leave the broadcast struct alone (so that CuArrays can't leak into the program), and instead do all conversion and computation in materialize.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I will dig into it

src/context.jl Outdated Show resolved Hide resolved
src/context.jl Outdated Show resolved Hide resolved
src/context.jl Outdated
ir = IR(meta...)
ir == nothing && return

pr = Pipe(ir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could replace this code with IRTools.recurse!(ir) (there's some info in the docs if needed).

src/context.jl Outdated Show resolved Hide resolved
src/context.jl Outdated
@eval (c::CUDACtx)(::typeof($f), args...) = $f(args...)
end

noop_pass.((get_cached, NNlib.check_spdf,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best if these are macros. It'd be nice to add an @cuda macro or similar for the purpose of overloading CUDACtx.

src/context.jl Outdated
noop_pass.((get_cached, NNlib.check_spdf,
))

for f in names(NNlib)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to do this explicitly per function.

src/context.jl Outdated Show resolved Hide resolved
test/context.jl Outdated
@testset "simple ops" begin
W = rand(5, 5)
b = rand(5)
@test cuda(() -> W*b) ≈ W*b
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to check types here as well, e.g. that the output is still an Array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worthwhile to have a way to switch off emptying the context? I'd like to be able to say if Arrays were in fact also allocated on the GPU; and a crude way might be to check the types in the context dict after the fact

@MikeInnes
Copy link
Collaborator

@vchuravy there probably isn't much in it, so if the lead maintainers of this package strongly prefer Cassette then I imagine it'd be OK to port it over.

Though as Dhairya points out there's a couple of potential advantages to fine grained control of the IR pass; the main one is that it's easier to cut out classes of functions we're not interested in, e.g. intrinsics or certain modules in Base, avoiding some redundant recompilation.

@@ -4,7 +4,7 @@ using CUDAapi, CUDAdrv, CUDAnative

using GPUArrays

export CuArray, CuVector, CuMatrix, CuVecOrMat, cu
export CuArray, CuVector, CuMatrix, CuVecOrMat, cu, cuda
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like a pretty generic function to export (both cu and cuda are bound to confuse users). why not something that implies its action, e.g., on_cuda? or @cuda re @async?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about not exporting this for now. In the longer term, if this is successful it should replace cu entirely (alongside all the other APIs, for most users), so a generic name seems appropriate.

I think cuda() do ... reads right, and provides an obvious space for options (cuda(device=2) do ...), but @cuda could work well too (especially in that it's a bit nicer for one liners).

src/context.jl Outdated Show resolved Hide resolved
@maleadt
Copy link
Member

maleadt commented Aug 29, 2019

Very interesting! Looking forward to giving this a spin, might open up some nice new ways of doing GPU computation.

I guess we'll need some way to assert GPU execution to actually test this?

@DhairyaLGandhi
Copy link
Member Author

Yeah, for the tests I was thinking just having a context which we can look into to assert that the array is actually in there and corresponds to memory associated with the GPU

@vchuravy
Copy link
Member

vchuravy commented Sep 2, 2019

Grrml Gmail ate my reply:

Since CUDAnative will use Cassette and GPUifyLoops already does I would strongly prefer only having one tool in the GPU ecosystem to do this. I would be interested in making IRTools transforms/utility functions work with Cassette, which should work relatively straightforwardly.

end
end

@contextual :+ :- :* :/ sum similar materialize
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should set these things up to explicitly call whatever lower-level bindings we have; it should show what it would look like if we got rid of CuArray altogether.

end
end

@noop_pass get_cached NNlib.check_spdf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a noop for get_cached? That shouldn't ever be called in code that we're transforming, right?

using IRTools: meta, Pipe, finish, Variable, self
using MacroTools: @forward

import Base.Broadcast.broadcasted
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These imports are redundant now

test/context.jl Outdated
@testset "simple ops" begin
W = rand(5, 5)
b = rand(5)
@test cuda(() -> W*b) ≈ W*b
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worthwhile to have a way to switch off emptying the context? I'd like to be able to say if Arrays were in fact also allocated on the GPU; and a crude way might be to check the types in the context dict after the fact

src/context.jl Outdated
(array_bank[arr] = CuArray(arr))
end

function (c::CUDACtx)(::typeof(broadcasted), f, args...)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I will dig into it

function cache(cx, x::CuArray{T,N})::Array{T,N} where {T,N}
cpu = Array{T,N}(undef, ntuple(_->0,N))
cx[cpu] = x
return cpu
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cases like BatchNorm, before any compute is hit, the check for dimensions takes the cpu which has 0 shape, which errors. Returning the data also seems wasteful. Thoughts?

This does seem to make things work (including backwards pass on Zygote with Flux models, but its hitting some bad code paths currently).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants