This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Merge #467
467: implement CuIterator for batching arrays to the GPU r=jrevels a=jrevels

I'm hitting what I believe is likely a common use case for CuArrays. My workload is essentially a map-reduce of the form:

```julia
some_reduction(f(batch::Tuple{Vararg{Array}})::Number for batch in batches)
```

Assuming `f` plays nicely with CuArrays, one way to get this running on a GPU is:

```julia
cubatches = (map(x -> adapt(CuArray, x), batch) for batch in batches)
some_reduction(f(cubatch) for cubatch in cubatches)
```

Unfortunately, this approach is a poor one when `batches` doesn't fit entirely in GPU memory. As the caller, I can assert that old iterations' batches don't need to be kept around, so ideally I'd have a mechanism that simply reuses old iterations' memory instead of allocating more. This PR implements such a mechanism: a `CuIterator` that maintains a memory pool and exploits the assumption that previous iterations' memory can be reused.
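With this PR, the generator example above can be rewritten so that only the current batch lives in GPU memory (a sketch; `f`, `some_reduction`, and `batches` are as defined above):

```julia
using CuArrays

# CuIterator adapts each batch to the GPU on demand and marks the previous
# batch's memory as freeable, instead of keeping every uploaded batch alive.
cubatches = CuIterator(batch for batch in batches)
some_reduction(f(cubatch) for cubatch in cubatches)
```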

This is a rough POC sketch right now, just to express what I'm trying to get at; it's probably dumb in ways I can't yet comprehend. AFAICT I'd like to do the following before merging:

- [x] instead of the current shape/eltype-matching approach, just allocate to a device buffer (~`CUDAdrv.UnifiedBuffer`? I have no idea what I'm doing 😅 EDIT: oh is this how CUDAdrv exposes UVM or is that separate?~), increasing the buffer size as necessary once larger batches are encountered
- [x] ~leverage `CUDAdrv.prefetch` to asynchronously move values to GPU memory so as to not artificially block the iterator's consumer~ EDIT: I'm not sure this is necessary anymore assuming the `copyto!` call we're using is asynchronous?
- [x] make sure to free the pool once iteration is finished
- [x] docs
- [x] tests

In a future PR, we could add a feature for `CuIterator` to utilize UVM if the caller is in an environment where that's supported.

cc @vchuravy (thanks for discussing this with me earlier!)



Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
bors[bot] and jrevels authored Mar 12, 2020
2 parents 15f6b54 + 259ed9f commit ae239d7
Showing 4 changed files with 46 additions and 1 deletion.
3 changes: 2 additions & 1 deletion src/CuArrays.jl
@@ -4,7 +4,7 @@ using CUDAapi, CUDAdrv, CUDAnative
 
 using GPUArrays
 
-export CuArray, CuVector, CuMatrix, CuVecOrMat, cu
+export CuArray, CuVector, CuMatrix, CuVecOrMat, CuIterator, cu
 export CUBLAS, CUSPARSE, CUSOLVER, CUFFT, CURAND, CUDNN, CUTENSOR
 
 import LinearAlgebra
@@ -93,6 +93,7 @@ include("mapreduce.jl")
 include("accumulate.jl")
 include("linalg.jl")
 include("nnlib.jl")
+include("iterator.jl")
 
 include("deprecated.jl")
27 changes: 27 additions & 0 deletions src/iterator.jl
@@ -0,0 +1,27 @@
"""
CuIterator(batches)
Return a `CuIterator` that can iterate through the provided `batches` via `Base.iterate`.
Upon each iteration, the current `batch` is adapted to the GPU (via `map(x -> adapt(CuArray, x), batch)`)
and the previous iteration is marked as freeable from GPU memory (via `unsafe_free!`).
This abstraction is useful for batching data into GPU memory in a manner that
allows old iterations to potentially be freed (or marked as reusable) earlier
than they otherwise would via CuArray's internal polling mechanism.
"""
mutable struct CuIterator{B}
batches::B
previous::Any
CuIterator(batches) = new{typeof(batches)}(batches)
end

function Base.iterate(c::CuIterator, state...)
item = iterate(c.batches, state...)
isdefined(c, :previous) && foreach(unsafe_free!, c.previous)
item === nothing && return nothing
batch, next_state = item
cubatch = map(x -> adapt(CuArray, x), batch)
c.previous = cubatch
return cubatch, next_state
end
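A short sketch of the freeing semantics implemented above: each subsequent `iterate` call marks the previous iteration's arrays as freeable before uploading the next batch (the array sizes here are arbitrary):

```julia
using CuArrays

batches = [[rand(Float32, 16, 16)] for _ in 1:3]
it = CuIterator(batches)

cubatch1, state = iterate(it)        # uploads batch 1 to the GPU
cubatch2, state = iterate(it, state) # calls unsafe_free! on batch 1's arrays,
                                     # then uploads batch 2
```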
16 changes: 16 additions & 0 deletions test/iterator.jl
@@ -0,0 +1,16 @@
@testset "CuIterator" begin
batch_count = 10
max_batch_items = 3
max_ndims = 3
sizes = 20:50
rand_shape = () -> rand(sizes, rand(1:max_ndims))
batches = [[rand(Float32, rand_shape()...) for _ in 1:rand(1:max_batch_items)] for _ in 1:batch_count]
cubatches = CuIterator(batch for batch in batches) # ensure generators are accepted
previous_cubatch = missing
for (batch, cubatch) in zip(batches, cubatches)
@test ismissing(previous_cubatch) || all(x -> x.freed, previous_cubatch)
@test batch == Array.(cubatch)
@test all(x -> x isa CuArray, cubatch)
previous_cubatch = cubatch
end
end
1 change: 1 addition & 0 deletions test/runtests.jl
@@ -60,6 +60,7 @@ include("solver.jl")
 include("sparse_solver.jl")
 include("dnn.jl")
 include("tensor.jl")
+include("iterator.jl")
 
 include("forwarddiff.jl")
 include("nnlib.jl")
