Add support for pinned memory. #133
Conversation
Stuff I'm unhappy with:
CC: @lcw who has expressed interest in this.
Would it be sane to make the …
Fixes across the stack (including a clean-up of some GPUArrays code): Is anybody using these …
Edit: Never mind. Just saw the changes across the stack. This will take a minute to go through and see how it all is interacting across packages. I am not at a workstation with a GPU right now; I will check out the changes and try testing it Thursday at the earliest.
I'll be sure to update the CUDAdrv tests and add an example here to illustrate the envisioned usage.
This is how it currently looks:

```julia
using CUDAdrv, CuArrays

dims = (5,)
T = Int
bytes = prod(dims)*sizeof(T)

# pinned memory
buf = Mem.alloc(Mem.Host, bytes)
cpu_ptr = convert(Ptr{T}, buf)
cpu_obj = unsafe_wrap(Array, cpu_ptr, dims)
gpu_obj = CuArray{T}(undef, dims)
copyto!(gpu_obj, cpu_obj)

# pinned and mapped
buf = Mem.alloc(Mem.Host, bytes, Mem.HOSTALLOC_DEVICEMAP)
cpu_ptr = convert(Ptr{T}, buf)
cpu_obj = unsafe_wrap(Array, cpu_ptr, dims)
gpu_ptr = convert(CuPtr{T}, buf)
gpu_obj = unsafe_wrap(CuArray, gpu_ptr, dims)
```

Thoughts?
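For completeness, a round trip through the first (pinned but not mapped) buffer might look like the following. This is an illustrative sketch re-using the names from the first half of the example above, and it assumes that `copyto!` between `Array` and `CuArray` dispatches to a pinned-aware copy:

```julia
cpu_obj .= 1:5               # fill the pinned host array
copyto!(gpu_obj, cpu_obj)    # host -> device; can take the fast pinned path
fill!(cpu_obj, 0)
copyto!(cpu_obj, gpu_obj)    # device -> host
@assert cpu_obj == 1:5
Mem.free(buf)                # the pinned buffer is not GC-managed; free it explicitly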
I think it looks good/hits the mark! As mentioned, I won't have access to a GPU until tomorrow, but I will try to look at the branch and see how it holds up to some intentional abuse in the morning 😃
This is reallllllyyyy nice. Just had a chance to do some basic testing. Only one comment regarding calling … Right now it looks like the default … e.g.
In my case I would almost always use async. But there are two reasons to pin: speed and async/non-default streams. I can think of situations where the implicit sync might be useful. I like the idea of having a simple default dispatch and then optional flags (as you've done with Mem.copy!).
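Concretely, the kind of dispatch being described might look like this. It is an illustrative sketch only, not the PR's actual API: it re-uses `gpu_ptr`, `cpu_ptr`, and `bytes` from the example above, and it assumes the pointer-based `Mem.copy!` methods and `synchronize(stm)` exist in this form:

```julia
stm = CuStream()

# default: a copy that behaves synchronously with respect to the host
Mem.copy!(gpu_ptr, cpu_ptr, bytes)

# opt-in: asynchronous copy on a user-provided stream (host memory must be pinned)
Mem.copy!(gpu_ptr, cpu_ptr, bytes, async=true, stream=stm)
synchronize(stm)
```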
I just got a chance to test this out. It seems to work great! I ran the following code

```julia
using CUDAdrv

"""
    empiricalbandwidth(nbytes=1024^2; devicenumber=0, ntests=10, pinned=false)

Compute the empirical bandwidth in GB/s of a CUDA device using `nbytes` of
memory. The device to test can be selected with `devicenumber` and the
bandwidth is an average of `ntests`.
"""
function empiricalbandwidth(nbytes=1024^2; devicenumber=0, ntests=10, pinned=false)
    dev = CuDevice(devicenumber)
    ctx = CuContext(dev)
    stm = CuStream()

    d = rand(Char, nbytes)
    a = pinned ? Mem.alloc(Mem.Host, nbytes) : pointer(d)
    b = Mem.alloc(Mem.Device, nbytes)

    # warm up the transfers in both directions
    Mem.copy!(a, b, nbytes)
    Mem.copy!(b, a, nbytes)

    # time device-to-host copies
    t0, t1 = CuEvent(), CuEvent()
    record(t0, stm)
    for n = 1:ntests
        Mem.copy!(a, b, nbytes, async=true, stream=stm)
    end
    record(t1, stm)
    synchronize(t1)
    t_dtoh = elapsed(t0, t1)
    bandwidth_dtoh = nbytes*ntests/(t_dtoh*1e9)

    # time host-to-device copies
    t0, t1 = CuEvent(), CuEvent()
    record(t0, stm)
    for n = 1:ntests
        Mem.copy!(b, a, nbytes, async=true, stream=stm)
    end
    record(t1, stm)
    synchronize(t1)
    t_htod = elapsed(t0, t1)
    bandwidth_htod = nbytes*ntests/(t_htod*1e9)

    pinned && Mem.free(a)
    Mem.free(b)

    (bandwidth_dtoh, bandwidth_htod)
end
```

and got
Let me know if there is anything I can do to help get this branch merged in.
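A comparison of the two paths then boils down to calling the function twice, along these lines (illustrative sketch; a single GPU at device 0 and the default problem size are assumed):

```julia
bw_pageable = empiricalbandwidth(; pinned=false)  # pageable host memory
bw_pinned   = empiricalbandwidth(; pinned=true)   # pinned host memory
@show bw_pageable bw_pinned                       # (dtoh, htod) bandwidth in GB/s
```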
We're not taking the address of a Julia object, as unsafe_convert does.
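For context, the distinction being drawn here, illustrated with plain Julia rather than code from this PR: `Base.unsafe_convert` yields the address of a GC-managed object, which is only valid while that object stays rooted, whereas `convert(Ptr{T}, buf)` above simply exposes an address the CUDA driver already owns:

```julia
arr = zeros(Float64, 4)
p = Base.unsafe_convert(Ptr{Float64}, arr)  # address of a GC-managed array
GC.@preserve arr begin
    unsafe_store!(p, 1.0)                   # only safe while `arr` is rooted
end
```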
Fix #122
Based on #123, this tries to improve the API by making it more idiomatic while protecting against erroneous conversions.
WIP, but let me know what you think @wsphillips. Also cc @vchuravy.