This repository has been archived by the owner on May 27, 2021. It is now read-only.

Add support for pinned memory. #133

Merged: 25 commits from tb/pinned into master on Apr 12, 2019

Conversation

@maleadt (Member) commented Mar 25, 2019

Fix #122

Based on #123; this tries to make the API more idiomatic while protecting against erroneous conversions.

WIP, but let me know what you think @wsphillips. Also cc @vchuravy.

@maleadt (Member, Author) commented Mar 25, 2019

Stuff I'm unhappy with:

  • Mem.alloc(Mem.DeviceBuffer, ...) is too verbose (see the sketch after this list).
  • In Mem.upload(..., stream; async), the stream argument only makes sense with async=true, but having both upload and e.g. async_upload isn't nice either.
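
For reference, a rough sketch of the two call shapes being weighed in the first bullet. The shorter Mem.Device form is what the examples later in this thread use; treating it as an alias for Mem.DeviceBuffer is my reading of the branch, not something stated in this comment:

using CUDAdrv

bytes = 1024

# Verbose form the first bullet complains about: spell out the buffer type.
buf_long = Mem.alloc(Mem.DeviceBuffer, bytes)

# Shorter form, as used in the examples further down in this thread.
buf_short = Mem.alloc(Mem.Device, bytes)

Mem.free(buf_long)
Mem.free(buf_short)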

@vchuravy (Member) commented

CC: @lcw who has expressed interest in this.

@maleadt (Member, Author) commented Mar 26, 2019

Would it be sane to make the unsafe_copyto!(::HostBuffer) (aka. pinned memory) async by default, whereas copies to and from other kinds of memory would be synchronizing? Or do we still want an async flag?
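
One way to picture the first option is to pick the default from the buffer type via dispatch. This is purely a sketch of the design question, not code from the branch; Mem.Buffer and Mem.HostBuffer are assumed to be the abstract and pinned buffer types:

# Asynchronous by default only for pinned host memory; synchronizing otherwise.
default_async(::Mem.HostBuffer) = true
default_async(::Mem.Buffer)     = false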

@maleadt (Member, Author) commented Mar 26, 2019

Fixes across the stack (including a clean-up of some GPUArrays code):
https://github.com/JuliaGPU/CUDAnative.jl/compare/tb/pinned
https://github.com/JuliaGPU/GPUArrays.jl/compare/tb/unsafe_copyto
https://github.com/JuliaGPU/CuArrays.jl/compare/tb/pinned

Is anybody using these CUDAdrv.Mem APIs (apart from the above packages, which I intend to fix)? If not, this will be another breaking release.

@wsphillips commented Mar 26, 2019

Edit: Never mind, I just saw the changes across the stack. It will take me a minute to go through them and see how everything interacts across packages.

I am not at a workstation with a GPU right now; I will check out the changes and try testing them on Thursday at the earliest.

@maleadt (Member, Author) commented Mar 27, 2019

I'll be sure to update the CUDAdrv tests and add an example here to illustrate the envisioned usage.

@maleadt (Member, Author) commented Mar 27, 2019

This is how it currently looks:

using CUDAdrv, CuArrays

dims = (5,)
T = Int
bytes = prod(dims)*sizeof(T)


# pinned memory

buf = Mem.alloc(Mem.Host, bytes)

cpu_ptr = convert(Ptr{T}, buf)
cpu_obj = unsafe_wrap(Array, cpu_ptr, dims)

gpu_obj = CuArray{T}(undef, dims)
copyto!(gpu_obj, cpu_obj)


# pinned and mapped

buf = Mem.alloc(Mem.Host, bytes, Mem.HOSTALLOC_DEVICEMAP)

cpu_ptr = convert(Ptr{T}, buf)
cpu_obj = unsafe_wrap(Array, cpu_ptr, dims)

gpu_ptr = convert(CuPtr{T}, buf)
gpu_obj = unsafe_wrap(CuArray, gpu_ptr, dims)
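
As a quick, untested continuation of the mapped example, assuming the usual zero-copy semantics of device-mapped pinned memory (collect and Mem.free are used here as they appear elsewhere in this thread):

# cpu_obj and gpu_obj alias the same pinned allocation, so a write through the
# host mapping should be visible through the device mapping without an explicit copy.
cpu_obj .= 1:prod(dims)
@assert collect(gpu_obj) == cpu_obj

# Both wrappers alias buf; free the underlying allocation once, when done.
Mem.free(buf)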

Thoughts?

@wsphillips commented Mar 27, 2019

> Thoughts?

I think it looks good/hits the mark!

As mentioned I won't have access to a GPU until tomorrow, but I will try to look at the branch and see how it holds up to some intentional abuse in the morning 😃

@wsphillips commented

This is reallllllyyyy nice. Just had a chance to do some basic testing.

Only one comment, regarding calling Base.copyto! from CuArrays.

Right now it looks like the default Mem.copy! is what makes this work. Could a method be added there that accepts the extra flags (async + stream)?

e.g. copyto!(gpu_obj, cpu_obj, async=true, stream=someotherstream) would call into the Mem.copy! method that takes the async/stream flags.

> Would it be sane to make the unsafe_copyto!(::HostBuffer) (aka. pinned memory) async by default, whereas copies to and from other kinds of memory would be synchronizing? Or do we still want an async flag?

In my case I would almost always use async. But there are two reasons to pin: speed, and async/non-default-stream copies. I can think of situations where the implicit sync might be useful. I like the idea of having a simple default dispatch plus optional flags (as you've done with Mem.copy!).
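
A minimal sketch of the kind of method being asked for here, assuming a CuArrays-side wrapper that simply forwards the keywords to Mem.copy!. The name, the pointer handling, and the CuDefaultStream() default are illustrative, not the merged implementation:

using CUDAdrv, CuArrays

function copyto_with_flags!(dest::CuArray{T}, src::Array{T};
                            async::Bool=false, stream::CuStream=CuDefaultStream()) where {T}
    # Forward the flags to the driver-level copy; pointer(dest) is assumed to
    # yield the device pointer of the CuArray.
    Mem.copy!(pointer(dest), pointer(src), sizeof(src), async=async, stream=stream)
    return dest
end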

@lcw (Contributor) commented Apr 10, 2019

I just got a chance to test this out. It seems to work great!

I ran the following code

using CUDAdrv

"""
    empiricalbandwidth(nbytes=2*1024^3; devicenumber=0, numtests=10)

Compute the emperical bandwidth in GB/s of a CUDA device using `nbytes` of
memory.  The device to test can be slected with `devicenumber` and the
bandwidth is an average of `ntests`.
"""
function empiricalbandwidth(nbytes=1024^2; devicenumber=0, ntests=10, pinned=false)
    dev = CuDevice(devicenumber)
    ctx = CuContext(dev)
    stm = CuStream()

    d = rand(Char, nbytes)
    a = pinned ? Mem.alloc(Mem.Host, nbytes) : pointer(d)
    b = Mem.alloc(Mem.Device, nbytes)

    Mem.copy!(a, b, nbytes)
    Mem.copy!(b, a, nbytes)

    t0, t1 = CuEvent(), CuEvent()
    record(t0, stm)
    for n = 1:ntests
        Mem.copy!(a, b, nbytes, async=true, stream=stm)
    end
    record(t1, stm)
    synchronize(t1)
    t_dtoh = elapsed(t0, t1)
    bandwidth_dtoh = nbytes*ntests/(t_dtoh*1e9)

    t0, t1 = CuEvent(), CuEvent()
    record(t0, stm)
    for n = 1:ntests
        Mem.copy!(b, a, nbytes, async=true, stream=stm)
    end
    record(t1, stm)
    synchronize(t1)
    t_htod = elapsed(t0, t1)
    bandwidth_htod = nbytes*ntests/(t_htod*1e9)

    pinned && Mem.free(a)
    Mem.free(b)

    (bandwidth_dtoh, bandwidth_htod)
end

and got

❯ julia --project=.
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0-DEV.670 (2019-04-10)
 _/ |\__'_|_|_|\__'_|  |  Commit 20499752ac (0 days old master)
|__/                   |

julia> include("pinned.jl")
empiricalbandwidth

julia> empiricalbandwidth()
(5.266134908961078, 9.937828167245515)

julia> empiricalbandwidth(pinned=true)
(11.4790166633746, 11.016305722250417)

Let me know if there is anything I can do to help get this branch merged in.

maleadt marked this pull request as ready for review on April 12, 2019 08:20
maleadt merged commit d899dd5 into master on April 12, 2019
bors bot deleted the tb/pinned branch on April 12, 2019 08:21

Successfully merging this pull request may close these issues:

Support for cudaHostAlloc() cudaHostGetDevicePtr() cudaFreeHost() including flag options