
Introduce AsyncNumber to lazily copy numeric mapreduce results to the host #550

Open: wants to merge 8 commits into master
Conversation

pxl-th commented Jul 6, 2024

Introduce GPUNumber to store the result of mapreduce across all dims (i.e. dims = :) instead of immediately transferring it to the host.

GPUNumber copies its value to the host when it is used as a regular number.
But if it is used for broadcasting, it passes along its underlying 1-element array without a device-to-host copy.

This seems to be a minimally breaking change, except for the return type of sum and the like.
Using AbstractNumbers.jl, since it implements the whole Number interface conveniently.

More context: FluxML/Zygote.jl#1513
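
A minimal sketch of the idea (illustrative only; the forwarding hooks below are assumptions, not the PR's exact code):

using AbstractNumbers, GPUArrays
const AN = AbstractNumbers

# Wrap the 1-element device array produced by mapreduce(...; dims=:) so the
# scalar stays on the device until it is actually needed.
struct GPUNumber{T <: GPUArrays.AbstractGPUArray} <: AN.AbstractNumber{T}
    val::T
end

# AbstractNumbers.jl forwards the whole Number interface through `number`;
# this is the point where the device-to-host copy finally happens.
# (AbstractNumbers.jl may require a few more hooks, e.g. basetype; omitted here.)
AN.number(g::GPUNumber) = Array(g.val)[1]

# For broadcasting, hand over the underlying 1-element array instead, so fused
# GPU kernels can consume it without a device-to-host copy.
Base.broadcastable(g::GPUNumber) = g.val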

  • Before:
julia> x = AMDGPU.rand(Float32, 1024);

julia> @btime sum($x);
  64.340 μs (146 allocations: 6.11 KiB)

julia> @btime Zygote.gradient($x) do x
           sum(1f0 .- x)
       end;
  131.617 μs (368 allocations: 16.10 KiB)
  • Now:
julia> @btime sum($x);
  15.809 μs (140 allocations: 5.84 KiB)

julia> @btime Zygote.gradient($x) do x
           sum(1f0 .- x)
       end;
  44.123 μs (356 allocations: 15.57 KiB)

Here's also a timeline profile for the Flux.jl model for the same region.
We can see that now there are no device-to-host copies.

Timeline profile screenshots: before vs. now (images omitted).

pxl-th (Author) commented Jul 7, 2024

@maleadt, the remaining CUDA.jl issues are because of method signatures, which can easily be updated to handle the change. I can open a corresponding PR in CUDA.jl if you think this change is fine.

maleadt (Member) commented Jul 8, 2024

Surprisingly small change! Still not a fan though :-P How much additional pressure does this put on the GC?

which can easily be updated to handle the change

Only without losing some "type safety", right? IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

pxl-th (Author) commented Jul 8, 2024

To avoid changing method signatures we can just change this line to:

- m=maximum(I), n=maximum(J);
+ m=maximum(I)[], n=maximum(J)[];

IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

You mean that the compiler wouldn't be able to infer the type?
I think it is type-stable, since we maintain the underlying type in GPUNumber{T} and GPUNumber just adds a thin forwarding layer.

julia> x = AMDGPU.rand(Float32, 16);

julia> f(x::Number) = x + 2
f (generic function with 1 method)

julia> @code_warntype sum(x)
MethodInstance for sum(::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer})
  from sum(a::AbstractArray; dims, kw...) @ Base reducedim.jl:1010
Arguments
  #self#::Core.Const(sum)
  a::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}
Body::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
1 ─      nothing
│   %2 = Base.:(:)::Core.Const(Colon())
│   %3 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %4 = Base.pairs(%3)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│   %5 = Base.:(var"#sum#828")(%2, %4, #self#, a)::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
└──      return %5

julia> @code_warntype(f(sum(x)))
MethodInstance for f(::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}})
  from f(x::Number) @ Main REPL[3]:1
Arguments
  #self#::Core.Const(f)
  x::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
Body::Float32
1 ─ %1 = (x + 2)::Float32
└──      return %1

pxl-th (Author) commented Jul 8, 2024

How much additional pressure does this put on the GC?

For the Flux model that I use for testing, the machine (with a single AMD GPU) consistently hangs after several hundred training steps, with or without this change...
So I'm not sure it should be a blocker for this PR.

But generally, I really do think that something needs to be done about GC and GPU interaction, since this is the biggest blocker for Julia + DL right now, unless you have 2+ GPUs or use a headless server...

maleadt (Member) commented Jul 8, 2024

IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

You mean that the compiler wouldn't be able to infer the type?

I'm not thinking of type stability or inference, but of the ability to restrict method definitions (e.g. for dispatch purposes). If we have foo(::Real) and foo(::Complex), we lose the ability to distinguish between the two when all that can be written is foo(::GPUNumber). Or when you have a method bar(::CuArray{T}, ::T) where {T <: Number}, signalling that the (el)types need to match; this would become impossible too.
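
To make that concrete, a small hypothetical example (CUDA.jl types used purely for illustration):

using CUDA

# With plain scalars these two methods are distinguishable by dispatch:
foo(x::Real)    = "real path"
foo(x::Complex) = "complex path"
# With a single wrapper type, callers could only write foo(::GPUNumber)
# and would have to re-dispatch on the wrapped eltype themselves.

# Likewise, an eltype-matching signature like this could no longer be called
# as bar(A, sum(A)), since sum(A) would now be a GPUNumber rather than a T:
bar(A::CuArray{T}, x::T) where {T <: Number} = A .+ x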

The fact that we're now indicating that mapreduce returns a Number also seems wrong to me. It's possible to reduce non-numbers, e.g.:

# returns `missing` when missing values are involved
function Base.:(==)(A::AnyGPUArray, B::AnyGPUArray)
    if axes(A) != axes(B)
        return false
    end

    function mapper(a, b)
        eq = (a == b)
        if ismissing(eq)
            (; is_missing=true, is_equal=#=don't care=#false)
        else
            (; is_missing=false, is_equal=eq)
        end
    end
    function reducer(a, b)
        if a.is_missing || b.is_missing
            (; is_missing=true, is_equal=#=don't care=#false)
        else
            (; is_missing=false, is_equal=a.is_equal & b.is_equal)
        end
    end

    res = mapreduce(mapper, reducer, A, B; init=(; is_missing=false, is_equal=true))
    res.is_missing ? missing : res.is_equal
end
I take it this works because getproperty is forwarded to the inner value? That doesn't seem scalable though, as you can't forward everything to the inner value (and with arbitrary values being possible here, it's impossible to tell what to forward).

It's not that I don't want this to work; I just want to avoid breaking code. The fact that the tests seem relatively OK is a good argument in favor though, and worst case we can always revert what's ultimately just an optimization.

pxl-th (Author) commented Jul 8, 2024

I take it this works because getproperty is forwarded to the inner value?

No, GPUNumber only inherits the Number interface. For everything else (like that reducer example) the user would need to explicitly transfer the value to the host by calling getindex, i.e. sum(x)[] (or AN.number, as in this PR).

The fact that we're now indicating that mapreduce returns a Number also seems wrong to me too.

Maybe we can have GPUNumber for numbers and some other container for non-Number types?

Or when you have a method bar(::CuArray{T}, ::T) where {T <: Number} signalling that the (el)types need to match; this would become impossible too.

Yeah, that's a consequence of making host transfers more explicit I guess...
Maybe we can expose something like (as defined in this PR):

maybe_number(g::GPUNumber) = AN.number(g)
maybe_number(g) = g

So that if users write code for both CPU and GPU, they can call maybe_number (or some other name, like materialize) where needed to make sure the value is actually a scalar.
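
For example (hypothetical usage; it assumes the helper above is exported):

# Generic code that works for both CPU and GPU arrays: on the CPU `sum`
# returns a plain Float32, on the GPU (with this PR) it returns a GPUNumber;
# maybe_number marks the explicit device-to-host transfer point.
function total(x)
    s = sum(x)
    return maybe_number(s)
end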

struct GPUNumber{T <: AbstractGPUArray} <: AN.AbstractNumber{T}
    val::T

    function GPUNumber(val::T) where T <: AbstractGPUArray
vpuri3 (Contributor) commented on the diff above, Jul 8, 2024

@pxl-th it might be a good idea to do a dropdims or vec in the constructor so this guy truly behaves like a number. The case I am thinking of is the following.

a = CUDA.rand(1,1,1,1) |> GPUNumber
x = CUDA.rand(4,4)
size(x .+ a) # want (4,4) but will likely get (4,4,1,1)

I haven't run your code but it seems like it would do something similar to

julia> rand(1,1,1,1) .+ rand(4,4) |> size
(4, 4, 1, 1)
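
For illustration, one way that constructor tweak could look (a sketch reusing the names from the diff above, not the PR's final code):

struct GPUNumber{T <: AbstractGPUArray} <: AN.AbstractNumber{T}
    val::T

    function GPUNumber(val::AbstractGPUArray)
        length(val) == 1 ||
            throw(ArgumentError("GPUNumber expects a 1-element array"))
        v = vec(val)  # drop singleton dims so broadcasting behaves like a scalar
        new{typeof(v)}(v)
    end
end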

pxl-th (Author) replied:

Indeed, thanks!

maleadt (Member) commented Jul 9, 2024

For everything else (like that reducer example) the user would need to explicitly transfer the value to the host by calling getindex, i.e. sum(x)[]

But that's a breaking (and non-generic) change? So yeah, making this happen only for Numbers where we can be "sure" we implement the number interface transparently seems like the safer choice. But even then I'm still wary that this will break user code, so this will need thorough testing...

pxl-th (Author) commented Jul 9, 2024

So I've made it behave as usual when the eltype is not a Number, and return a GPUNumber otherwise.
I'll also do some more testing and benchmarking to see the impact.
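
Roughly, the eltype-based decision could look like this (a sketch with a hypothetical helper name, not the PR's actual code):

using GPUArrays: AbstractGPUArray

# Post-process the 1-element device result of mapreduce(...; dims=:):
# wrap numeric results lazily, copy anything else to the host eagerly as before.
maybe_wrap(res::AbstractGPUArray{<:Number}) = GPUNumber(res)
maybe_wrap(res::AbstractGPUArray) = Array(res)[1]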

Review thread on src/host/gpunumber.jl (outdated, resolved).
maleadt (Member) commented Jul 9, 2024

Also, some bikeshedding: AsyncNumber might be clearer than the otherwise uninformative GPUNumber. Or AsyncValue, if we find a way to generalize this, which I'm not sure we will. It also sounds a bit like a Ref, but that term is pretty overloaded already.

maleadt changed the title from "Introduce GPUNumber to lazily copy mapreduce result to host" to "Introduce AsyncNumber to lazily copy numeric mapreduce results to the host" on Jul 11, 2024
pxl-th (Author) commented Jul 11, 2024

Here's also a timeline for the Flux.jl model, this time with CUDA.jl.
Profiling covers 20 training steps and explicitly avoids any host transfers, such as visualizing loss values.
Before it took ~29 seconds, now ~19 seconds. This also includes some other PRs that avoid host transfers:

When running outside the profiler, one epoch takes ~35-40 minutes (down from 50-55 min), while PyTorch takes ~12 minutes per epoch. So there's still room for improvement (the timeline profile also looks much denser for PyTorch; there are no gaps between GPU kernels).

Timeline screenshots: before, now, and PyTorch (images omitted).

pxl-th (Author) commented Jul 11, 2024

The remaining gaps could be either GC pauses.
Running the profiler with GC logging enabled:

GC: pause 366.60ms. collected 5.253326MB. incr
GC: pause 112.91ms. collected 34.399342MB. full recollect
GC: pause 355.50ms. collected 0.455795MB. incr
GC: pause 114.99ms. collected 30.454090MB. full recollect
GC: pause 371.72ms. collected 0.405960MB. incr
GC: pause 118.76ms. collected 30.621845MB. full recollect
GC: pause 359.00ms. collected 5.808136MB. incr
GC: pause 124.97ms. collected 10.381500MB. full recollect
GC: pause 348.96ms. collected 5.253326MB. incr
GC: pause 112.76ms. collected 37.985279MB. full recollect
GC: pause 348.96ms. collected 0.455795MB. incr
GC: pause 111.61ms. collected 33.037785MB. full recollect
GC: pause 349.60ms. collected 0.405960MB. incr

Or they could be synchronizations during host -> device transfers. It could be that PyTorch's data loader prefetches data on a separate worker/stream to overlap with compute (I don't think it uses pinned memory by default).
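
For reference, the log lines above match the output of Julia's built-in GC logging (available since Julia 1.8), which can be toggled like this:

# Print a line per collection with the pause time and amount collected.
GC.enable_logging(true)

# ... run the training steps being profiled ...

GC.enable_logging(false)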

maleadt (Member) commented Jul 12, 2024

You could use NVTX.jl to visualize GC pauses, https://juliagpu.github.io/NVTX.jl/dev/#Instrumenting-Julia-internals, and maybe add a couple of annotations to key CUDA.jl functions.
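
For example, a sketch based on the linked NVTX.jl docs (exact calls may differ between versions):

using NVTX

# Emit NVTX ranges for Julia GC pauses so they show up in the Nsight timeline.
NVTX.enable_gc_hooks()

# Annotate a region of interest, e.g. one training step (hypothetical call):
NVTX.@range "train step" begin
    # train_step!(model, batch)
end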

pxl-th (Author) commented Jul 12, 2024

You could use NVTX.jl to visualize GC pauses

Timeline screenshots: default GC threads vs. --gcthreads=4 (images omitted).

So it does look like these gaps are GC pauses. I've also logged maybe_collect, and it almost always corresponds to one of these GC pauses; the memory pressure when it is triggered is always > 0.99.

maleadt (Member) commented Jul 12, 2024

Is the GC time spent marking/sweeping in Julia, or are the cuMemFreeAsync calls soaking up the time? I would assume the former, but I've also seen CUDA itself spending significant time in the allocator especially when close to OOM (presumably compacting/defragmenting its heap).

vchuravy (Member) commented:

So it does look like these gaps are GC pauses.

Can you set --gcthreads to be 4 or so?

pxl-th (Author) commented Jul 12, 2024

Is the GC time spent marking/sweeping in Julia, or are the cuMemFreeAsync calls soaking up the time?

The selected green region is where cuMemFreeAsync happens, so I guess the bigger portion of the time is spent on marking/sweeping?

(timeline screenshot omitted)

Can you set --gcthreads to be 4 or so?

@vchuravy I've updated the post above with the --gcthreads=4 option. GC pause times decreased from ~500 ms to ~300 ms, but over a whole training epoch the improvement is maybe a couple of minutes (one epoch still takes ~40 min to run).

vpuri3 (Contributor) commented Sep 7, 2024

@pxl-th what model are you training? I'm curious how much of the performance gain is due to AsyncNumber vs JuliaDiff/ChainRules.jl#801

pxl-th (Author) commented Sep 7, 2024

It was HiFi-GAN, and JuliaDiff/ChainRules.jl#801 probably brings the bigger performance gain.
And I'm not sure at this point how impactful this PR is in real-world use cases, because in the end the real bottleneck was GC calls when we ran out of memory, as described in #550 (comment) above.
