
Introduce AsyncNumber to lazily copy numeric mapreduce results to the host #550

Open: wants to merge 8 commits into master
Conversation

pxl-th commented Jul 6, 2024

Introduce GPUNumber to store the result of mapreduce across all dims (i.e. dims = :) instead of immediately transferring it to the host.

GPUNumber copies its value to the host when it is used as a regular number.
But if it is used for broadcasting, it passes along its underlying 1-element array without a device-to-host copy.

This seems to be a minimally breaking change, except for the return type of sum and the like.
Using AbstractNumbers.jl, since it implements the whole Number interface conveniently.

More context: FluxML/Zygote.jl#1513
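
A minimal sketch of the idea (illustrative only; the forwarding hooks below are assumptions, not the PR's exact code):

using AbstractNumbers, GPUArrays
const AN = AbstractNumbers

# Wrap the 1-element device array produced by mapreduce(...; dims=:) so the
# scalar stays on the device until it is actually needed.
struct GPUNumber{T <: GPUArrays.AbstractGPUArray} <: AN.AbstractNumber{T}
    val::T
end

# AbstractNumbers.jl forwards the whole Number interface through `number`;
# this is the point where the device-to-host copy finally happens.
# (AbstractNumbers.jl may require a few more hooks, e.g. basetype; omitted here.)
AN.number(g::GPUNumber) = Array(g.val)[1]

# For broadcasting, hand over the underlying 1-element array instead, so fused
# GPU kernels can consume it without a device-to-host copy.
Base.broadcastable(g::GPUNumber) = g.val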

  • Before:
julia> x = AMDGPU.rand(Float32, 1024);

julia> @btime sum($x);
  64.340 μs (146 allocations: 6.11 KiB)

julia> @btime Zygote.gradient($x) do x
           sum(1f0 .- x)
       end;
  131.617 μs (368 allocations: 16.10 KiB)
  • Now:
julia> @btime sum($x);
  15.809 μs (140 allocations: 5.84 KiB)

julia> @btime Zygote.gradient($x) do x
           sum(1f0 .- x)
       end;
  44.123 μs (356 allocations: 15.57 KiB)

Here's also a timeline profile for the Flux.jl model for the same region.
We can see that now there are no device-to-host copies.

Timeline profile screenshots: before vs. now (images omitted).

pxl-th (Author) commented Jul 7, 2024

@maleadt, the remaining CUDA.jl issues are because of method signatures, which can easily be updated to handle the change. I can open a corresponding PR in CUDA.jl if you think this change is fine.

maleadt (Member) commented Jul 8, 2024

Surprisingly small change! Still not a fan though :-P How much additional pressure does this put on the GC?

which can easily be updated to handle the change

Only without losing some "type safety", right? IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

pxl-th (Author) commented Jul 8, 2024

To avoid changing method signatures we can just change this line to:

- m=maximum(I), n=maximum(J);
+ m=maximum(I)[], n=maximum(J)[];

IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

You mean that the compiler wouldn't be able to infer the type?
I think it is type-stable, since we maintain the underlying type in GPUNumber{T} and GPUNumber just adds a thin forwarding layer.

julia> x = AMDGPU.rand(Float32, 16);

julia> f(x::Number) = x + 2
f (generic function with 1 method)

julia> @code_warntype sum(x)
MethodInstance for sum(::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer})
  from sum(a::AbstractArray; dims, kw...) @ Base reducedim.jl:1010
Arguments
  #self#::Core.Const(sum)
  a::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}
Body::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
1 ─      nothing
│   %2 = Base.:(:)::Core.Const(Colon())
│   %3 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %4 = Base.pairs(%3)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│   %5 = Base.:(var"#sum#828")(%2, %4, #self#, a)::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
└──      return %5

julia> @code_warntype(f(sum(x)))
MethodInstance for f(::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}})
  from f(x::Number) @ Main REPL[3]:1
Arguments
  #self#::Core.Const(f)
  x::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
Body::Float32
1 ─ %1 = (x + 2)::Float32
└──      return %1

pxl-th (Author) commented Jul 8, 2024

How much additional pressure does this put on the GC?

For the Flux model that I use for testing, the machine (with a single AMD GPU) consistently hangs after several hundred training steps, with or without this change...
So I'm not sure it should be a blocker for this PR.

But generally, I really do think that something needs to be done about GC and GPU interaction, since this is the biggest blocker for Julia + DL right now, unless you have 2+ GPUs or use a headless server...

maleadt (Member) commented Jul 8, 2024

IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

You mean that the compiler wouldn't be able to infer the type?

I'm not thinking of type stability or inference, but of the ability to restrict method definitions (e.g. for dispatch purposes). If we have foo(::Real) and foo(::Complex), we lose the ability to distinguish between the two when all that can be written is foo(::GPUNumber). Or when you have a method bar(::CuArray{T}, ::T) where {T <: Number}, signalling that the (el)types need to match; this would become impossible too.
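
To make that concrete, a small hypothetical example (CUDA.jl types used purely for illustration):

using CUDA

# With plain scalars these two methods are distinguishable by dispatch:
foo(x::Real)    = "real path"
foo(x::Complex) = "complex path"
# With a single wrapper type, callers could only write foo(::GPUNumber)
# and would have to re-dispatch on the wrapped eltype themselves.

# Likewise, an eltype-matching signature like this could no longer be called
# as bar(A, sum(A)), since sum(A) would now be a GPUNumber rather than a T:
bar(A::CuArray{T}, x::T) where {T <: Number} = A .+ x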

The fact that we're now indicating that mapreduce returns a Number also seems wrong to me. It's possible to reduce non-numbers, e.g.:

# returns `missing` when missing values are involved
function Base.:(==)(A::AnyGPUArray, B::AnyGPUArray)
    if axes(A) != axes(B)
        return false
    end

    function mapper(a, b)
        eq = (a == b)
        if ismissing(eq)
            (; is_missing=true, is_equal=#=don't care=#false)
        else
            (; is_missing=false, is_equal=eq)
        end
    end
    function reducer(a, b)
        if a.is_missing || b.is_missing
            (; is_missing=true, is_equal=#=don't care=#false)
        else
            (; is_missing=false, is_equal=a.is_equal & b.is_equal)
        end
    end

    res = mapreduce(mapper, reducer, A, B; init=(; is_missing=false, is_equal=true))
    res.is_missing ? missing : res.is_equal
end
I take it this works because getproperty is forwarded to the inner value? That doesn't seem scalable though, as you can't forward everything to the inner value (and with arbitrary values being possible here, it's impossible to tell what to forward).

It's not that I don't want this to work; I just want to avoid breaking code. The fact that the tests seem relatively OK is a good argument in favor though, and worst case we can always revert what's ultimately just an optimization.

pxl-th (Author) commented Jul 8, 2024

I take it this works because getproperty is forwarded to the inner value?

No, GPUNumber only inherits the Number interface. For everything else (like that reducer example) the user would need to explicitly transfer the value to the host by calling getindex, i.e. sum(x)[] (or AN.number, as in this PR).

The fact that we're now indicating that mapreduce returns a Number also seems wrong to me too.

Maybe we can have GPUNumber for numbers and some other container for non-Number types?

Or when you have a method bar(::CuArray{T}, ::T) where {T <: Number} signalling that the (el)types need to match; this would become impossible too.

Yeah, that's a consequence of making host transfers more explicit I guess...
Maybe we can expose something like (as defined in this PR):

maybe_number(g::GPUNumber) = AN.number(g)
maybe_number(g) = g

So that if users write code for both CPU and GPU, they can call maybe_number (or some other name, like materialize) where needed to make sure the value is actually a scalar.
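
For example (hypothetical usage; it assumes the helper above is exported):

# Generic code that works for both CPU and GPU arrays: on the CPU `sum`
# returns a plain Float32, on the GPU (with this PR) it returns a GPUNumber;
# maybe_number marks the explicit device-to-host transfer point.
function total(x)
    s = sum(x)
    return maybe_number(s)
end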

struct GPUNumber{T <: AbstractGPUArray} <: AN.AbstractNumber{T}
    val::T

    function GPUNumber(val::T) where T <: AbstractGPUArray
vpuri3 (Contributor) commented on the diff above, Jul 8, 2024

@pxl-th it might be a good idea to do a dropdims or vec in the constructor so this guy truly behaves like a number. The case I am thinking of is the following.

a = CUDA.rand(1,1,1,1) |> GPUNumber
x = CUDA.rand(4,4)
size(x .+ a) # want (4,4) but will likely get (4,4,1,1)

I haven't run your code but it seems like it would do something similar to

julia> rand(1,1,1,1) .+ rand(4,4) |> size
(4, 4, 1, 1)
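
For illustration, one way that constructor tweak could look (a sketch reusing the names from the diff above, not the PR's final code):

struct GPUNumber{T <: AbstractGPUArray} <: AN.AbstractNumber{T}
    val::T

    function GPUNumber(val::AbstractGPUArray)
        length(val) == 1 ||
            throw(ArgumentError("GPUNumber expects a 1-element array"))
        v = vec(val)  # drop singleton dims so broadcasting behaves like a scalar
        new{typeof(v)}(v)
    end
end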

pxl-th (Author) replied:

Indeed, thanks!

maleadt (Member) commented Jul 9, 2024

For everything else (like that reducer example) the user would need to explicitly transfer the value to the host by calling getindex, i.e. sum(x)[]

But that's a breaking (and non-generic) change? So yeah, making this happen only for Numbers where we can be "sure" we implement the number interface transparently seems like the safer choice. But even then I'm still wary that this will break user code, so this will need thorough testing...

pxl-th (Author) commented Jul 9, 2024

So I've made it behave as usual when the eltype is not a Number, and return a GPUNumber otherwise.
I'll also do some more testing and benchmarking to see the impact.
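
Roughly, the eltype-based decision could look like this (a sketch with a hypothetical helper name, not the PR's actual code):

using GPUArrays: AbstractGPUArray

# Post-process the 1-element device result of mapreduce(...; dims=:):
# wrap numeric results lazily, copy anything else to the host eagerly as before.
maybe_wrap(res::AbstractGPUArray{<:Number}) = GPUNumber(res)
maybe_wrap(res::AbstractGPUArray) = Array(res)[1]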

Review thread on src/host/gpunumber.jl (outdated, resolved).
maleadt (Member) commented Jul 9, 2024

Also, some bikeshedding: AsyncNumber might be clearer than the otherwise uninformative GPUNumber. Or AsyncValue, if we find a way to generalize this, which I'm not sure we will. It also sounds a bit like a Ref, but that term is pretty overloaded already.

maleadt changed the title from "Introduce GPUNumber to lazily copy mapreduce result to host" to "Introduce AsyncNumber to lazily copy numeric mapreduce results to the host" on Jul 11, 2024
pxl-th (Author) commented Jul 11, 2024

Here's also a timeline for the Flux.jl model, this time with CUDA.jl.
Profiling covers 20 training steps and explicitly avoids any host transfers, such as visualizing loss values.
Before it took ~29 seconds, now ~19 seconds. This also includes some other PRs that avoid host transfers:

When running outside the profiler, one epoch takes ~35-40 minutes (down from 50-55 min), while PyTorch takes ~12 minutes per epoch. So there's still room for improvement (the timeline profile also looks much denser for PyTorch; there are no gaps between GPU kernels).

Timeline screenshots: before, now, and PyTorch (images omitted).

pxl-th (Author) commented Jul 11, 2024

The remaining gaps could be either GC pauses.
Running the profiler with GC logging enabled:

GC: pause 366.60ms. collected 5.253326MB. incr
GC: pause 112.91ms. collected 34.399342MB. full recollect
GC: pause 355.50ms. collected 0.455795MB. incr
GC: pause 114.99ms. collected 30.454090MB. full recollect
GC: pause 371.72ms. collected 0.405960MB. incr
GC: pause 118.76ms. collected 30.621845MB. full recollect
GC: pause 359.00ms. collected 5.808136MB. incr
GC: pause 124.97ms. collected 10.381500MB. full recollect
GC: pause 348.96ms. collected 5.253326MB. incr
GC: pause 112.76ms. collected 37.985279MB. full recollect
GC: pause 348.96ms. collected 0.455795MB. incr
GC: pause 111.61ms. collected 33.037785MB. full recollect
GC: pause 349.60ms. collected 0.405960MB. incr

Or they could be synchronizations during host -> device transfers. It could be that PyTorch's data loader prefetches data on a separate worker/stream to overlap with compute (I don't think it uses pinned memory by default).
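
For reference, the log lines above match the output of Julia's built-in GC logging (available since Julia 1.8), which can be toggled like this:

# Print a line per collection with the pause time and amount collected.
GC.enable_logging(true)

# ... run the training steps being profiled ...

GC.enable_logging(false)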

maleadt (Member) commented Jul 12, 2024

You could use NVTX.jl to visualize GC pauses, https://juliagpu.github.io/NVTX.jl/dev/#Instrumenting-Julia-internals, and maybe add a couple of annotations to key CUDA.jl functions.
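
For example, a sketch based on the linked NVTX.jl docs (exact calls may differ between versions):

using NVTX

# Emit NVTX ranges for Julia GC pauses so they show up in the Nsight timeline.
NVTX.enable_gc_hooks()

# Annotate a region of interest, e.g. one training step (hypothetical call):
NVTX.@range "train step" begin
    # train_step!(model, batch)
end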

pxl-th (Author) commented Jul 12, 2024

You could use NVTX.jl to visualize GC pauses

Timeline screenshots: default GC threads vs. --gcthreads=4 (images omitted).

So it does look like these gaps are GC pauses. I've also logged maybe_collect, and it almost always corresponds to one of these GC pauses; the memory pressure when it is triggered is always > 0.99.

maleadt (Member) commented Jul 12, 2024

Is the GC time spent marking/sweeping in Julia, or are the cuMemFreeAsync calls soaking up the time? I would assume the former, but I've also seen CUDA itself spending significant time in the allocator especially when close to OOM (presumably compacting/defragmenting its heap).

vchuravy (Member) commented:

So it does look like these gaps are GC pauses.

Can you set --gcthreads to be 4 or so?

pxl-th (Author) commented Jul 12, 2024

Is the GC time spent marking/sweeping in Julia, or are the cuMemFreeAsync calls soaking up the time?

The selected green region is where cuMemFreeAsync happens, so I guess the bigger portion of the time is spent on marking/sweeping?

(timeline screenshot omitted)

Can you set --gcthreads to be 4 or so?

@vchuravy I've updated the post above with the --gcthreads=4 option. GC pause times decreased from ~500 ms to ~300 ms, but over a whole training epoch the improvement is maybe a couple of minutes (one epoch still takes ~40 min to run).

vpuri3 (Contributor) commented Sep 7, 2024

@pxl-th what model are you training? I'm curious how much of the performance gain is due to AsyncNumber vs JuliaDiff/ChainRules.jl#801

pxl-th (Author) commented Sep 7, 2024

It was HiFi-GAN, and JuliaDiff/ChainRules.jl#801 probably brings the bigger performance gain.
And I'm not sure at this point how impactful this PR is in real-world use cases, because in the end the real bottleneck was GC calls when we ran out of memory, as described in #550 (comment) above.
