
Lsim gpu #468

Closed
wants to merge 6 commits into from

Conversation

albheim (Member) commented Apr 25, 2021

Changed some lsim and ltitr code so they support CuArrays, or HeteroStateSpace with CuArrays.

I ran this test on master and on the new branch to make sure performance stayed the same for normal arrays.

using ControlSystems, BenchmarkTools

ny, nx, nu = 10, 100, 20
dt = 0.01
A = randn(nx,nx)
B = randn(nx,nu)
C = randn(ny,nx)
D = zeros(ny,nu)
sys = ss(A, B, C, D)
sysd = c2d(sys, dt)

t = 0:dt:1
x0 = zeros(nx)

tmp = ones(nu)
u(x, t) = tmp
uu = transpose(reduce(hcat, (u(x0, tt) for tt in t)))

@benchmark lsim(sys, u, t; x0=x0)
@benchmark lsim(sys, uu, t; x0=x0)
@benchmark lsim(sysd, u, t; x0=x0)
@benchmark lsim(sysd, uu, t; x0=x0)

Results on master

julia> @benchmark lsim(sys, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  881.52 KiB
  allocs estimate:  3159
  --------------
  minimum time:     1.429 ms (0.00% GC)
  median time:      1.664 ms (0.00% GC)
  mean time:        1.935 ms (7.80% GC)
  maximum time:     22.504 ms (90.67% GC)
  --------------
  samples:          2572
  evals/sample:     1

julia> @benchmark lsim(sys, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  3.84 MiB
  allocs estimate:  106
  --------------
  minimum time:     2.052 ms (0.00% GC)
  median time:      2.321 ms (0.00% GC)
  mean time:        2.529 ms (5.97% GC)
  maximum time:     14.941 ms (75.71% GC)
  --------------
  samples:          1971
  evals/sample:     1

julia> @benchmark lsim(sysd, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  421.77 KiB
  allocs estimate:  823
  --------------
  minimum time:     824.109 μs (0.00% GC)
  median time:      916.514 μs (0.00% GC)
  mean time:        970.925 μs (3.00% GC)
  maximum time:     15.533 ms (92.07% GC)
  --------------
  samples:          5125
  evals/sample:     1

julia> @benchmark lsim(sysd, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  105.56 KiB
  allocs estimate:  20
  --------------
  minimum time:     714.580 μs (0.00% GC)
  median time:      788.074 μs (0.00% GC)
  mean time:        813.631 μs (0.55% GC)
  maximum time:     4.152 ms (79.61% GC)
  --------------
  samples:          6111
  evals/sample:     1

Results on new branch

julia> @benchmark lsim(sys, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  1.61 MiB
  allocs estimate:  2815
  --------------
  minimum time:     1.451 ms (0.00% GC)
  median time:      1.544 ms (0.00% GC)
  mean time:        1.697 ms (5.76% GC)
  maximum time:     18.358 ms (87.38% GC)
  --------------
  samples:          2930
  evals/sample:     1

julia> @benchmark lsim(sys, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  4.10 MiB
  allocs estimate:  302
  --------------
  minimum time:     2.080 ms (0.00% GC)
  median time:      2.182 ms (0.00% GC)
  mean time:        2.456 ms (6.30% GC)
  maximum time:     8.137 ms (67.88% GC)
  --------------
  samples:          2029
  evals/sample:     1

julia> @benchmark lsim(sysd, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  421.81 KiB
  allocs estimate:  827
  --------------
  minimum time:     795.230 μs (0.00% GC)
  median time:      909.728 μs (0.00% GC)
  mean time:        994.502 μs (4.15% GC)
  maximum time:     28.030 ms (95.81% GC)
  --------------
  samples:          4990
  evals/sample:     1

julia> @benchmark lsim(sysd, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  373.31 KiB
  allocs estimate:  216
  --------------
  minimum time:     688.868 μs (0.00% GC)
  median time:      764.116 μs (0.00% GC)
  mean time:        791.641 μs (2.06% GC)
  maximum time:     11.026 ms (90.31% GC)
  --------------
  samples:          6278
  evals/sample:     1

This code was used to compare a normal system to ones using CuArrays:

using ControlSystems, BenchmarkTools, CUDA

ny, nx, nu = 100, 1000, 200
dt = 0.01
T = Float32
A  = randn(T, nx,nx)
B  = randn(T, nx,nu)
C = randn(T, ny,nx)
D = zeros(T, ny,nu)
sys = ss(A, B, C, D)
sysd = c2d(sys, dt)
sysg = HeteroStateSpace(cu(A), cu(B), cu(C), cu(D))
sysdg = HeteroStateSpace(cu(sysd.A), cu(sysd.B), cu(sysd.C), cu(sysd.D), dt)

tmp = ones(nu)
u(x, t) = tmp
tmpg = cu(ones(nu))
ug(x, t) = tmpg

x0 = zeros(nx)
x0g = cu(zeros(nx))

t = 0:dt:1

uu = transpose(reduce(hcat, (u(x0, tt) for tt in t)))
uug = transpose(reduce(hcat, (ug(x0g, tg) for tg in t)))

@benchmark lsim(sys, u, t; x0=x0)
@benchmark lsim(sysg, ug, t; x0=x0g)

@benchmark lsim(sysd, u, t; x0=x0)
@benchmark lsim(sysdg, ug, t; x0=x0g)

@benchmark lsim(sysd, uu, t; x0=x0)
@benchmark lsim(sysdg, uug, t; x0=x0g)

which gave these results:

julia> @benchmark lsim(sys, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  18.91 MiB
  allocs estimate:  3872
  --------------
  minimum time:     81.210 ms (0.00% GC)
  median time:      96.187 ms (0.00% GC)
  mean time:        97.154 ms (1.63% GC)
  maximum time:     112.403 ms (13.90% GC)
  --------------
  samples:          52
  evals/sample:     1

julia> @benchmark lsim(sysg, ug, t; x0=x0g)
BenchmarkTools.Trial: 
  memory estimate:  3.93 MiB
  allocs estimate:  82690
  --------------
  minimum time:     29.376 ms (0.00% GC)
  median time:      30.166 ms (0.00% GC)
  mean time:        35.255 ms (2.59% GC)
  maximum time:     150.480 ms (16.68% GC)
  --------------
  samples:          142
  evals/sample:     1

julia> @benchmark lsim(sysd, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  3.70 MiB
  allocs estimate:  831
  --------------
  minimum time:     14.601 ms (0.00% GC)
  median time:      16.871 ms (0.00% GC)
  mean time:        17.373 ms (2.32% GC)
  maximum time:     43.499 ms (57.95% GC)
  --------------
  samples:          288
  evals/sample:     1

julia> @benchmark lsim(sysdg, ug, t; x0=x0g)
BenchmarkTools.Trial: 
  memory estimate:  481.30 KiB
  allocs estimate:  20107
  --------------
  minimum time:     5.606 ms (0.00% GC)
  median time:      17.440 ms (0.00% GC)
  mean time:        18.180 ms (1.11% GC)
  maximum time:     293.485 ms (18.85% GC)
  --------------
  samples:          275
  evals/sample:     1

julia> @benchmark lsim(sysd, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  3.47 MiB
  allocs estimate:  220
  --------------
  minimum time:     10.714 ms (0.00% GC)
  median time:      12.526 ms (0.00% GC)
  mean time:        13.298 ms (2.40% GC)
  maximum time:     37.849 ms (64.22% GC)
  --------------
  samples:          376
  evals/sample:     1

julia> @benchmark lsim(sysdg, uug, t; x0=x0g)
BenchmarkTools.Trial: 
  memory estimate:  223.44 KiB
  allocs estimate:  9603
  --------------
  minimum time:     2.641 ms (0.00% GC)
  median time:      2.774 ms (0.00% GC)
  mean time:        3.348 ms (3.66% GC)
  maximum time:     301.640 ms (25.66% GC)
  --------------
  samples:          1490
  evals/sample:     1

albheim (Member, Author) commented Apr 25, 2021

One other thing: if you are using the GPU you currently have to pass x0 in, since it won't be created correctly otherwise. I tried something like ...x0::AbstractVecOrMat=fill!(similar(A, size(A, 1)), 0) instead of the zeros(eltype(A), size(A, 1)), but it seemed slightly slower for the normal case, so maybe it is better to leave it as is and have those who want to use the GPU pass in x0. This does, however, mean that step for example will not work on the GPU, since x0 is created there as zeros(nx).
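The difference between the two defaults discussed above can be seen with plain CPU arrays (a minimal sketch, no CUDA required): the `similar`-based default produces the same values as `zeros`, but stays in whatever array family `A` belongs to, which is what would make it generic over CuArrays.

```julia
A = randn(Float32, 4, 4)

# Type-generic default: allocates in the same array family as A
# (Array here, CuArray if A were a CuArray), then zeros it in place.
x0_generic = fill!(similar(A, size(A, 1)), 0)

# Current default: always a dense Vector{eltype(A)} on the CPU.
x0_plain = zeros(eltype(A), size(A, 1))

x0_generic == x0_plain  # same values; only the container generality differs
```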

@JuliaControlBot

This is an automated message.
Plots were compared to references. 11/11 images have changed, see differences below.
After pulling this PR, please update the reference images by creating a PR to ControlExamplePlots.jl here.

Difference (reference/new image links omitted):
❌ 0.031
⚠️ 0.025
✔️ 0.008
⚠️ 0.019
⚠️ 0.024
⚠️ 0.025
⚠️ 0.017
⚠️ 0.023
❌ 0.032
✔️ 0.006
✔️ 0.01

src/timeresp.jl
# Using similar instead of Matrix{T} to allow for CuArrays to be used.
# This approach is problematic if x0 is sparse for example, but was considered
# to be good enough for now
x = similar(x0, T, (length(x0), n))

x[:,1] .= x0
mul!(x[:, 2:end], B, transpose(u[1:end-1, :])) # Do all multiplications B*u[:,k] to save view allocations
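A small sketch of why the code above uses `similar` rather than `Matrix{T}`, and of the sparse caveat the comment mentions (values here are illustrative):

```julia
using SparseArrays

x0 = [1.0, 2.0, 3.0]
T, n = Float64, 4

# similar keeps the array family of x0 (Array here, CuArray on the GPU),
# while letting us request a new eltype and shape for the trajectory.
x = similar(x0, T, (length(x0), n))

# The caveat: for a sparse x0, similar also stays in the sparse family,
# which is not what a dense simulation trajectory wants.
x0s = sparsevec([1], [1.0], 3)
xs = similar(x0s, T, (3, n))
issparse(xs)  # stays sparse, hence "problematic if x0 is sparse"
```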
Member:

Was there a problem with mul! on the GPU? Otherwise the mul! with views should avoid the allocation.

Member Author:

There was something with the views that made me remove it. Will have a look at what was going on.

Member:

I think your benchmark might be biased since you input u as a transpose btw

Member:

Worse is that the array is not strided when using views, due to the removal of the last row. Transposes are not always less efficient; A'B is usually more efficient than AB.
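The reason A'B can beat AB is that Julia's adjoint is lazy: `A'` is a zero-copy wrapper, and `A' * B` dispatches straight to BLAS gemm with the transpose flag rather than materializing a transposed copy first. A minimal sketch (names illustrative):

```julia
A = randn(500, 500)
B = randn(500, 500)

# Lazy adjoint: no copy is made; BLAS gemm is called with the 'T' flag.
C1 = A' * B

# Materializing the transpose first pays an extra 500×500 allocation + copy.
C2 = copy(A') * B

C1 ≈ C2  # same result up to floating point; only allocation behavior differs
```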

Contributor:

I think we've already had some offline discussions regarding the input/output format of lsim, but we decided to push the decision to the future. I opened a new issue instead of hijacking this thread, see #469.

Member Author:

@mfalt Yeah, makes sense. Totally missed that.

Trying some different versions of this specific line.

First, calling lsim with a discrete sys and a u vector where (ny, nx, nu) = (100, 500, 200), and where the u vector is no longer a transpose.

  • @views mul!(x[:, 2:end], B, transpose(u[1:end-1, :])) errors on GPU, 4.2 ms median on CPU
  • @views mul!(x[:, 2:end], B, transpose(u)[:, 1:end-1]) 3.2 s median on GPU, 13.7 ms median on CPU
  • x[:, 2:end] .= B * transpose(u[1:end-1, :]) 2.7 ms median on GPU, 4.8 ms median on CPU
  • x[:, 2:end] .= B * transpose(u)[:, 1:end-1] 287 ms median on GPU, 4.8 ms median on CPU

Then calling lsim with (nx, ny, nu) = (1, 5, 2) (CPU only here).

  • @views mul!(x[:, 2:end], B, transpose(u[1:end-1, :])) 31 μs median on CPU
  • @views mul!(x[:, 2:end], B, transpose(u)[:, 1:end-1]) 35 μs median on CPU
  • x[:, 2:end] .= B * transpose(u[1:end-1, :]) 24 μs median on CPU
  • x[:, 2:end] .= B * transpose(u)[:, 1:end-1] 24 μs median on CPU

So the original with views and mul! does not work on the GPU for some reason; I have not dug deeper into why. Of the others, it seems the GPU does not like transposing a sliced matrix (100x slower), which is why I switched it around. This held even when the u vector was not passed in as a transpose.

Member:

I think the problem here is the data layout, as discussed in #469. We need to chop u in a very unfortunate way: either make a new copy and pay the allocation, as in the last two examples, or use a view into a non-strided array, probably causing a generic fallback method to be called rather than an optimized BLAS/CUBLAS method. Had the memory layout been transposed, i.e., u in R^(nu × T), we could use @view and get a strided array hitting a BLAS method, and both GPU and CPU versions would be blazing fast without allocations.

baggepinnen (Member) commented Apr 28, 2021:

Example

using LinearAlgebra, BenchmarkTools

x = zeros(4, 2^10)
u = zeros(4, 2^10)
B = randn(4,4)

@btime @views mul!($(x)[:, 2:end], $(B), $(u)[:, 2:end])   # 3.873 μs (0 allocations: 0 bytes)
@btime @views mul!($(x)[:, 2:end], $(B), $(u')[2:end, :]') # 29.313 μs (6 allocations: 336 bytes)

Member Author:

Yeah, I don't know what the best solution would be.

If #469 happens, I think that would solve the problem.

I also asked in the CUDA.jl community and was told that the array type was too complex for them to do any specialized dispatch (mul! of a view of a transpose of a slice), but that JuliaGPU/GPUArrays.jl#352 would allow us to get it working, though not optimized for the GPU, using the current CPU-optimized code.

albheim (Member, Author) commented May 13, 2021

Closing this; most of it was updated in #480.

@albheim albheim closed this May 13, 2021
@albheim albheim deleted the lsim_gpu branch May 13, 2021 08:57