
Lsim gpu #468

Closed
wants to merge 6 commits into from

Conversation

albheim (Member) commented Apr 25, 2021

Changed some lsim and ltitr code so they support CuArrays, or HeteroStateSpace with CuArrays.

I ran this test on master and on the new branch to make sure performance stayed the same for normal arrays.

using ControlSystems, BenchmarkTools

ny, nx, nu = 10, 100, 20
dt = 0.01
A = randn(nx,nx)
B = randn(nx,nu)
C = randn(ny,nx)
D = zeros(ny,nu)
sys = ss(A, B, C, D)
sysd = c2d(sys, dt)

t = 0:dt:1
x0 = zeros(nx)

tmp = ones(nu)
u(x, t) = tmp
uu = transpose(reduce(hcat, (u(x0, tt) for tt in t)))

@benchmark lsim(sys, u, t; x0=x0)
@benchmark lsim(sys, uu, t; x0=x0)
@benchmark lsim(sysd, u, t; x0=x0)
@benchmark lsim(sysd, uu, t; x0=x0)

Results on master

julia> @benchmark lsim(sys, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  881.52 KiB
  allocs estimate:  3159
  --------------
  minimum time:     1.429 ms (0.00% GC)
  median time:      1.664 ms (0.00% GC)
  mean time:        1.935 ms (7.80% GC)
  maximum time:     22.504 ms (90.67% GC)
  --------------
  samples:          2572
  evals/sample:     1

julia> @benchmark lsim(sys, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  3.84 MiB
  allocs estimate:  106
  --------------
  minimum time:     2.052 ms (0.00% GC)
  median time:      2.321 ms (0.00% GC)
  mean time:        2.529 ms (5.97% GC)
  maximum time:     14.941 ms (75.71% GC)
  --------------
  samples:          1971
  evals/sample:     1

julia> @benchmark lsim(sysd, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  421.77 KiB
  allocs estimate:  823
  --------------
  minimum time:     824.109 μs (0.00% GC)
  median time:      916.514 μs (0.00% GC)
  mean time:        970.925 μs (3.00% GC)
  maximum time:     15.533 ms (92.07% GC)
  --------------
  samples:          5125
  evals/sample:     1

julia> @benchmark lsim(sysd, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  105.56 KiB
  allocs estimate:  20
  --------------
  minimum time:     714.580 μs (0.00% GC)
  median time:      788.074 μs (0.00% GC)
  mean time:        813.631 μs (0.55% GC)
  maximum time:     4.152 ms (79.61% GC)
  --------------
  samples:          6111
  evals/sample:     1

Results on new branch

julia> @benchmark lsim(sys, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  1.61 MiB
  allocs estimate:  2815
  --------------
  minimum time:     1.451 ms (0.00% GC)
  median time:      1.544 ms (0.00% GC)
  mean time:        1.697 ms (5.76% GC)
  maximum time:     18.358 ms (87.38% GC)
  --------------
  samples:          2930
  evals/sample:     1

julia> @benchmark lsim(sys, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  4.10 MiB
  allocs estimate:  302
  --------------
  minimum time:     2.080 ms (0.00% GC)
  median time:      2.182 ms (0.00% GC)
  mean time:        2.456 ms (6.30% GC)
  maximum time:     8.137 ms (67.88% GC)
  --------------
  samples:          2029
  evals/sample:     1

julia> @benchmark lsim(sysd, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  421.81 KiB
  allocs estimate:  827
  --------------
  minimum time:     795.230 μs (0.00% GC)
  median time:      909.728 μs (0.00% GC)
  mean time:        994.502 μs (4.15% GC)
  maximum time:     28.030 ms (95.81% GC)
  --------------
  samples:          4990
  evals/sample:     1

julia> @benchmark lsim(sysd, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  373.31 KiB
  allocs estimate:  216
  --------------
  minimum time:     688.868 μs (0.00% GC)
  median time:      764.116 μs (0.00% GC)
  mean time:        791.641 μs (2.06% GC)
  maximum time:     11.026 ms (90.31% GC)
  --------------
  samples:          6278
  evals/sample:     1

This code was used to compare a normal system to ones using CuArrays:

using ControlSystems, BenchmarkTools, CUDA

ny, nx, nu = 100, 1000, 200
dt = 0.01
T = Float32
A  = randn(T, nx,nx)
B  = randn(T, nx,nu)
C = randn(T, ny,nx)
D = zeros(T, ny,nu)
sys = ss(A, B, C, D)
sysd = c2d(sys, dt)
sysg = HeteroStateSpace(cu(A), cu(B), cu(C), cu(D))
sysdg = HeteroStateSpace(cu(sysd.A), cu(sysd.B), cu(sysd.C), cu(sysd.D), dt)

tmp = ones(nu)
u(x, t) = tmp
tmpg = cu(ones(nu))
ug(x, t) = tmpg

x0 = zeros(nx)
x0g = cu(zeros(nx))

t = 0:dt:1

uu = transpose(reduce(hcat, (u(x0, tt) for tt in t)))
uug = transpose(reduce(hcat, (ug(x0g, tg) for tg in t)))

@benchmark lsim(sys, u, t; x0=x0)
@benchmark lsim(sysg, ug, t; x0=x0g)

@benchmark lsim(sysd, u, t; x0=x0)
@benchmark lsim(sysdg, ug, t; x0=x0g)

@benchmark lsim(sysd, uu, t; x0=x0)
@benchmark lsim(sysdg, uug, t; x0=x0g)

which gave these results:

julia> @benchmark lsim(sys, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  18.91 MiB
  allocs estimate:  3872
  --------------
  minimum time:     81.210 ms (0.00% GC)
  median time:      96.187 ms (0.00% GC)
  mean time:        97.154 ms (1.63% GC)
  maximum time:     112.403 ms (13.90% GC)
  --------------
  samples:          52
  evals/sample:     1

julia> @benchmark lsim(sysg, ug, t; x0=x0g)
BenchmarkTools.Trial: 
  memory estimate:  3.93 MiB
  allocs estimate:  82690
  --------------
  minimum time:     29.376 ms (0.00% GC)
  median time:      30.166 ms (0.00% GC)
  mean time:        35.255 ms (2.59% GC)
  maximum time:     150.480 ms (16.68% GC)
  --------------
  samples:          142
  evals/sample:     1

julia> @benchmark lsim(sysd, u, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  3.70 MiB
  allocs estimate:  831
  --------------
  minimum time:     14.601 ms (0.00% GC)
  median time:      16.871 ms (0.00% GC)
  mean time:        17.373 ms (2.32% GC)
  maximum time:     43.499 ms (57.95% GC)
  --------------
  samples:          288
  evals/sample:     1

julia> @benchmark lsim(sysdg, ug, t; x0=x0g)
BenchmarkTools.Trial: 
  memory estimate:  481.30 KiB
  allocs estimate:  20107
  --------------
  minimum time:     5.606 ms (0.00% GC)
  median time:      17.440 ms (0.00% GC)
  mean time:        18.180 ms (1.11% GC)
  maximum time:     293.485 ms (18.85% GC)
  --------------
  samples:          275
  evals/sample:     1

julia> @benchmark lsim(sysd, uu, t; x0=x0)
BenchmarkTools.Trial: 
  memory estimate:  3.47 MiB
  allocs estimate:  220
  --------------
  minimum time:     10.714 ms (0.00% GC)
  median time:      12.526 ms (0.00% GC)
  mean time:        13.298 ms (2.40% GC)
  maximum time:     37.849 ms (64.22% GC)
  --------------
  samples:          376
  evals/sample:     1

julia> @benchmark lsim(sysdg, uug, t; x0=x0g)
BenchmarkTools.Trial: 
  memory estimate:  223.44 KiB
  allocs estimate:  9603
  --------------
  minimum time:     2.641 ms (0.00% GC)
  median time:      2.774 ms (0.00% GC)
  mean time:        3.348 ms (3.66% GC)
  maximum time:     301.640 ms (25.66% GC)
  --------------
  samples:          1490
  evals/sample:     1

albheim (Member, Author) commented Apr 25, 2021

One other thing: if you are using the GPU you currently have to pass x0 in, since it won't be created correctly otherwise. I tried something like ...x0::AbstractVecOrMat=fill!(similar(A, size(A, 1)), 0) instead of the zeros(eltype(A), size(A, 1)), but it seemed slightly slower for the normal case, so maybe it is better to leave it as is and have those who want to use the GPU pass in x0. This does, however, mean that step for example will not work on the GPU, since x0 is created there as zeros(nx).
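The difference between the two defaults discussed above can be seen with plain CPU arrays (a minimal sketch, no CUDA required): the `similar`-based default produces the same values as `zeros`, but stays in whatever array family `A` belongs to, which is what would make it generic over CuArrays.

```julia
A = randn(Float32, 4, 4)

# Type-generic default: allocates in the same array family as A
# (Array here, CuArray if A were a CuArray), then zeros it in place.
x0_generic = fill!(similar(A, size(A, 1)), 0)

# Current default: always a dense Vector{eltype(A)} on the CPU.
x0_plain = zeros(eltype(A), size(A, 1))

x0_generic == x0_plain  # same values; only the container generality differs
```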

@JuliaControlBot

This is an automated message.
Plots were compared to references. 11/11 images have changed, see differences below.
After pulling this PR, please update the reference images by creating a PR to ControlExamplePlots.jl here.

Difference (reference/new image links omitted):
❌ 0.031
⚠️ 0.025
✔️ 0.008
⚠️ 0.019
⚠️ 0.024
⚠️ 0.025
⚠️ 0.017
⚠️ 0.023
❌ 0.032
✔️ 0.006
✔️ 0.01

src/timeresp.jl
# Using similar instead of Matrix{T} to allow for CuArrays to be used.
# This approach is problematic if x0 is sparse for example, but was considered
# to be good enough for now
x = similar(x0, T, (length(x0), n))

x[:,1] .= x0
mul!(x[:, 2:end], B, transpose(u[1:end-1, :])) # Do all multiplications B*u[:,k] to save view allocations
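A small sketch of why the code above uses `similar` rather than `Matrix{T}`, and of the sparse caveat the comment mentions (values here are illustrative):

```julia
using SparseArrays

x0 = [1.0, 2.0, 3.0]
T, n = Float64, 4

# similar keeps the array family of x0 (Array here, CuArray on the GPU),
# while letting us request a new eltype and shape for the trajectory.
x = similar(x0, T, (length(x0), n))

# The caveat: for a sparse x0, similar also stays in the sparse family,
# which is not what a dense simulation trajectory wants.
x0s = sparsevec([1], [1.0], 3)
xs = similar(x0s, T, (3, n))
issparse(xs)  # stays sparse, hence "problematic if x0 is sparse"
```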
Member:

Was there a problem with mul! on the GPU? Otherwise the mul! with views should avoid the allocation.

Member Author:

There was something with the views that made me remove it. Will have a look at what was going on.

Member:

I think your benchmark might be biased since you input u as a transpose btw

Member:

Worse is that the array is not strided when using views, due to the removal of the last row. Transposes are not always less efficient; A'B is usually more efficient than AB.
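The reason A'B can beat AB is that Julia's adjoint is lazy: `A'` is a zero-copy wrapper, and `A' * B` dispatches straight to BLAS gemm with the transpose flag rather than materializing a transposed copy first. A minimal sketch (names illustrative):

```julia
A = randn(500, 500)
B = randn(500, 500)

# Lazy adjoint: no copy is made; BLAS gemm is called with the 'T' flag.
C1 = A' * B

# Materializing the transpose first pays an extra 500×500 allocation + copy.
C2 = copy(A') * B

C1 ≈ C2  # same result up to floating point; only allocation behavior differs
```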

Contributor:

I think we've already had some offline discussions regarding the input/output format of lsim, but we decided to push the decision to the future. I opened a new issue instead of hijacking this thread, see #469.

Member Author:

@mfalt Yeah, makes sense. Totally missed that.

Trying some different versions of this specific line.

First, calling lsim with a discrete sys and a u vector where (ny, nx, nu) = (100, 500, 200), and where the u vector is no longer a transpose.

  • @views mul!(x[:, 2:end], B, transpose(u[1:end-1, :])) errors on GPU, 4.2 ms median on CPU
  • @views mul!(x[:, 2:end], B, transpose(u)[:, 1:end-1]) 3.2 s median on GPU, 13.7 ms median on CPU
  • x[:, 2:end] .= B * transpose(u[1:end-1, :]) 2.7 ms median on GPU, 4.8 ms median on CPU
  • x[:, 2:end] .= B * transpose(u)[:, 1:end-1] 287 ms median on GPU, 4.8 ms median on CPU

Then calling lsim with (nx, ny, nu) = (1, 5, 2) (CPU only here).

  • @views mul!(x[:, 2:end], B, transpose(u[1:end-1, :])) 31 μs median on CPU
  • @views mul!(x[:, 2:end], B, transpose(u)[:, 1:end-1]) 35 μs median on CPU
  • x[:, 2:end] .= B * transpose(u[1:end-1, :]) 24 μs median on CPU
  • x[:, 2:end] .= B * transpose(u)[:, 1:end-1] 24 μs median on CPU

So the original with views and mul! does not work on the GPU for some reason; I have not dug deeper into why. Of the others, it seems the GPU does not like transposing a sliced matrix (100x slower), which is why I switched it around. This held even when the u vector was not passed in as a transpose.

Member:

I think the problem here is the data layout, as discussed in #469. We need to chop u in a very unfortunate way: either make a new copy and pay the allocation, as in the last two examples, or use a view into a non-strided array, probably causing a generic fallback method to be called rather than an optimized BLAS/CUBLAS method. Had the memory layout been transposed, i.e., u in R^(nu × T), we could use @view and get a strided array hitting a BLAS method, and both GPU and CPU versions would be blazing fast without allocations.

baggepinnen (Member) commented Apr 28, 2021:

Example

using LinearAlgebra, BenchmarkTools

x = zeros(4, 2^10)
u = zeros(4, 2^10)
B = randn(4,4)

@btime @views mul!($(x)[:, 2:end], $(B), $(u)[:, 2:end])   # 3.873 μs (0 allocations: 0 bytes)
@btime @views mul!($(x)[:, 2:end], $(B), $(u')[2:end, :]') # 29.313 μs (6 allocations: 336 bytes)

Member Author:

Yeah, I don't know what the best solution would be.

If #469 happens, I think that would solve the problem.

I also asked in the CUDA.jl community and was told that the array type was too complex for them to do any specialized dispatch (mul! of a view of a transpose of a slice), but that JuliaGPU/GPUArrays.jl#352 would allow us to get it working, though not optimized for the GPU, using the current CPU-optimized code.

albheim (Member, Author) commented May 13, 2021

Closing this; most of it was updated in #480.

@albheim albheim closed this May 13, 2021
@albheim albheim deleted the lsim_gpu branch May 13, 2021 08:57