
Sudden performance drop after many matrix multiplications #359

Closed
tscode opened this issue Jun 26, 2019 · 2 comments

tscode commented Jun 26, 2019

When doing repeated matrix-vector multiplications with a 10000x10000 matrix, performance drops significantly (by a factor of 50 to 100) after ~250 iterations. This is probably related to #323, even though the code should (ideally) not allocate additional arrays, I think, and nvidia-smi also tells me that memory usage stays below 1 GB.

using CuArrays

n = 10000
a = rand(Float32, n)    |> CuArray
b = rand(Float32, n)    |> CuArray
c = rand(Float32, n, n) |> CuArray

for i in 1:1000
  @time b .= c * a
end

Initially it looks like this

0.000166 seconds (67 allocations: 2.047 KiB)
0.000098 seconds (67 allocations: 2.047 KiB)
0.000067 seconds (67 allocations: 2.047 KiB)
...

but after some time it becomes

0.003638 seconds (70 allocations: 2.094 KiB)
0.004694 seconds (70 allocations: 2.094 KiB)
0.003664 seconds (70 allocations: 2.094 KiB)
...

I tried it on different machines under Julia 1.1.0 with a Tesla M10, a GTX 690, and a GTX 1070 Ti, running Ubuntu 18.04 (CUDA 10.1), Ubuntu 16.04 (CUDA 8.0), and Arch Linux (CUDA 10.1), respectively. I tried both add CuArrays and add CuArrays#master.

Questions:

  • can you reproduce this?
  • is there some obvious problem I don't get?
  • if it is GC-related: any easy way to prevent allocations / the slowdown?
tscode added the bug label Jun 26, 2019

kose-y commented Jun 27, 2019

I could reproduce it on my system (GTX 1080 on Ubuntu 16.04, CUDA 9) with Julia 1.0.1.

It begins with

  0.000063 seconds (76 allocations: 2.641 KiB)
  0.000094 seconds (79 allocations: 2.703 KiB)
  0.000072 seconds (80 allocations: 2.719 KiB)
  ...

and slows down to

  0.002140 seconds (83 allocations: 2.766 KiB)
  0.002125 seconds (83 allocations: 2.766 KiB)
  0.002145 seconds (83 allocations: 2.766 KiB)
  ...

after about 500 iterations.

Regarding matrix multiplication, it seems that using mul! (from LinearAlgebra) is more efficient. With the code

using CuArrays
using LinearAlgebra  # for mul!

n = 10000
a = rand(Float32, n)    |> CuArray
b = rand(Float32, n)    |> CuArray
c = rand(Float32, n, n) |> CuArray

for i in 1:2000
  @time mul!(b, c, a)
end

it takes about 0.000013 seconds per iteration at first. However, it still slows down to 0.002 sec/iteration after about 1000 iterations.

By the way, 0.002 sec/iteration is still 10-20x faster than doing the same thing on the CPU (two 10-core Intel(R) Xeon(R) Silver 4114 CPUs @ 2.20 GHz, with the MKL backend), which is already quite satisfactory. Maybe something unexpected is happening with the timing?
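
For context, the CPU comparison is essentially the same loop on plain Arrays, e.g. (a sketch; it assumes BLAS is backed by MKL, which is not set up here):

using LinearAlgebra

n = 10000
a = rand(Float32, n)
b = rand(Float32, n)
c = rand(Float32, n, n)

for i in 1:100
  # mul! on plain Arrays dispatches to the BLAS gemv routine; this runs
  # synchronously, so @time reflects the full computation.
  @time mul!(b, c, a)
end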


tscode commented Jun 27, 2019

Thanks for your response. I took a closer look at CuArrays.jl and realized that I was tricked by the asynchronous nature of CuArrays: kernels are launched asynchronously, so @time only measures the launch, not the computation. Replacing @time ... with @time CuArrays.@sync ... shows the "slower" runtime right from the beginning. I was just confused because I saw no real performance improvement for my algorithm on the M10 compared to the server CPU - but that just seems to be the harsh truth.
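
For reference, a minimal sketch of the synchronized measurement (same setup as before; mul! comes from LinearAlgebra):

using CuArrays, LinearAlgebra

n = 10000
a = rand(Float32, n)    |> CuArray
b = rand(Float32, n)    |> CuArray
c = rand(Float32, n, n) |> CuArray

for i in 1:1000
  # CuArrays.@sync blocks until the GPU has finished the queued work, so @time
  # measures the actual kernel runtime rather than just the asynchronous launch.
  @time CuArrays.@sync mul!(b, c, a)
end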

Sorry for the noise I produced!

tscode closed this as completed Jun 27, 2019