
Sudden performance drop after many matrix multiplications #359

Closed
tscode opened this issue Jun 26, 2019 · 2 comments

tscode commented Jun 26, 2019

When doing repeated matrix-vector multiplications with a 10000x10000 matrix, performance drops significantly (by a factor of 50 to 100) after ~250 iterations. This is probably related to #323, even though the code should (ideally) not allocate additional arrays, I think, and nvidia-smi also tells me that memory usage stays below 1 GB.

using CuArrays

n = 10000
a = rand(Float32, n)    |> CuArray
b = rand(Float32, n)    |> CuArray
c = rand(Float32, n, n) |> CuArray

for i in 1:1000
  @time b .= c * a
end

Initially it looks like this

0.000166 seconds (67 allocations: 2.047 KiB)
0.000098 seconds (67 allocations: 2.047 KiB)
0.000067 seconds (67 allocations: 2.047 KiB)
...

but after some time it becomes

0.003638 seconds (70 allocations: 2.094 KiB)
0.004694 seconds (70 allocations: 2.094 KiB)
0.003664 seconds (70 allocations: 2.094 KiB)
...

I tried it on different machines under Julia 1.1.0 with a Tesla M10, a GTX 690, and a GTX 1070 Ti, running Ubuntu 18.04 (CUDA 10.1), Ubuntu 16.04 (CUDA 8.0), and Arch Linux (CUDA 10.1), respectively. I tried both add CuArrays and add CuArrays#master.

Questions:

  • can you reproduce this?
  • is there some obvious problem I don't get?
  • if it is GC-related: any easy way to prevent allocations / the slowdown?
tscode added the bug label Jun 26, 2019

kose-y commented Jun 27, 2019

I could reproduce it on my system (GTX 1080 on Ubuntu 16.04, CUDA 9) with Julia 1.0.1.

It begins with

  0.000063 seconds (76 allocations: 2.641 KiB)
  0.000094 seconds (79 allocations: 2.703 KiB)
  0.000072 seconds (80 allocations: 2.719 KiB)
  ...

and slows down to

  0.002140 seconds (83 allocations: 2.766 KiB)
  0.002125 seconds (83 allocations: 2.766 KiB)
  0.002145 seconds (83 allocations: 2.766 KiB)
  ...

after about 500 iterations.

Regarding matrix multiplication, it seems that using mul! (from LinearAlgebra) is more efficient. With the code

using CuArrays
using LinearAlgebra  # for mul!

n = 10000
a = rand(Float32, n)    |> CuArray
b = rand(Float32, n)    |> CuArray
c = rand(Float32, n, n) |> CuArray

for i in 1:2000
  @time mul!(b, c, a)
end

it takes about 0.000013 seconds per iteration at first. However, it still slows down to 0.002 sec/iteration after about 1000 iterations.

By the way, 0.002 sec/iteration is still 10-20x faster than doing the same thing on the CPU (two 10-core Intel(R) Xeon(R) Silver 4114 CPUs @ 2.20 GHz, with the MKL backend), which is already quite satisfactory. Maybe something unexpected is happening with the timing?
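
For context, the CPU comparison is essentially the same loop on plain Arrays, e.g. (a sketch; it assumes BLAS is backed by MKL, which is not set up here):

using LinearAlgebra

n = 10000
a = rand(Float32, n)
b = rand(Float32, n)
c = rand(Float32, n, n)

for i in 1:100
  # mul! on plain Arrays dispatches to the BLAS gemv routine; this runs
  # synchronously, so @time reflects the full computation.
  @time mul!(b, c, a)
end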


tscode commented Jun 27, 2019

Thanks for your response. I took a closer look at CuArrays.jl and realized that I was tricked by the asynchronous nature of CuArrays: kernels are launched asynchronously, so @time only measures the launch, not the computation. Replacing @time ... with @time CuArrays.@sync ... shows the "slower" runtime right from the beginning. I was just confused because I saw no real performance improvement for my algorithm on the M10 compared to the server CPU - but that just seems to be the harsh truth.
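
For reference, a minimal sketch of the synchronized measurement (same setup as before; mul! comes from LinearAlgebra):

using CuArrays, LinearAlgebra

n = 10000
a = rand(Float32, n)    |> CuArray
b = rand(Float32, n)    |> CuArray
c = rand(Float32, n, n) |> CuArray

for i in 1:1000
  # CuArrays.@sync blocks until the GPU has finished the queued work, so @time
  # measures the actual kernel runtime rather than just the asynchronous launch.
  @time CuArrays.@sync mul!(b, c, a)
end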

Sorry for the noise I produced!

tscode closed this as completed Jun 27, 2019