
Simple benchmarking for Tracker, ReverseDiff, and Zygote #1140

Closed

yebai opened this issue Mar 2, 2020 · 27 comments

Comments

@yebai
Member

yebai commented Mar 2, 2020

The following is a simple benchmark of loops on 3 different reverse-mode AD implementations in Julia. At the moment, in terms of efficiency: ReverseDiff > Tracker > Zygote.

julia> mysum(x) = begin z = 0; for i =1:length(x); z += x[i]; end; return z; end
mysum (generic function with 1 method)

julia> using Zygote, Tracker, ReverseDiff, BenchmarkTools

julia> x = rand(10000);

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum, x);
  491.095 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum, x);
  976.786 ms (280102 allocations: 1.50 GiB)
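
As an aside (not raised in the thread): mysum initialises z = 0 as an Int, so the first z += x[i] changes the accumulator's type to Float64, a classic type instability. A type-stable sketch of the same loop (hypothetical name, not benchmarked here):

julia> mysum_stable(x) = begin z = zero(eltype(x)); for i = 1:length(x); z += x[i]; end; return z; end
mysum_stable (generic function with 1 method)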

Benchmark setup:

(v1.2) pkg> st
    Status `~/.julia/environments/v1.2/Project.toml`
  [0bf59076] AdvancedHMC v0.2.20
  [131c737c] ArviZ v0.2.4
  [c52e3926] Atom v0.11.3
  [6e4b80f9] BenchmarkTools v0.4.3
  [5d742f6a] CSVFiles v1.0.0
  [8f4d0f93] Conda v1.3.0
  [a93c6f00] DataFrames v0.20.0
  [163ba53b] DiffResults v1.0.2
  [31c24e10] Distributions v0.21.12
  [bbc10e6e] DynamicHMC v2.1.3
  [f6369f11] ForwardDiff v0.10.8
  [7073ff75] IJulia v1.20.2
  [e5e0dc1b] Juno v0.7.2
  [c7f686f2] MCMCChains v1.0.0
  [91a5bcdd] Plots v0.28.4
  [438e738f] PyCall v1.91.2
  [37e2e3b7] ReverseDiff v1.1.0
  [2913bbd2] StatsBase v0.32.0
  [4c63d2b9] StatsFuns v0.9.3
  [f3b207a7] StatsPlots v0.13.0
  [1d978283] TensorFlow v0.11.0
  [9f7883ad] Tracker v0.2.6
  [fce5fe82] Turing v0.8.2
  [0f1e0344] WebIO v0.8.13
  [e88e6eb3] Zygote v0.4.9
  [ade2ca70] Dates 

@trappmartin
Member

cc: @phipsgabler

@xukai92
Member

xukai92 commented Mar 4, 2020

There is a more comprehensive one available here: https://github.com/TuringLang/TuringExamples/blob/kx/improve_functionality/benchmarks/auto_diff/loop-vs-unrolled.ipynb

@yebai
Member Author

yebai commented Mar 26, 2020

Numbers using PyTorch via ThArrays:

julia> using ThArrays

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

@yebai
Member Author

yebai commented Mar 26, 2020

This example might be overly favourable to ReverseDiff. @mohamed82008 @xukai92

@yebai
Member Author

yebai commented Mar 26, 2020

@KDr2 can you check whether ThArrays has any type-instability issues?

julia> @code_warntype ThArrays.gradient(mysum, x);
Variables
  #self#::Core.Compiler.Const(ThArrays.gradient, false)
  f::Core.Compiler.Const(mysum, false)
  data::Tuple{Array{Float64,1}}

Body::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
1 ─ %1 = Core.tuple(ThArrays.C_NULL, #self#, f)::Core.Compiler.Const((Ptr{Nothing} @0x0000000000000000, ThArrays.gradient, mysum), false)
│   %2 = Core._apply(ThArrays.:(#gradient#30), %1, data)::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
└──      return %2
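
A quick programmatic check for this (a suggestion, not from the thread) is Test.@inferred, which throws if the call's inferred return type does not match the concrete type of the value it actually returns:

julia> using Test

julia> @inferred ThArrays.gradient(mysum, x)  # errors while the Union return type above remains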

@KDr2
Member

KDr2 commented Mar 26, 2020

@yebai
Defining

+tensor_from_ptr(p::Ptr{Tensor{T, N}}) where {T, N} = Tensor{T, N}(p, nothing)

and

-function grad(self::Tensor)
+function grad(self::Tensor{T, N}) where {T, N}
     outputs__ = Int[0]
     __cret = ccall((:atg_grad, :libtorch_capi),
                  Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}),
                  outputs__, self.pointer)
-    __o_1 = tensor_from_ptr(Ptr{Cvoid}(outputs__[1]))
+    __o_1 = tensor_from_ptr(Ptr{Tensor{T, N}}(outputs__[1]))
     return __o_1
 end

erases the type warnings, but I think (and my experiment on this example confirms) it has little influence on the benchmark results.

Maybe do a comparison between ThArrays and PyTorch to see whether the result is sensible?

@yebai
Member Author

yebai commented Mar 26, 2020

Maybe do a comparison between ThArrays and PyTorch to see whether the result is sensible?

Can you code this example in C++, and report the run time?

@KDr2
Member

KDr2 commented Mar 26, 2020

#include <torch/torch.h>
#include <torch/csrc/autograd/variable.h>
#include <torch/csrc/autograd/function.h>

#include <typeinfo>
#include <iostream>

int main() {
    // Roughly the Julia mysum workload: 10000 scalar indexing ops,
    // accumulated on a tensor that requires gradients.
    torch::Tensor t = torch::rand({10000}, torch::requires_grad(true));
    torch::Tensor r(t[0].clone());
    r.set_requires_grad(true);

    for (int i = 0; i < 10000; i++) {
        r += t[i];
    }
    r.backward();
    // std::cout << t.grad() << std::endl;
    t.grad();
    return 0;
}

You can replace csrc/example_app.cpp with this code, then run make in deps/lib, and there will be an example.exe.

On my machine, it takes 0.5s while the Julia version takes 1.25s.

I think in the Julia version most of the time is spent on the indexing ops. In C++, each of these ops is a single function call (and may even be inlined), while in the Julia version each op involves many function calls.

If you use sum instead of mysum, Julia and C++ are at the same level.
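
For reference, that comparison can be scripted directly (a sketch, assuming sum is supported on ThArrays tensors as the comment implies; timings omitted since they vary by machine):

julia> @btime ThArrays.gradient(sum, $x);    # built-in reduction, one vectorised op

julia> @btime ThArrays.gradient(mysum, $x);  # explicit loop, 10000 scalar indexing ops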

I don't know if we can compile our function to TorchScript; if we can, these time-consuming ops would disappear at runtime.

@yebai
Member Author

yebai commented Mar 26, 2020

You can replace csrc/example_app.cpp with this code, then run make in deps/lib, and there will be an example.exe.

Are you running this on Windows?

@yebai
Member Author

yebai commented Mar 26, 2020

On my machine, it takes 0.5s while the Julia version takes 1.25s.

This result (0.5s) seems consistent with julia when using BenchmarkTools:

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

@KDr2
Member

KDr2 commented Mar 26, 2020

No, on Linux. "example.exe" was a typo; the binary is an ELF. Sorry for the confusion.

$ file example-app
example-app: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=4cbe6786f7f899d38c1cc6be65e7435aba5faa0b, for GNU/Linux 3.2.0, not stripped

On my (old) machine, it takes 1.25s in Julia and 0.5s in C++.

@KDr2
Member

KDr2 commented Mar 26, 2020

julia> @btime ThArrays.gradient(mysum, x);
  1.257 s (188995 allocations: 7.92 MiB)


$ time ./example-app >/dev/null

real    0m0.519s
user    0m0.491s
sys     0m0.013s

@yebai
Member Author

yebai commented Mar 26, 2020

Thanks, @KDr2.

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

Based on the C++ runtime for PyTorch, maybe the superiority of ReverseDiff mainly comes from tape caching, which avoids repeated memory allocation?
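
For context, ReverseDiff exposes this tape caching explicitly through its GradientTape/compile API. A minimal sketch (not part of the benchmarks above):

julia> using ReverseDiff

julia> tape = ReverseDiff.GradientTape(mysum, x);  # record mysum once

julia> ctape = ReverseDiff.compile(tape);          # compile the recorded tape

julia> out = similar(x);

julia> ReverseDiff.gradient!(out, ctape, x);       # replay the cached tape, no re-recording

The caveat is that a compiled tape fixes the recorded control flow, so it is only valid for inputs that take the same branches and loop lengths as the input used for recording.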

@yebai
Member Author

yebai commented Mar 26, 2020

I don't know if we can compile our function to TorchScript; if we can, these time-consuming ops would disappear at runtime.

This sounds plausible for a subset of Julia, given that Julia's syntax is so close to Python's (TorchScript is a subset of Python).

@devmotion
Member

I think in the Julia version most of the time is spent on the indexing ops. In C++, each of these ops is a single function call (and may even be inlined), while in the Julia version each op involves many function calls.

I'm wondering if adding @inbounds to mysum could improve the performance of the Julia version?

@KDr2
Member

KDr2 commented Mar 26, 2020

In real-world practice we may not index at such high frequency; e.g. we would use sum instead of mysum.

About TorchScript, I didn't mean a source-to-source translation. I think we can construct the computational graph in Julia, save the graph as TorchScript or some internal format, and then call it from Turing.

@yebai
Member Author

yebai commented Mar 26, 2020

About TorchScript, I didn't mean a source-to-source translation. I think we can construct the computational graph in Julia, save the graph as TorchScript or some internal format, and then call it from Turing.

That works too, but at the cost of losing control flow: loops are unrolled, and only the branch actually taken is recorded.

@yebai
Member Author

yebai commented Mar 26, 2020

The numbers are slightly better with @inbounds:

julia> mysum_inbounds(x) = begin z = 0; @inbounds for i =1:length(x); z += x[i]; end; return z; end
mysum_inbounds (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum_inbounds, x);
  2.436 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum_inbounds, x);
  439.518 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum_inbounds, x);
  856.985 ms (280102 allocations: 1.50 GiB)

julia> @btime ThArrays.gradient(mysum_inbounds, x);
  465.540 ms (160017 allocations: 7.32 MiB)

@devmotion
Member

I guess you should also use $ to interpolate x. It might affect the results since you're benchmarking with a global variable.
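
For readers unfamiliar with the convention, a small sketch of the difference (timings omitted):

julia> x = rand(10000);  # non-const global

julia> @btime ReverseDiff.gradient(mysum, x);   # each sample also pays for the untyped global lookup

julia> @btime ReverseDiff.gradient(mysum, $x);  # $x splices the value in as a typed argument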

@yebai
Member Author

yebai commented Mar 26, 2020

Thanks @devmotion, I reran the benchmarks with $x. The results aren't very different from the ones above.

Interestingly, I did another test to separate the effects of indexing and loops, which produces:

julia> mysum2(x) = begin z = 0; _x = x[1]; for i = 1:10000; z += _x; end; return z; end
mysum2 (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum2, [1.0]);
  2.396 ms (60021 allocations: 2.23 MiB)

julia> @btime Tracker.gradient(mysum2, [1.0]);
  945.712 μs (60022 allocations: 1.83 MiB)

julia> @btime Zygote.gradient(mysum2, [1.0]);
  7.260 ms (100108 allocations: 3.54 MiB)

julia> @btime ThArrays.gradient(mysum2, [1.0]);
  69.599 ms (40026 allocations: 1.68 MiB)

These results, together with the ones above, suggest:

Tracker is quite bad with loops, not sure about indexing.

@devmotion
Member

Tracker is quite bad with loops, not sure about indexing.

Tracker is actually the fastest in the loop benchmark, isn't it?

@yebai
Member Author

yebai commented Mar 26, 2020

Tracker is actually the fastest in the loop benchmark, isn't it?

You're right! I didn't notice the units!

@yebai
Member Author

yebai commented Mar 26, 2020

It's interesting that ReverseDiff is so good with indexing; I wonder whether we can apply the same trick from ReverseDiff to improve Tracker and Zygote.

@mohamed82008
Member

On the flip side, ReverseDiff and Zygote are both slower than Tracker in code with broadcasting.
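
A sketch of the kind of broadcast-heavy workload that shows this (an illustrative function, not one from the thread; timings omitted):

julia> g(x) = sum(exp.(x) .* x .+ x .^ 2);  # broadcast-dominated, no scalar indexing

julia> @btime Tracker.gradient(g, $x);

julia> @btime ReverseDiff.gradient(g, $x);

julia> @btime Zygote.gradient(g, $x);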

@mohamed82008
Member

But I am trying to make ReverseDiff faster.

@KDr2
Member

KDr2 commented Mar 27, 2020

I did some optimizations on ThArrays (compintell/THArrays.jl@c021717); indexing is a little faster (~10%) now.

@yebai
Member Author

yebai commented Jun 19, 2021

I just re-ran the same benchmarks. Here is an update:

julia> @btime ReverseDiff.gradient(mysum, x);
  1.194 ms (70023 allocations: 3.07 MiB)

julia> @btime Tracker.gradient(mysum, x);
  86.576 ms (190012 allocations: 769.27 MiB)

julia> @btime Zygote.gradient(mysum, x);
  90.346 ms (180107 allocations: 769.40 MiB)

@yebai yebai closed this as completed Nov 12, 2022