
Simple benchmarking for Tracker, ReverseDiff, and Zygote #1140

Closed

yebai opened this issue Mar 2, 2020 · 27 comments

Comments

@yebai
Member

yebai commented Mar 2, 2020

The following is a simple benchmark of loops on 3 different reverse-mode AD implementations in Julia. At the moment, in terms of efficiency: ReverseDiff > Tracker > Zygote.

julia> mysum(x) = begin z = 0; for i =1:length(x); z += x[i]; end; return z; end
mysum (generic function with 1 method)

julia> using Zygote, Tracker, ReverseDiff, BenchmarkTools

julia> x = rand(10000);

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum, x);
  491.095 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum, x);
  976.786 ms (280102 allocations: 1.50 GiB)
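
As an aside (not raised in the thread): mysum initialises z = 0 as an Int, so the first z += x[i] changes the accumulator's type to Float64, a classic type instability. A type-stable sketch of the same loop (hypothetical name, not benchmarked here):

julia> mysum_stable(x) = begin z = zero(eltype(x)); for i = 1:length(x); z += x[i]; end; return z; end
mysum_stable (generic function with 1 method)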

Benchmark setup:

(v1.2) pkg> st
    Status `~/.julia/environments/v1.2/Project.toml`
  [0bf59076] AdvancedHMC v0.2.20
  [131c737c] ArviZ v0.2.4
  [c52e3926] Atom v0.11.3
  [6e4b80f9] BenchmarkTools v0.4.3
  [5d742f6a] CSVFiles v1.0.0
  [8f4d0f93] Conda v1.3.0
  [a93c6f00] DataFrames v0.20.0
  [163ba53b] DiffResults v1.0.2
  [31c24e10] Distributions v0.21.12
  [bbc10e6e] DynamicHMC v2.1.3
  [f6369f11] ForwardDiff v0.10.8
  [7073ff75] IJulia v1.20.2
  [e5e0dc1b] Juno v0.7.2
  [c7f686f2] MCMCChains v1.0.0
  [91a5bcdd] Plots v0.28.4
  [438e738f] PyCall v1.91.2
  [37e2e3b7] ReverseDiff v1.1.0
  [2913bbd2] StatsBase v0.32.0
  [4c63d2b9] StatsFuns v0.9.3
  [f3b207a7] StatsPlots v0.13.0
  [1d978283] TensorFlow v0.11.0
  [9f7883ad] Tracker v0.2.6
  [fce5fe82] Turing v0.8.2
  [0f1e0344] WebIO v0.8.13
  [e88e6eb3] Zygote v0.4.9
  [ade2ca70] Dates 

@trappmartin
Member

cc: @phipsgabler

@xukai92
Member

xukai92 commented Mar 4, 2020

There is a more comprehensive one available here: https://github.com/TuringLang/TuringExamples/blob/kx/improve_functionality/benchmarks/auto_diff/loop-vs-unrolled.ipynb

@yebai
Member Author

yebai commented Mar 26, 2020

Numbers using PyTorch via ThArrays:

julia> using ThArrays

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

@yebai
Member Author

yebai commented Mar 26, 2020

This example might be overly favourable to ReverseDiff. @mohamed82008 @xukai92

@yebai
Member Author

yebai commented Mar 26, 2020

@KDr2 can you check whether ThArrays has any type-instability issues?

julia> @code_warntype ThArrays.gradient(mysum, x);
Variables
  #self#::Core.Compiler.Const(ThArrays.gradient, false)
  f::Core.Compiler.Const(mysum, false)
  data::Tuple{Array{Float64,1}}

Body::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
1 ─ %1 = Core.tuple(ThArrays.C_NULL, #self#, f)::Core.Compiler.Const((Ptr{Nothing} @0x0000000000000000, ThArrays.gradient, mysum), false)
│   %2 = Core._apply(ThArrays.:(#gradient#30), %1, data)::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
└──      return %2
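
A quick programmatic check for this (a suggestion, not from the thread) is Test.@inferred, which throws if the call's inferred return type does not match the concrete type of the value it actually returns:

julia> using Test

julia> @inferred ThArrays.gradient(mysum, x)  # errors while the Union return type above remains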

@KDr2
Member

KDr2 commented Mar 26, 2020

@yebai
Defining

+tensor_from_ptr(p::Ptr{Tensor{T, N}}) where {T, N} = Tensor{T, N}(p, nothing)

and

-function grad(self::Tensor)
+function grad(self::Tensor{T, N}) where {T, N}
     outputs__ = Int[0]
     __cret = ccall((:atg_grad, :libtorch_capi),
                  Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}),
                  outputs__, self.pointer)
-    __o_1 = tensor_from_ptr(Ptr{Cvoid}(outputs__[1]))
+    __o_1 = tensor_from_ptr(Ptr{Tensor{T, N}}(outputs__[1]))
     return __o_1
 end

erases the type warnings, but I think (and my experiment on this example confirms) it has little influence on the benchmark results.

Maybe do a comparison between ThArrays and PyTorch to see whether the result is sensible?

@yebai
Member Author

yebai commented Mar 26, 2020

Maybe do a comparison between ThArrays and PyTorch to see whether the result is sensible?

Can you code this example in C++, and report the run time?

@KDr2
Member

KDr2 commented Mar 26, 2020

#include <torch/torch.h>
#include <torch/csrc/autograd/variable.h>
#include <torch/csrc/autograd/function.h>

#include <typeinfo>
#include <iostream>

int main() {
    // Roughly the Julia mysum workload: 10000 scalar indexing ops,
    // accumulated on a tensor that requires gradients.
    torch::Tensor t = torch::rand({10000}, torch::requires_grad(true));
    torch::Tensor r(t[0].clone());
    r.set_requires_grad(true);

    for (int i = 0; i < 10000; i++) {
        r += t[i];
    }
    r.backward();
    // std::cout << t.grad() << std::endl;
    t.grad();
    return 0;
}

You can replace csrc/example_app.cpp with this code, then run make in deps/lib, and there will be an example.exe.

On my machine, it takes 0.5s while the Julia version takes 1.25s.

I think in the Julia version most of the time is spent on the indexing ops. In C++, each of these ops is a single function call (and may even be inlined), while in the Julia version each op involves many function calls.

If you use sum instead of mysum, Julia and C++ are at the same level.
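
For reference, that comparison can be scripted directly (a sketch, assuming sum is supported on ThArrays tensors as the comment implies; timings omitted since they vary by machine):

julia> @btime ThArrays.gradient(sum, $x);    # built-in reduction, one vectorised op

julia> @btime ThArrays.gradient(mysum, $x);  # explicit loop, 10000 scalar indexing ops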

I don't know if we can compile our function to TorchScript; if we can, these time-consuming ops would disappear at runtime.

@yebai
Member Author

yebai commented Mar 26, 2020

You can replace csrc/example_app.cpp with this code, then run make in deps/lib, and there will be an example.exe.

Are you running this on Windows?

@yebai
Member Author

yebai commented Mar 26, 2020

On my machine, it takes 0.5s while the Julia version takes 1.25s.

This result (0.5s) seems consistent with julia when using BenchmarkTools:

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

@KDr2
Member

KDr2 commented Mar 26, 2020

No, on Linux. "example.exe" was a typo; the binary is an ELF. Sorry for the confusion.

$ file example-app
example-app: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=4cbe6786f7f899d38c1cc6be65e7435aba5faa0b, for GNU/Linux 3.2.0, not stripped

On my (old) machine, it takes 1.25s in Julia and 0.5s in C++.

@KDr2
Member

KDr2 commented Mar 26, 2020

julia> @btime ThArrays.gradient(mysum, x);
  1.257 s (188995 allocations: 7.92 MiB)


$ time ./example-app >/dev/null

real    0m0.519s
user    0m0.491s
sys     0m0.013s

@yebai
Member Author

yebai commented Mar 26, 2020

Thanks, @KDr2.

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

Based on the C++ runtime for PyTorch, maybe the superiority of ReverseDiff mainly comes from tape caching, which avoids repeated memory allocation?
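
For context, ReverseDiff exposes this tape caching explicitly through its GradientTape/compile API. A minimal sketch (not part of the benchmarks above):

julia> using ReverseDiff

julia> tape = ReverseDiff.GradientTape(mysum, x);  # record mysum once

julia> ctape = ReverseDiff.compile(tape);          # compile the recorded tape

julia> out = similar(x);

julia> ReverseDiff.gradient!(out, ctape, x);       # replay the cached tape, no re-recording

The caveat is that a compiled tape fixes the recorded control flow, so it is only valid for inputs that take the same branches and loop lengths as the input used for recording.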

@yebai
Member Author

yebai commented Mar 26, 2020

I don't know if we can compile our function to TorchScript; if we can, these time-consuming ops would disappear at runtime.

This sounds plausible for a subset of Julia, given that Julia's syntax is so close to Python's (TorchScript is a subset of Python).

@devmotion
Member

I think in the Julia version most of the time is spent on the indexing ops. In C++, each of these ops is a single function call (and may even be inlined), while in the Julia version each op involves many function calls.

I'm wondering if adding @inbounds to mysum could improve the performance of the Julia version?

@KDr2
Member

KDr2 commented Mar 26, 2020

In real-world practice we may not index at such high frequency; e.g. we would use sum instead of mysum.

About TorchScript, I didn't mean a source-to-source translation. I think we can construct the computational graph in Julia, save the graph as TorchScript or some internal format, and then call it from Turing.

@yebai
Member Author

yebai commented Mar 26, 2020

About TorchScript, I didn't mean a source-to-source translation. I think we can construct the computational graph in Julia, save the graph as TorchScript or some internal format, and then call it from Turing.

That works too, but at the cost of losing control flow: loops are unrolled, and only the branch actually taken is recorded.

@yebai
Member Author

yebai commented Mar 26, 2020

The numbers are slightly better with @inbounds:

julia> mysum_inbounds(x) = begin z = 0; @inbounds for i =1:length(x); z += x[i]; end; return z; end
mysum_inbounds (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum_inbounds, x);
  2.436 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum_inbounds, x);
  439.518 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum_inbounds, x);
  856.985 ms (280102 allocations: 1.50 GiB)

julia> @btime ThArrays.gradient(mysum_inbounds, x);
  465.540 ms (160017 allocations: 7.32 MiB)

@devmotion
Member

I guess you should also use $ to interpolate x. It might affect the results since you're benchmarking with a global variable.
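
For readers unfamiliar with the convention, a small sketch of the difference (timings omitted):

julia> x = rand(10000);  # non-const global

julia> @btime ReverseDiff.gradient(mysum, x);   # each sample also pays for the untyped global lookup

julia> @btime ReverseDiff.gradient(mysum, $x);  # $x splices the value in as a typed argument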

@yebai
Member Author

yebai commented Mar 26, 2020

Thanks @devmotion, I reran the benchmarks with $x. The results aren't very different from the ones above.

Interestingly, I did another test to separate the effects of indexing and loops, which produces:

julia> mysum2(x) = begin z = 0; _x = x[1]; for i = 1:10000; z += _x; end; return z; end
mysum2 (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum2, [1.0]);
  2.396 ms (60021 allocations: 2.23 MiB)

julia> @btime Tracker.gradient(mysum2, [1.0]);
  945.712 μs (60022 allocations: 1.83 MiB)

julia> @btime Zygote.gradient(mysum2, [1.0]);
  7.260 ms (100108 allocations: 3.54 MiB)

julia> @btime ThArrays.gradient(mysum2, [1.0]);
  69.599 ms (40026 allocations: 1.68 MiB)

These results, together with the ones above, suggest:

Tracker is quite bad with loops, not sure about indexing.

@devmotion
Member

Tracker is quite bad with loops, not sure about indexing.

Tracker is actually the fastest in the loop benchmark, isn't it?

@yebai
Member Author

yebai commented Mar 26, 2020

Tracker is actually the fastest in the loop benchmark, isn't it?

You're right! I didn't notice the units!

@yebai
Member Author

yebai commented Mar 26, 2020

It's interesting that ReverseDiff is so good with indexing; I wonder whether we can apply the same trick from ReverseDiff to improve Tracker and Zygote.

@mohamed82008
Member

On the flip side, ReverseDiff and Zygote are both slower than Tracker in code with broadcasting.
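
A sketch of the kind of broadcast-heavy workload that shows this (an illustrative function, not one from the thread; timings omitted):

julia> g(x) = sum(exp.(x) .* x .+ x .^ 2);  # broadcast-dominated, no scalar indexing

julia> @btime Tracker.gradient(g, $x);

julia> @btime ReverseDiff.gradient(g, $x);

julia> @btime Zygote.gradient(g, $x);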

@mohamed82008
Member

But I am trying to make ReverseDiff faster.

@KDr2
Member

KDr2 commented Mar 27, 2020

I did some optimizations on ThArrays (compintell/THArrays.jl@c021717); indexing is a little faster (~10%) now.

@yebai
Member Author

yebai commented Jun 19, 2021

I just re-ran the same benchmarks. Here is an update:

julia> @btime ReverseDiff.gradient(mysum, x);
  1.194 ms (70023 allocations: 3.07 MiB)

julia> @btime Tracker.gradient(mysum, x);
  86.576 ms (190012 allocations: 769.27 MiB)

julia> @btime Zygote.gradient(mysum, x);
  90.346 ms (180107 allocations: 769.40 MiB)

@yebai yebai closed this as completed Nov 12, 2022