Improvements to DynamicPPLBenchmarks #346
base: master
Conversation
… for downstream tasks
This might be helpful for running benchmarks via CI - https://github.com/tkf/BenchmarkCI.jl
@torfjelde should we improve this PR by incorporating it? Also, https://github.com/TuringLang/TuringExamples contains some very old benchmarking code.
Pull Request Test Coverage Report for Build 13093265728 — details on Coveralls. 💛
Codecov Report: All modified and coverable lines are covered by tests ✅

| Coverage Diff | master | #346 | +/- |
|---------------|--------|------|-----|
| Coverage | 86.17% | 86.17% | |
| Files | 36 | 36 | |
| Lines | 4305 | 4305 | |
| Hits | 3710 | 3710 | |
| Misses | 595 | 595 | |

☔ View full report in Codecov by Sentry.
I think there are a few different things we need to address:

IMO, the CI stuff is not really that crucial. The most important things are a) choosing a suite of models that answers all the questions we want, e.g. how do changes we make affect different implementations of a model, how is scaling wrt. the number of parameters affected, how are compilation times affected, etc., and b) deciding what the output format for all of this should be.
Some further notes on this. IMO we're mainly interested in a few different "experiments". We don't want to be testing every model out there, so there are specific things we want to "answer" with our benchmarks. As a result, I'm leaning more towards a Weave approach, with each notebook answering a distinct question, e.g. "how does the model scale with the number of observations", and subsequently producing outputs that can be compared across versions somehow. That is, I think the overall approach taken in this PR is "correct", but we need to make it much nicer and update how the benchmarks are performed. But then the question is: what are the "questions" we want to answer? Here are a few I can think of:
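As a rough sketch of what one such "question" could look like in code — here, the scaling-with-number-of-observations question mentioned above — the following is an illustration only; the model, sizes, and printing are placeholders, not part of this PR:

```julia
using BenchmarkTools
using Distributions
using DynamicPPL

# Illustrative model only: a normal likelihood over n observations.
@model function scaling_demo(x)
    μ ~ Normal(0, 1)
    σ ~ truncated(Normal(0, 1); lower=0)
    for i in eachindex(x)
        x[i] ~ Normal(μ, σ)
    end
end

# Benchmark plain model evaluation for increasing numbers of observations.
context = DefaultContext()
for n in (10, 100, 1000)
    model = scaling_demo(randn(n))
    vi = VarInfo(model)  # typed VarInfo
    trial = @benchmark $model($vi, $context)
    println("n = $n: median evaluation time = ", median(trial).time, " ns")
end
```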
We can store html of
The Weave approach looks fine, as each notebook could address a specific question!
It took a lot of time to run the benchmarks from this PR locally, so I guess a GH Action is not preferred for this! Let me know what to do next, and I will proceed as you say!
I have looked into this; there are many models, so we must figure out which ones to benchmark.
@shravanngoswamii can you run all models in https://github.com/JasonPekos/TuringPosteriorDB.jl and provide an output like https://nsiccha.github.io/StanBlocks.jl/performance.html#visualization? Let's create a

EDIT: a first step is to
After this is done, start a new PR, work on adding
I don't think we can run this in GHA; it takes too much time to run, so how are we expecting to run individual models? And can you give me a rough idea of what we are expecting from the DynamicPPL benchmarking PR as of now? Can we pick some particular models that can run on GH Actions? And if we are going with the JMD Weave approach, then I guess we will use many JMD scripts in the future... What parameters should be kept in the benchmarks, and do you have any particular format in which we should display benchmark results? I am working on this kind of stuff for the first time, so I guess I am taking too much time to understand even simple things! Really sorry for it!
Ideally, we cherry-pick a suitable set of benchmarks that could run on GitHub CI. Let's consider replacing

More expensive benchmarks could be transferred into a separate script which we can run on private machines if necessary.
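One possible shape for that split, as a minimal sketch with made-up group names and contents: keep a small "quick" BenchmarkGroup for CI and a separate, heavier one for local runs.

```julia
using BenchmarkTools

# Hypothetical split: a cheap suite for CI and a heavier suite for local use.
quick_suite = BenchmarkGroup()
quick_suite["demo1"] = @benchmarkable sum(rand(100))

extensive_suite = BenchmarkGroup()
extensive_suite["big_model"] = @benchmarkable sum(rand(10_000_000))

# CI would only run the quick suite; the extensive one runs on private machines.
run(quick_suite; verbose=false)
```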
Is this kind of GitHub comment fine?
Or maybe this:
Hi @shravanngoswamii, thanks for working on this, and sorry that we've been neglecting this issue a bit. @willtebbutt and I will be more active in the future and can help you with any questions you have about DPPL benchmarking. Could you summarise the work you've done so far? I see you've made a thing that produces some nice tables; is that built on top of this PR? Is the code on some public branch yet?

On a couple of the more specific questions you had: I think it would be great to get some basic benchmarks running on GHA. Results will be variable, because who knows what sort of resources the GHA runner has available, and we can't run anything very heavy, but that's okay. Just getting a GitHub comment with a table like the one you made would be helpful for spotting any horrible regressions where suddenly we are allocating a lot and runtime has gone up tenfold because of some type instability. We could also have a heavier benchmarking suite that can only reasonably be run locally, but does more extensive benchmarks. However, if you've already got code half done for producing the tables for some simple, quick benchmarks, I'm very happy to finish that first and worry about different sorts of benchmarks later.

For the two table formats you proposed, I think either is workable, but maybe the first one would be nicer, to avoid side scrolling. If we add a few more models it's going to get quite long though. Would it make sense to run the full set of six benchmarks (typed, untyped, 4 different simple varinfos) only on one model, and then run only one or two (
Remaining comments which could not be posted as review comments, to avoid the GitHub rate limit:

[JuliaFormatter] reported by reviewdog 🐶 — formatting suggestions for DynamicPPL.jl/benchmarks/results/release-0.30.1/benchmarks.md at 0291c2f: lines 714 to 715, 865 to 866, 1003 to 1005, 1021 to 1022, and 1158 to 1160.
…into tor/benchmark-update
Hello @mhauru, I have updated this branch itself and added a Julia script that generates the Markdown tables and also stores the benchmarking report as a Markdown file and JSON in the results directory. Locally, the generated output looks like this:

>> Running benchmarks for model: demo1
0.013535 seconds (7.19 k allocations: 495.211 KiB, 99.84% compilation time)
>> Running benchmarks for model: demo2
0.006908 seconds (5.35 k allocations: 361.320 KiB, 99.67% compilation time)
## DynamicPPL Benchmark Results (benchmarks_2025-02-02_04-36-46)
### Execution Environment
- Julia version: 1.10.5
- DynamicPPL version: 0.32.2
- Benchmark date: 2025-02-02T04:37:00.205
| Model | Evaluation Type | Time | Memory | Allocs | Samples |
|-------|-------------------------------------------|------------|-----------|--------|---------|
| demo1 | evaluation typed | 191.000 ns | 160 bytes | 3 | 10000 |
| demo1 | evaluation untyped | 1.029 μs | 1.52 KiB | 32 | 10000 |
| demo1 | evaluation simple varinfo dict | 709.000 ns | 704 bytes | 26 | 10000 |
| demo1 | evaluation simple varinfo nt | 43.000 ns | 0 bytes | 0 | 10000 |
| demo1 | evaluation simple varinfo dict from nt | 49.000 ns | 0 bytes | 0 | 10000 |
| demo1 | evaluation simple varinfo componentarrays | 42.000 ns | 0 bytes | 0 | 10000 |
| demo2 | evaluation typed | 273.000 ns | 160 bytes | 3 | 10000 |
| demo2 | evaluation untyped | 2.570 μs | 3.47 KiB | 67 | 10000 |
| demo2 | evaluation simple varinfo dict | 2.169 μs | 1.42 KiB | 60 | 10000 |
| demo2 | evaluation simple varinfo nt | 136.000 ns | 0 bytes | 0 | 10000 |
| demo2 | evaluation simple varinfo dict from nt | 122.000 ns | 0 bytes | 0 | 10000 |
| demo2 | evaluation simple varinfo componentarrays | 137.000 ns | 0 bytes | 0 | 10000 |
Benchmark results saved to: results/benchmarks_2025-02-02_04-36-46
We can just print the generated REPORT.md in comments!
Do you want me to create a web interface for DynamicPPL benchmarks where we can compare multiple benchmark reports, or simply view all the other benchmarks in one place?
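For reference, a stripped-down sketch of how such a Markdown table can be put together from BenchmarkTools results; the helper name, column choice, and use of BenchmarkTools' internal pretty-printing helpers are my own assumptions, not necessarily what the script on this branch does:

```julia
using BenchmarkTools
using Printf

# Turn (model name, evaluation type, Trial) entries into a Markdown table.
function markdown_table(rows)
    lines = [
        "| Model | Evaluation Type | Time | Memory | Allocs |",
        "|-------|-----------------|------|--------|--------|",
    ]
    for (model_name, eval_type, trial) in rows
        est = median(trial)
        push!(lines, @sprintf(
            "| %s | %s | %s | %s | %d |",
            model_name, eval_type,
            BenchmarkTools.prettytime(est.time),      # internal helper
            BenchmarkTools.prettymemory(est.memory),  # internal helper
            est.allocs))
    end
    return join(lines, "\n")
end

# Usage sketch with a dummy benchmark:
trial = @benchmark sum(rand(1000))
println(markdown_table([("demo1", "evaluation typed", trial)]))
```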
Sorry for the slow response, I've been a bit on-and-off work this week. I'll have a look at the code. Would you also be up for talking about this over Zoom? Could be easier.

My first thought is that the table looks good, and we could be close to having the first version of this done by just making those tables be autoposted on PRs. I do wonder about the accumulation of these REPORT.md files; it's nice to be able to see old results for comparison, but we might soon end up with dozens and dozens of them in the repo. Maybe there could be one file in the repo for the latest results on that branch, and you could see how the benchmarks develop by checking the git history of that file? I might check what @willtebbutt has done for this in Mooncake.
Maybe at some point, but for now I think we can focus on getting a first version of this in, where it starts posting comments on PRs and helps us catch any horrible regressions, and worry about fancier setups later.
No worries at all! I’ve also been a bit slow, between exams and some hackathons recently.
Sure! Just let me know when you're available. I’m free anytime after 1:30 PM UTC on regular days, and anytime on Friday, Saturday, and Sunday.
Okay, so I will set up a benchmarking CI for PRs. How about generating one REPORT.md for each version of DPPL? Or maybe appending the reports for each version to a single REPORT.md.
A drive-by comment: I don't think the models currently tested are that useful. These days, benchmarks should be performed with TuringBenchmarking.jl so you can track the gradient timings properly 👍
Agreed that using TuringBenchmarking.jl would be good. Some further thoughts:
I agree with @mhauru's suggestions. @penelopeysm showed some nice examples in #806 (comment). I'd suggest that we turn those into a CI workflow. It is also a good idea to keep these benchmarks useful for DynamicPPL developers rather than aimed at a general audience.
@shravanngoswamii and I just had a call to discuss this. He helped me understand how the current code works, and we decided on the following action items:
I'll take the last item of that list, @shravanngoswamii will take on the others, and I'm available to help whenever needed. The goal is to have a small set of quick, crude benchmarks that you can run locally to get output as plain text (or maybe JSON if we feel like it), and that runs automatically on GHA and posts comments with a results table on PRs. We can then later add more features if/when we want them, such as
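On the "plain text (or maybe JSON)" output point: BenchmarkTools already ships a JSON-based save/load, so a minimal local round-trip could look like the sketch below (the suite contents and file name are arbitrary placeholders):

```julia
using BenchmarkTools

# A throwaway suite, standing in for the real DynamicPPL benchmark suite.
suite = BenchmarkGroup()
suite["demo"] = @benchmarkable sum(rand(1000))

results = run(suite; verbose=false)

# BenchmarkTools serialises results to JSON and reads them back.
BenchmarkTools.save("benchmark_results.json", results)
loaded = BenchmarkTools.load("benchmark_results.json")[1]
println(median(loaded["demo"]))
```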
…into tor/benchmark-update
@mhauru I don't know if the current approach I used is correct or not. Just have a look at it and let me know whatever changes are required!
Thanks @shravanngoswamii, the overall structure and approach here looks good. I had one bug fix to propose, and then some style points and simplifications.
I'll also start making a list of models to test. Would you like for me to push changes to the models to this same PR, or make a PR into this PR?
    context = DefaultContext()

    # Create the chosen varinfo.
    vi = nothing
Since all the branches of the if-statement set `vi`, maybe this could be replaced with something like

    vi = if varinfo_choice == :untyped
        vi = VarInfo()
        model(vi)
        vi
    elseif varinfo_choice == :typed
        VarInfo(model)
    elseif [blahblahblah]
    end

Note that in Julia, if-statements always evaluate to a value, namely the value of the last statement in the branch of the if-block that got evaluated.
This is just a minor style point; the current code works fine.
    using DynamicPPL
    using DynamicPPLBenchmarks
    using BenchmarkTools
    using TuringBenchmarking
    using Distributions
    using PrettyTables
We are trying to move away from unqualified `using X` statements in TuringLang (see TuringLang/Turing.jl#2288). Could these be replaced with either `using X: X`, which then forces you to qualify uses of the module later as `X.foo`, or with `using X: foo` if only one or two names need to be imported from `X`?
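For concreteness, a small sketch of the two suggested styles, using PrettyTables as the example package (the calls are illustrative only):

```julia
# Option 1: bind only the module name, and qualify each use.
using PrettyTables: PrettyTables
PrettyTables.pretty_table(rand(3, 3))

# Option 2: import just the names that are actually needed.
using PrettyTables: pretty_table
pretty_table(rand(3, 3))
```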
    using DynamicPPL
    using BenchmarkTools
Same comment as before about imports.
        :forwarddiff => :forwarddiff,
        :reversediff => :reversediff,
        :zygote => :zygote
    )
This seems unnecessary and unused.
    # Define available VarInfo types.
    # Each entry is (Name, function to produce the VarInfo)
    available_varinfo_types = Dict(
        :untyped => ("UntypedVarInfo", VarInfo),
        :typed => ("TypedVarInfo", m -> VarInfo(m)),
        :simple_namedtuple => ("SimpleVarInfo (NamedTuple)", m -> SimpleVarInfo{Float64}(m())),
        :simple_dict => ("SimpleVarInfo (Dict)", m -> begin
            retvals = m()
            varnames = map(keys(retvals)) do k
                VarName{k}()
            end
            SimpleVarInfo{Float64}(Dict(zip(varnames, values(retvals))))
        end),
    )
This seems unnecessary and unused.
    # Convert results to a 2D array for PrettyTables
    function to_matrix(tuples::Vector{<:NTuple{5,Any}})
        n = length(tuples)
        data = Array{Any}(undef, n, 5)
        for i in 1:n
            for j in 1:5
                data[i, j] = tuples[i][j]
            end
        end
        return data
    end

    table_matrix = to_matrix(results_table)
I think this could be simplified into

    table_matrix = hcat(Iterators.map(collect, zip(results_table...))...)

You could also skip the `Iterators.map(collect, blah)` part if, in the earlier loop, you made the elements of `results_table` be vectors rather than tuples, although I appreciate the neatness of having them be tuples. Or you could have `results_table` be an `Array{Any, 2}(undef, length(chosen_combinations), 5)` from the start. There are a few ways to simplify this, and I might not have thought of the simplest one; feel free to pick your favourite.
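To see that the suggested one-liner does the same thing as to_matrix, here is a quick check on made-up data (the tuples below are fake stand-ins for benchmark rows):

```julia
# Two fake 5-tuples standing in for benchmark result rows.
results_table = [
    ("demo1", "evaluation typed", 191.0, 160, 3),
    ("demo2", "evaluation typed", 273.0, 160, 3),
]

# zip(results_table...) iterates column-wise over the tuples, collect turns
# each column into a vector, and hcat glues the columns into a 2x5 matrix.
table_matrix = hcat(Iterators.map(collect, zip(results_table...))...)
@assert size(table_matrix) == (2, 5)
@assert table_matrix[1, 1] == "demo1" && table_matrix[2, 3] == 273.0
```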
    # Add the evaluation benchmark.
    suite["evaluation"] = @benchmarkable $model($vi, $context)
I think this is unnecessary, because the `make_turing_suite` suite already includes a benchmark of just the plain model evaluation. The results can be found under `results["AD_Benchmarking"]["evaluation"]["standard"]`, whereas the AD backend results are at `results["AD_Benchmarking"]["gradient"]["standard"]`.
    eval_time = median(results["evaluation"]).time
    ad_eval_time = median(results["AD_Benchmarking"]["evaluation"]["standard"]).time
results["AD_Benchmarking"]["evaluation"]["standard"]
is actually getting the time for just the plain model evaluation without gradient, so very similar to the results["evaluation"]
one. I think you want results["AD_Benchmarking"]["gradient"]["standard"]
and results["AD_Benchmarking"]["evaluation"]["standard"]
. See also a comment I left in DynamicPPLBenchmarks.jl that relates to this.
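To make the distinction concrete, here is a hedged sketch of reading both timings out of a suite built with TuringBenchmarking's make_turing_suite. The nesting under "AD_Benchmarking" mirrors the code in this PR; the toy model is mine, and depending on the TuringBenchmarking version the "gradient" group may be further keyed by AD backend:

```julia
using BenchmarkTools
using Distributions
using DynamicPPL
using TuringBenchmarking: make_turing_suite

# A throwaway model, purely for illustration.
@model function demo_grad()
    x ~ Normal(0, 1)
end

suite = BenchmarkGroup()
suite["AD_Benchmarking"] = make_turing_suite(demo_grad())
results = run(suite; verbose=false)

# Plain model evaluation, no gradient involved:
eval_time = median(results["AD_Benchmarking"]["evaluation"]["standard"]).time
# Evaluation plus the gradient through the AD backend:
ad_grad_time = median(results["AD_Benchmarking"]["gradient"]["standard"]).time
```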
Produces results such as those that can be seen here: #309 (comment)