Improvements to DynamicPPLBenchmarks #346

Draft
wants to merge 27 commits into
base: master

Conversation

torfjelde
Member

Produces results like those shown here: #309 (comment)

@torfjelde torfjelde marked this pull request as draft December 3, 2021 00:43
@yebai
Member

yebai commented Dec 16, 2021

This might be helpful for running benchmarks via CI - https://github.com/tkf/BenchmarkCI.jl

@yebai
Member

yebai commented Aug 29, 2022

@torfjelde should we improve this PR by incorporating TuringBenchmarks? Alternatively, we can move all the benchmarking code here into TuringBenchmarks. I am happy with either option, but ideally these benchmarking utilities should live in only one place to minimise confusion.

Also, https://github.com/TuringLang/TuringExamples contains some very old benchmarking code.

cc @xukai92 @devmotion

@coveralls

coveralls commented Feb 2, 2023

Pull Request Test Coverage Report for Build 13093265728

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 51 unchanged lines in 11 files lost coverage.
  • Overall coverage remained the same at 86.259%

Files with Coverage Reduction   New Missed Lines   %
src/varnamedvector.jl           1                  88.25%
src/sampler.jl                  1                  94.03%
src/utils.jl                    2                  73.2%
src/contexts.jl                 3                  30.21%
src/values_as_in_model.jl       3                  69.23%
src/distribution_wrappers.jl    4                  41.67%
src/model.jl                    5                  80.0%
src/varinfo.jl                  6                  84.17%
src/simple_varinfo.jl           6                  81.96%
src/compiler.jl                 8                  86.58%

Totals
Change from base Build 12993040441: 0.0%
Covered Lines: 3710
Relevant Lines: 4301

💛 - Coveralls

@codecov

codecov bot commented Feb 2, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.17%. Comparing base (29a6c7e) to head (6f255d1).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #346   +/-   ##
=======================================
  Coverage   86.17%   86.17%           
=======================================
  Files          36       36           
  Lines        4305     4305           
=======================================
  Hits         3710     3710           
  Misses        595      595           


@torfjelde
Member Author

I think there are a few different things we need to address:

  • How to set up the benchmarks for a given Model. This is already taken care of in TuringBenchmarking.jl; if anything is missing, we should just contribute to that, since this is also useful for end-users.
  • How do we track and compare benchmarks across versions?
  • How do we present the information? Do we use Weave docs like in this PR or do we just present stuff in a table?
  • Which models should we benchmark?
  • Should the benchmarking be part of the CI? If so, how should this be triggered? How do we get compute for this (we can't just use a standard GH action for this but will need our "own" server to run this on)?

IMO, the CI stuff is not really that crucial. The most important things are a) choosing a suite of models that answers all the questions we want, e.g. how do changes we make affect different impls of a model, how is scaling wrt. number of parameters affected, how are compilation times affected, etc., and b) deciding on the output format for all of this.

@torfjelde
Member Author

How do we present the information? Do we use Weave docs like in this PR or do we just present stuff in a table?

Some further notes on this. IMO we're mainly interested in a few different "experiments". We don't want to be testing every model out there, and so there are things we want to "answer" with our benchmarks.

As a result, I'm leaning more towards a Weave approach with each notebook answering a distinct question, e.g. "how does the model scale with number of observations", which subsequently produces outputs that can be compared across versions somehow. That is, I think the overall approach taken in this PR is "correct", but we need to make it much nicer + update how the benchmarks are performed.

But then the question is: what are the "questions" we want to answer? Here are a few I can think of:

  1. How does performance vary across implementations, going from "everything uses for-loops" to "everything is vectorized"?
  2. How do both runtime performance and compilation times scale wrt. number of parameters and observations?
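As a rough sketch of what such a "question" could look like in code, the snippet below times a loop-based and a vectorized implementation of the same computation for a growing number of observations. It uses plain Julia with Base timing only and is not the PR's code: `loglik_loop`, `loglik_vectorized` and `scaling_experiment` are hypothetical names, and the Gaussian log-likelihood is only a stand-in for a real DynamicPPL model, which one would more likely benchmark with BenchmarkTools.

```julia
# Hypothetical stand-in for a model's log-density: a Gaussian log-likelihood
# written once with an explicit loop and once in vectorized form.
function loglik_loop(mu, y)
    acc = 0.0
    for yi in y
        acc += -0.5 * (yi - mu)^2 - 0.5 * log(2pi)
    end
    return acc
end

loglik_vectorized(mu, y) = sum(@. -0.5 * (y - mu)^2 - 0.5 * log(2pi))

# Time `f(mu, y)` for each number of observations.
# Returns a vector of (n, seconds) named tuples, keeping the fastest repeat.
function scaling_experiment(f, sizes; mu=0.3, repeats=5)
    map(sizes) do n
        y = collect(range(-1.0, 1.0; length=n))
        f(mu, y)  # warm-up call so compilation is not included in the timing
        best = minimum([@elapsed(f(mu, y)) for _ in 1:repeats])
        (n=n, seconds=best)
    end
end

sizes = [10, 100, 1_000]
loop_results = scaling_experiment(loglik_loop, sizes)
vec_results = scaling_experiment(loglik_vectorized, sizes)
```

Comparing `loop_results` and `vec_results` entry by entry gives one data point per problem size for question 1, and the trend across `sizes` speaks to question 2.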

@shravanngoswamii
Member

shravanngoswamii commented Oct 19, 2024

How do we track and compare benchmarks across versions?

We can store the HTML rendered from benchmarks.md for each version in gh-pages and serve it at /benchmarks.

How do we present the information? Do we use Weave docs like in this PR or do we just present stuff in a table?

The Weave approach looks fine, as each notebook could address a specific question!

Should the benchmarking be part of the CI? If so, how should this be triggered? How do we get compute for this (we can't just use a standard GH action for this but will need our "own" server to run this on)?

It took a lot of time to run the benchmarks from this PR locally, so I guess GitHub Actions is not a good fit for this!

Let me know what to do next, I will proceed as you say!

@shravanngoswamii
Member

Might want to look at https://github.com/JasonPekos/TuringPosteriorDB.jl.

I have looked into this. There are many models, so we must figure out which ones to benchmark.

@yebai
Member

yebai commented Dec 16, 2024

@shravanngoswamii can you run all models in https://github.com/JasonPekos/TuringPosteriorDB.jl and provide an output like: https://nsiccha.github.io/StanBlocks.jl/performance.html#visualization?

Let's create a qmd notebook for this benchmarking that is easy to run on CI and local machines.

EDIT: a first step is to

  • clean up this PR
  • set up CI to run the jmd scripts and push output to the gh-pages branch
  • merge this PR as "unit benchmarking" for DynamicPPL models

After this is done, start a new PR, work on adding TuringPosteriorDB as an additional set of benchmarks

@shravanngoswamii
Member

shravanngoswamii commented Dec 25, 2024

set up CI to run the jmd scripts and push output to the gh-pages branch

I don't think we can run this in GHA, as it takes too much time to run. So how are we expecting to run individual models? And can you give me a rough idea of what we are expecting from the DynamicPPL benchmarking PR as of now?

Can we pick some particular models that can run on GH Actions? And if we are going with the JMD Weave approach, then I guess we will use many JMD scripts in the future...

What parameters should be kept in benchmarks, and do you have any particular format in which we should display benchmark results?

I am working on this kind of stuff for the first time, so I guess I am taking too much time to understand even simple things! Really sorry for it!

@yebai
Member

yebai commented Jan 6, 2025

Ideally, we cherry-pick a suitable set of benchmarks that could run on GitHub CI. Let's consider replacing jmd files with Julia scripts. We could use PrettyTables.jl to produce readable GitHub comments.

More expensive benchmarks could be transferred into a separate script which we can run on private machines if necessary.
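To make the idea of a readable GitHub comment concrete, here is a minimal sketch that renders benchmark rows as a GitHub-flavoured Markdown table. It is hand-rolled with Base Julia and deliberately does not use PrettyTables.jl, so `markdown_table` is a hypothetical helper rather than that package's API, and the rows are illustrative values only.

```julia
# Render rows as a GitHub-flavoured Markdown table.
# Hand-rolled stand-in for illustration; this is not the PrettyTables.jl API.
function markdown_table(header::Vector{String}, rows::Vector{<:Vector})
    lines = String[]
    push!(lines, "| " * join(header, " | ") * " |")
    push!(lines, "|" * join(fill("---", length(header)), "|") * "|")
    for row in rows
        length(row) == length(header) || error("row has wrong number of columns")
        push!(lines, "| " * join(string.(row), " | ") * " |")
    end
    return join(lines, "\n")
end

# Illustrative rows only.
header = ["Model", "Benchmark", "Time (ns)", "Allocs"]
rows = [
    ["demo1", "evaluation_typed", 245.0, 3],
    ["demo1", "evaluation_untyped", 1249.0, 32],
]
report = markdown_table(header, rows)
println(report)
```

The resulting string can be posted as-is in a PR comment, since GitHub renders pipe tables natively.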

@shravanngoswamii
Member

We could use PrettyTables.jl to produce readable GitHub comments.

Is this kind of GitHub comment fine?

===== Running benchmarks for demo1 and demo2 =====

--- Benchmarking demo1 ---
  0.011462 seconds (7.19 k allocations: 494.820 KiB, 99.82% compilation time)

--- Benchmarking demo2 ---
  0.004775 seconds (5.35 k allocations: 361.320 KiB, 99.48% compilation time)

===== Benchmark Summary Table =====
┌───────┬───────────────────────────────────────────┬───────────┬──────────────┬────────────┬────────┐
│ Model │                              BenchmarkKey │ Time (ns) │ GC Time (ns) │ Memory (B) │ Allocs │
├───────┼───────────────────────────────────────────┼───────────┼──────────────┼────────────┼────────┤
│ demo1 │                          evaluation_typed │     245.0 │          0.0 │        160 │      3 │
│ demo1 │                        evaluation_untyped │    1249.0 │          0.0 │       1552 │     32 │
│ demo1 │            evaluation_simple_varinfo_dict │     771.0 │          0.0 │        704 │     26 │
│ demo1 │              evaluation_simple_varinfo_nt │      50.0 │          0.0 │          0 │      0 │
│ demo1 │    evaluation_simple_varinfo_dict_from_nt │      56.0 │          0.0 │          0 │      0 │
│ demo1 │ evaluation_simple_varinfo_componentarrays │      51.0 │          0.0 │          0 │      0 │
│ demo2 │                          evaluation_typed │     257.0 │          0.0 │        160 │      3 │
│ demo2 │                        evaluation_untyped │    2573.0 │          0.0 │       3552 │     67 │
│ demo2 │            evaluation_simple_varinfo_dict │    2122.0 │          0.0 │       1456 │     60 │
│ demo2 │              evaluation_simple_varinfo_nt │     112.0 │          0.0 │          0 │      0 │
│ demo2 │    evaluation_simple_varinfo_dict_from_nt │     121.0 │          0.0 │          0 │      0 │
│ demo2 │ evaluation_simple_varinfo_componentarrays │     129.0 │          0.0 │          0 │      0 │
└───────┴───────────────────────────────────────────┴───────────┴──────────────┴────────────┴────────┘

Done!

Or maybe this:

===== Running benchmarks for demo1 and demo2 =====

--- Benchmarking demo1 ---
  0.009682 seconds (7.19 k allocations: 494.820 KiB, 99.75% compilation time)

--- Benchmarking demo2 ---
  0.004272 seconds (5.35 k allocations: 361.320 KiB, 99.41% compilation time)

===== Pivoting so each row = 1 BenchmarkKey, multiple models as columns =====
┌───────────────────────────────────────────┬───────────────┬─────────────────┬──────────────┬──────────────┬───────────────┬─────────────────┬──────────────┬──────────────┐
│                                 bench_key │ time_ns_demo1 │ gctime_ns_demo1 │ memory_demo1 │ allocs_demo1 │ time_ns_demo2 │ gctime_ns_demo2 │ memory_demo2 │ allocs_demo2 │
├───────────────────────────────────────────┼───────────────┼─────────────────┼──────────────┼──────────────┼───────────────┼─────────────────┼──────────────┼──────────────┤
│                          evaluation_typed │         189.0 │             0.0 │          160 │            3 │         260.0 │             0.0 │          160 │            3 │
│                        evaluation_untyped │         953.0 │             0.0 │         1552 │           32 │        2720.0 │             0.0 │         3552 │           67 │
│            evaluation_simple_varinfo_dict │         654.0 │             0.0 │          704 │           26 │        2226.0 │             0.0 │         1456 │           60 │
│              evaluation_simple_varinfo_nt │          42.0 │             0.0 │            0 │            0 │         129.0 │             0.0 │            0 │            0 │
│    evaluation_simple_varinfo_dict_from_nt │          49.0 │             0.0 │            0 │            0 │         118.0 │             0.0 │            0 │            0 │
│ evaluation_simple_varinfo_componentarrays │          44.0 │             0.0 │            0 │            0 │         130.0 │             0.0 │            0 │            0 │
└───────────────────────────────────────────┴───────────────┴─────────────────┴──────────────┴──────────────┴───────────────┴─────────────────┴──────────────┴──────────────┘

Done!

@mhauru
Member

mhauru commented Jan 28, 2025

Hi @shravanngoswamii, thanks for working on this, and sorry that we've been neglecting this issue a bit. @willtebbutt and I will be more active in the future and can help you out with any questions or help you need with DPPL benchmarking.

Could you summarise the work you've done so far? I see you've made a thing that produces some nice tables, is that built on top of this PR? Is the code on some public branch yet?

On a couple of the more specific questions you had:

I think it would be great to get some basic benchmarks running on GHA. Results will be variable because who knows what sort of resources the GHA runner has available, and we can't run anything very heavy, but that's okay. Just getting a GitHub comment with a table like the one you made would be helpful to spot any horrible regressions where suddenly we are allocating a lot and runtime has gone up tenfold because of some type instability.

We could also have a heavier benchmarking suite that can only reasonably be run locally, but does more extensive benchmarks. However, if you've got code for producing the tables for some simple quick benchmarks half done already then very happy to finish that first and worry about different sorts of benchmarks later.

For the two table formats you proposed, I think either is workable, but maybe the first one would be nicer, to avoid side scrolling. If we add a few more models it's going to get quite long though. Would it make sense to run the full set of six benchmarks (typed, untyped, 4 different simple varinfos) only on one model, and then run only one or two (evaluation_typed for sure, maybe something else?) on all of the other models? Would help keep the table concise.

Contributor

@github-actions github-actions bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

JuliaFormatter

[JuliaFormatter] reported by reviewdog 🐶
@shravanngoswamii
Member

shravanngoswamii commented Feb 1, 2025

Could you summarise the work you've done so far? I see you've made a thing that produces some nice tables, is that built on top of this PR? Is the code on some public branch yet?

Hello @mhauru, I have updated this branch itself and added a Julia script that generates the Markdown tables and also stores the benchmarking report as a Markdown file and as JSON in the results directory. Locally, the generated tables look like this:

>> Running benchmarks for model: demo1
  0.013535 seconds (7.19 k allocations: 495.211 KiB, 99.84% compilation time)

>> Running benchmarks for model: demo2
  0.006908 seconds (5.35 k allocations: 361.320 KiB, 99.67% compilation time)

## DynamicPPL Benchmark Results (benchmarks_2025-02-02_04-36-46)

### Execution Environment
- Julia version: 1.10.5
- DynamicPPL version: 0.32.2
- Benchmark date: 2025-02-02T04:37:00.205

| Model | Evaluation Type                           |       Time |    Memory | Allocs | Samples |
|-------|-------------------------------------------|------------|-----------|--------|---------|
| demo1 | evaluation typed                          | 191.000 ns | 160 bytes |      3 |   10000 |
| demo1 | evaluation untyped                        |   1.029 μs |  1.52 KiB |     32 |   10000 |
| demo1 | evaluation simple varinfo dict            | 709.000 ns | 704 bytes |     26 |   10000 |
| demo1 | evaluation simple varinfo nt              |  43.000 ns |   0 bytes |      0 |   10000 |
| demo1 | evaluation simple varinfo dict from nt    |  49.000 ns |   0 bytes |      0 |   10000 |
| demo1 | evaluation simple varinfo componentarrays |  42.000 ns |   0 bytes |      0 |   10000 |
| demo2 | evaluation typed                          | 273.000 ns | 160 bytes |      3 |   10000 |
| demo2 | evaluation untyped                        |   2.570 μs |  3.47 KiB |     67 |   10000 |
| demo2 | evaluation simple varinfo dict            |   2.169 μs |  1.42 KiB |     60 |   10000 |
| demo2 | evaluation simple varinfo nt              | 136.000 ns |   0 bytes |      0 |   10000 |
| demo2 | evaluation simple varinfo dict from nt    | 122.000 ns |   0 bytes |      0 |   10000 |
| demo2 | evaluation simple varinfo componentarrays | 137.000 ns |   0 bytes |      0 |   10000 |


Benchmark results saved to: results/benchmarks_2025-02-02_04-36-46

I think it would be great to get some basic benchmarks running on GHA. Results will be variable because who knows what sort of resources the GHA runner has available, and we can't run anything very heavy, but that's okay. Just getting a GitHub comment with a table like the one you made would be helpful to spot any horrible regressions where suddenly we are allocating a lot and runtime has gone up tenfold because of some type instability.

We can just print the generated REPORT.md in comments!

Would help keep the table concise.

Do you want me to create a web interface for DynamicPPL benchmarks where we can compare multiple benchmark reports or simply browse all the other benchmarks?

@mhauru
Member

mhauru commented Feb 6, 2025

Sorry for the slow response, I've been a bit on-and-off work this week.

I'll have a look at the code. Would you also be up for talking about this over Zoom? Could be easier. My first thought is that the table looks good and we could be close to having the first version of this done by just making those tables be autoposted on PRs. I do wonder about the accumulation of these REPORT.md files, it's nice to be able to see old results for comparison, but we might soon end up with dozens and dozens of these in the repo. Maybe there could be one file in the repo for the latest results on that branch, and you can see how benchmarks develop by checking the git history of that file? I might check what @willtebbutt has done for this in Mooncake.

Do you want me to create a web interface for DynamicPPL benchmarks where we can compare multiple benchmark reports or simply browse all the other benchmarks?

Maybe at some point, but for now I think we can focus on getting a first version of this in, where it starts posting comments on PRs and helps us catch any horrible regressions, worry about fancier setups later.

@shravanngoswamii
Member

Sorry for the slow response, I've been a bit on-and-off work this week.

No worries at all! I’ve also been a bit slow, between exams and some hackathons recently.

Would you also be up for talking about this over Zoom?

Sure! Just let me know when you're available. I’m free anytime after 1:30 PM UTC on regular days, and anytime on Friday, Saturday, and Sunday.

My first thought is that the table looks good and we could be close to having the first version of this done by just making those tables be autoposted on PRs. I do wonder about the accumulation of these REPORT.md files, it's nice to be able to see old results for comparison, but we might soon end up with dozens and dozens of these in the repo.

Okay, so I will set up a benchmarking CI for PRs. How about generating one REPORT.md for each version of DPPL? Or maybe appending the reports for each version to a single REPORT.md?

@torfjelde
Member Author

A drive-by comment: I don't think the models currently tested are that useful. These days, benchmarks should be performed with TuringBenchmarking.jl so you can track the gradient timings properly 👍

@mhauru
Member

mhauru commented Feb 13, 2025

Agreed that using TuringBenchmarking.jl would be good.

Some further thoughts:

  1. Weave is unmaintained, and we no longer use it for our docs. I think we should try to move away from it. If switching to Quarto is trivial we could do that. However, this leads to the next question:
  2. What's the value of having the results in notebooks? Could we cut code complexity and our dependencies by simply outputting JSON and/or plain text?
  3. I think having a historical record of benchmark results from various versions isn't very valuable as long as we don't have a standardised piece of hardware and environment to run them in. And I don't think that's happening any time soon. Thus, I would see two uses for benchmarks:
    • Having very crude benchmark results posted on GitHub in PR comments. Just a table like the one @shravanngoswamii posted above. The sort of benchmark where you should pay no attention to any differences that are less than ~50%, but that just alerts you to any horrible failures where either compilation or runtime has gone up in a qualitative jump. These should be lightweight enough to finish in a few minutes, to run on GHA.
    • Having utilities for running more extensive benchmarks locally. If you want to compare two versions you'll have to run both of them yourself, but at least then you know you're doing a fair comparison. These can take longer, but should be runnable on a laptop in preferably substantially less than an hour.
  4. Mooncake has a nice setup for posting comments in PRs: https://github.com/compintell/Mooncake.jl/blob/6c66347bbc50aa92959d34f3ad66b534a1e25442/.github/workflows/CI.yml#L145 We could mimic that. It would allow us to not keep any result files in the repo, which I think would be preferable.
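The "ignore anything under ~50%" rule from point 3 could be sketched roughly as follows. This is plain Julia under stated assumptions: `flag_regressions` is a hypothetical helper, results are assumed to be available as name-to-median-time dictionaries, and the numbers are made up for illustration.

```julia
# Flag benchmarks whose median time grew by more than `threshold` (relative).
# `base` and `head` map benchmark names to median times in nanoseconds.
function flag_regressions(base::Dict{String,Float64}, head::Dict{String,Float64}; threshold=0.5)
    flagged = Pair{String,Float64}[]
    for (name, t_base) in base
        haskey(head, name) || continue  # skip benchmarks missing on one side
        ratio = head[name] / t_base
        if ratio > 1 + threshold
            push!(flagged, name => ratio)
        end
    end
    # Worst slowdown first.
    return sort(flagged; by=last, rev=true)
end

# Made-up numbers, for illustration only.
base = Dict("evaluation_typed" => 200.0, "evaluation_untyped" => 1000.0)
head = Dict("evaluation_typed" => 230.0, "evaluation_untyped" => 9800.0)
flagged = flag_regressions(base, head)
```

Here the 15% change in `evaluation_typed` is treated as noise, while the roughly tenfold jump in `evaluation_untyped` is reported.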

@yebai
Member

yebai commented Feb 14, 2025

I agree with @mhauru's suggestions.

@penelopeysm showed some nice examples #806 (comment). I'd suggest that we turn that into a CI workflow. It is also a good idea to keep these benchmarks useful for DynamicPPL developers rather than for the general audience.

@mhauru
Member

mhauru commented Feb 14, 2025

@shravanngoswamii and I just had a call to discuss this. He helped me understand how the current code works, and we decided on the following action items:

  • Switch use of @benchmarkable within DynamicPPLBenchmarks.jl to a suitable function call from TuringBenchmarking.jl. I think make_turing_suite is the function we need.
  • Remove everything related to Weave documents. Let's make this work with plain text tables first and consider fancier Quarto things later if we feel like it. If there's something in the functions and files that we delete that we may want to come back to using later, maybe make note of it so we know to dig it up from git history when needed.
  • Set up CI to post a table of benchmark results to GitHub PRs without storing any files in the repo, mimicking Mooncake.
  • Add functionality to benchmarks.jl to choose combinations of model, AD backend, and varinfo type to run benchmarks on. Note that we don't want to test all models on all backends and all varinfos (too many benchmarks), so we need to be able to manually pick the combinations we want.
  • Curate a list of model - AD backend - varinfo combinations that we want to benchmark on.

I'll take the last item of that list, @shravanngoswamii will take on the others and I'm available for help whenever needed.
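One possible shape for a manually curated list of combinations is sketched below. It is only an illustration in plain Julia: the entries, the field names and the `select` helper are placeholders and not taken from benchmarks.jl.

```julia
# Each entry is one benchmark to run. All entries are illustrative placeholders.
combinations = [
    (model=:demo1, adbackend=:forwarddiff, varinfo=:typed),
    (model=:demo1, adbackend=:forwarddiff, varinfo=:untyped),
    (model=:demo1, adbackend=:reversediff, varinfo=:typed),
    (model=:demo2, adbackend=:forwarddiff, varinfo=:typed),
]

# Select a subset by any combination of fields; `nothing` means "no filter".
function select(combos; model=nothing, adbackend=nothing, varinfo=nothing)
    filter(combos) do c
        (model === nothing || c.model == model) &&
            (adbackend === nothing || c.adbackend == adbackend) &&
            (varinfo === nothing || c.varinfo == varinfo)
    end
end

typed_only = select(combinations; varinfo=:typed)
demo1_forwarddiff = select(combinations; model=:demo1, adbackend=:forwarddiff)
```

Listing the combinations explicitly, rather than taking the full Cartesian product, keeps the number of benchmarks under manual control.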

The goal is to have a small set of quick, crude benchmarks that you can run locally and get output as plain text (or maybe JSON if we feel like it) and that runs automatically on GHA and posts comments with a results table on PRs. We can then later add more features if/when we want them, such as

  • A standardised set of more comprehensive benchmarks that one can run locally.
  • Quarto output.

@shravanngoswamii
Member

@mhauru I don't know if the current approach I used is correct or not. Please have a look at it and let me know whatever changes are required!

Member

@mhauru mhauru left a comment

Thanks @shravanngoswamii, the overall structure and approach here looks good. I had one bug fix to propose, and then some style points and simplifications.

I'll also start making a list of models to test. Would you like for me to push changes to the models to this same PR, or make a PR into this PR?

context = DefaultContext()

# Create the chosen varinfo.
vi = nothing
Member

Since all the branches of the if-statement set vi, maybe this could be replaced with something like

vi = if varinfo_choice == :untyped
    vi = VarInfo()
    model(vi)
    vi
elseif varinfo_choice == :typed
    VarInfo(model)
elseif [blahblahblah]
end

Note that in Julia if-statements always evaluate to a value: the value of the last statement in the branch of the if-block that got evaluated.

This is just a minor style point, the current code works fine.
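A small self-contained illustration of this behaviour, independent of the PR's code (`varinfo_label` is a hypothetical function used only for the example):

```julia
# `if` is an expression: the whole block evaluates to the last value of the
# branch that ran, so its result can be assigned directly.
function varinfo_label(choice::Symbol)
    label = if choice == :untyped
        "untyped VarInfo"
    elseif choice == :typed
        "typed VarInfo"
    else
        error("unknown choice: $choice")
    end
    return label
end

# When no branch runs and there is no `else`, the expression is `nothing`.
no_branch = if false
    1
end
```

Ending the chain with an `else` that throws avoids silently ending up with `nothing` when an unexpected choice is passed.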

Comment on lines +1 to +6
using DynamicPPL
using DynamicPPLBenchmarks
using BenchmarkTools
using TuringBenchmarking
using Distributions
using PrettyTables
Member

We are trying to move away from unqualified using X statements in TuringLang (see TuringLang/Turing.jl#2288). Could these be replaced with either using X: X, which then forces you to qualify uses of the module later as X.foo, or with using X: foo if only one or two names need to be imported from X?
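For illustration, here are the two import styles side by side, using the Statistics standard library as a stand-in for the packages in the diff:

```julia
# Import only the module name; every use must then be qualified.
using Statistics: Statistics
m1 = Statistics.median([3.0, 1.0, 2.0])

# Import specific names when only one or two are needed.
using Statistics: mean
m2 = mean([1.0, 2.0, 3.0])
```

With either style, a reader can tell where each name comes from without knowing what the module exports.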

Comment on lines 3 to 4
using DynamicPPL
using BenchmarkTools
Member

Same comment as before about imports.

    :forwarddiff => :forwarddiff,
    :reversediff => :reversediff,
    :zygote => :zygote
)
Member

This seems unnecessary and unused.

Comment on lines +38 to +51
# Define available VarInfo types.
# Each entry is (Name, function to produce the VarInfo)
available_varinfo_types = Dict(
    :untyped => ("UntypedVarInfo", VarInfo),
    :typed => ("TypedVarInfo", m -> VarInfo(m)),
    :simple_namedtuple => ("SimpleVarInfo (NamedTuple)", m -> SimpleVarInfo{Float64}(m())),
    :simple_dict => ("SimpleVarInfo (Dict)", m -> begin
        retvals = m()
        varnames = map(keys(retvals)) do k
            VarName{k}()
        end
        SimpleVarInfo{Float64}(Dict(zip(varnames, values(retvals))))
    end)
)
Member

This seems unnecessary and unused.

Comment on lines +73 to +85
# Convert results to a 2D array for PrettyTables
function to_matrix(tuples::Vector{<:NTuple{5,Any}})
    n = length(tuples)
    data = Array{Any}(undef, n, 5)
    for i in 1:n
        for j in 1:5
            data[i, j] = tuples[i][j]
        end
    end
    return data
end

table_matrix = to_matrix(results_table)
Member

I think this could be simplified into

table_matrix = hcat(Iterators.map(collect, zip(results_table...))...)

You could also skip the Iterators.map(collect, blah) part if in the earlier loop you made the elements of results_table be vectors rather than tuples, although I appreciate the neatness of having them be tuples. Or you could have results_table be an Array{Any, 2}(undef, length(chosen_combinations), 5) from the start. There are a few ways to simplify this, I might not have thought of the simplest way, feel free to pick your favourite.
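A quick sketch comparing two of these options on made-up data. The rows and the meaning of the five columns are assumptions for illustration; only the `hcat`/`zip` one-liner is taken from the comment above.

```julia
# Made-up rows with five entries each; the column meanings
# (model, varinfo, AD backend, two timings) are assumed for illustration.
results_table = [
    ("demo1", "typed", "forwarddiff", 191.0, 820.0),
    ("demo2", "typed", "forwarddiff", 273.0, 1310.0),
    ("demo2", "untyped", "reversediff", 2570.0, 9400.0),
]

# Approach from the comment: zip turns the rows into columns, collect makes
# each column a vector, and hcat joins the columns into an n-by-5 matrix.
table_a = hcat(Iterators.map(collect, zip(results_table...))...)

# Alternative: fill a preallocated matrix row by row.
table_b = Array{Any,2}(undef, length(results_table), 5)
for (i, row) in enumerate(results_table)
    table_b[i, :] .= row
end
```

Both give a 3×5 matrix with one benchmark per row. The preallocated version is longer but avoids splatting, which may matter if the table grows large.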

Comment on lines +44 to +46
# Add the evaluation benchmark.
suite["evaluation"] = @benchmarkable $model($vi, $context)

Member

I think this is unnecessary, because the make_turing_suite suite already includes a benchmark of just the plain model evaluation. The results can be found under results["AD_Benchmarking"]["evaluation"]["standard"], whereas the AD backend results are at results["AD_Benchmarking"]["gradient"]["standard"].

Comment on lines +68 to +69
eval_time = median(results["evaluation"]).time
ad_eval_time = median(results["AD_Benchmarking"]["evaluation"]["standard"]).time
Member

results["AD_Benchmarking"]["evaluation"]["standard"] is actually getting the time for just the plain model evaluation without gradient, so very similar to the results["evaluation"] one. I think you want results["AD_Benchmarking"]["gradient"]["standard"] and results["AD_Benchmarking"]["evaluation"]["standard"]. See also a comment I left in DynamicPPLBenchmarks.jl that relates to this.
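To illustrate the distinction between the two keys, the sketch below uses a plain nested Dict as a stand-in for the benchmark results. In a real run these entries would hold BenchmarkTools trial objects rather than numbers, and the timings here are made up.

```julia
# Stand-in for the nested benchmark results, with made-up times in ns.
# A real run would hold BenchmarkTools trials here, not plain numbers.
results = Dict(
    "AD_Benchmarking" => Dict(
        "evaluation" => Dict("standard" => 250.0),
        "gradient" => Dict("standard" => 1500.0),
    ),
)

# Plain model evaluation (no gradient) and gradient evaluation.
eval_time = results["AD_Benchmarking"]["evaluation"]["standard"]
grad_time = results["AD_Benchmarking"]["gradient"]["standard"]

# Gradient cost relative to a plain evaluation.
relative_gradient_cost = grad_time / eval_time
```

Reporting the ratio alongside the raw times makes it easier to compare AD backends across models of different sizes.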

6 participants