Improvements to DynamicPPLBenchmarks #346
base: master
Conversation
… for downstream tasks
This might be helpful for running benchmarks via CI - https://github.com/tkf/BenchmarkCI.jl
@torfjelde should we improve this PR by incorporating it? Also, https://github.com/TuringLang/TuringExamples contains some very old benchmarking code.
Pull Request Test Coverage Report for Build 13093265728 — details on Coveralls. 💛
Codecov Report: All modified and coverable lines are covered by tests ✅

| Coverage Diff | master | #346 | +/- |
|---------------|--------|------|-----|
| Coverage | 86.17% | 86.17% | |
| Files | 36 | 36 | |
| Lines | 4305 | 4305 | |
| Hits | 3710 | 3710 | |
| Misses | 595 | 595 | |

☔ View full report in Codecov by Sentry.
I think there are a few different things we need to address:

IMO, the CI stuff is not really that crucial. The most important things are a) choosing a suite of models that answers all the questions we want, e.g. how do changes we make affect different implementations of a model, how is scaling wrt. the number of parameters affected, how are compilation times affected, etc., and b) deciding what the output format for all of this should be.
Some further notes on this. IMO we're mainly interested in a few different "experiments". We don't want to be testing every model out there, so there are specific things we want to "answer" with our benchmarks. As a result, I'm leaning more towards a Weave approach, with each notebook answering a distinct question, e.g. "how does the model scale with the number of observations", and subsequently producing outputs that can be compared across versions somehow. That is, I think the overall approach taken in this PR is "correct", but we need to make it much nicer and update how the benchmarks are performed. But then the question is: what are the "questions" we want to answer? Here are a few I can think of:
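As a rough sketch of what one such "question" could look like in code — here, the scaling-with-number-of-observations question mentioned above — the following is an illustration only; the model, sizes, and printing are placeholders, not part of this PR:

```julia
using BenchmarkTools
using Distributions
using DynamicPPL

# Illustrative model only: a normal likelihood over n observations.
@model function scaling_demo(x)
    μ ~ Normal(0, 1)
    σ ~ truncated(Normal(0, 1); lower=0)
    for i in eachindex(x)
        x[i] ~ Normal(μ, σ)
    end
end

# Benchmark plain model evaluation for increasing numbers of observations.
context = DefaultContext()
for n in (10, 100, 1000)
    model = scaling_demo(randn(n))
    vi = VarInfo(model)  # typed VarInfo
    trial = @benchmark $model($vi, $context)
    println("n = $n: median evaluation time = ", median(trial).time, " ns")
end
```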
We can store html of
The Weave approach looks fine, as each notebook could address a specific question!
It took a lot of time to run the benchmarks from this PR locally, so I guess a GH Action is not preferred for this! Let me know what to do next, and I will proceed as you say!
I have looked into this; there are many models, so we must figure out which ones to benchmark.
@shravanngoswamii can you run all models in https://github.com/JasonPekos/TuringPosteriorDB.jl and provide an output like https://nsiccha.github.io/StanBlocks.jl/performance.html#visualization? Let's create a

EDIT: a first step is to
After this is done, start a new PR, work on adding
I don't think we can run this in GHA; it takes too much time to run, so how are we expecting to run individual models? And can you give me a rough idea of what we are expecting from the DynamicPPL benchmarking PR as of now? Can we pick some particular models that can run on GH Actions? And if we are going with the JMD Weave approach, then I guess we will use many JMD scripts in the future... What parameters should be kept in the benchmarks, and do you have any particular format in which we should display benchmark results? I am working on this kind of stuff for the first time, so I guess I am taking too much time to understand even simple things! Really sorry for it!
Ideally, we cherry-pick a suitable set of benchmarks that could run on GitHub CI. Let's consider replacing

More expensive benchmarks could be transferred into a separate script which we can run on private machines if necessary.
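One possible shape for that split, as a minimal sketch with made-up group names and contents: keep a small "quick" BenchmarkGroup for CI and a separate, heavier one for local runs.

```julia
using BenchmarkTools

# Hypothetical split: a cheap suite for CI and a heavier suite for local use.
quick_suite = BenchmarkGroup()
quick_suite["demo1"] = @benchmarkable sum(rand(100))

extensive_suite = BenchmarkGroup()
extensive_suite["big_model"] = @benchmarkable sum(rand(10_000_000))

# CI would only run the quick suite; the extensive one runs on private machines.
run(quick_suite; verbose=false)
```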
Is this kind of GitHub comment fine?
Or maybe this:
Hi @shravanngoswamii, thanks for working on this, and sorry that we've been neglecting this issue a bit. @willtebbutt and I will be more active in the future and can help you with any questions you have about DPPL benchmarking. Could you summarise the work you've done so far? I see you've made a thing that produces some nice tables; is that built on top of this PR? Is the code on some public branch yet?

On a couple of the more specific questions you had: I think it would be great to get some basic benchmarks running on GHA. Results will be variable, because who knows what sort of resources the GHA runner has available, and we can't run anything very heavy, but that's okay. Just getting a GitHub comment with a table like the one you made would be helpful for spotting any horrible regressions where suddenly we are allocating a lot and runtime has gone up tenfold because of some type instability. We could also have a heavier benchmarking suite that can only reasonably be run locally, but does more extensive benchmarks. However, if you've already got code half done for producing the tables for some simple, quick benchmarks, I'm very happy to finish that first and worry about different sorts of benchmarks later.

For the two table formats you proposed, I think either is workable, but maybe the first one would be nicer, to avoid side scrolling. If we add a few more models it's going to get quite long though. Would it make sense to run the full set of six benchmarks (typed, untyped, 4 different simple varinfos) only on one model, and then run only one or two (
Remaining comments which could not be posted as review comments, to avoid the GitHub rate limit:

[JuliaFormatter] reported by reviewdog 🐶 — formatting suggestions for DynamicPPL.jl/benchmarks/results/release-0.30.1/benchmarks.md at 0291c2f: lines 714 to 715, 865 to 866, 1003 to 1005, 1021 to 1022, and 1158 to 1160.
…into tor/benchmark-update
Hello @mhauru, I have updated this branch itself and added a Julia script that generates the Markdown tables and also stores the benchmarking report as a Markdown file and JSON in the results directory. Locally, the generated output looks like this:

>> Running benchmarks for model: demo1
0.013535 seconds (7.19 k allocations: 495.211 KiB, 99.84% compilation time)
>> Running benchmarks for model: demo2
0.006908 seconds (5.35 k allocations: 361.320 KiB, 99.67% compilation time)
## DynamicPPL Benchmark Results (benchmarks_2025-02-02_04-36-46)
### Execution Environment
- Julia version: 1.10.5
- DynamicPPL version: 0.32.2
- Benchmark date: 2025-02-02T04:37:00.205
| Model | Evaluation Type | Time | Memory | Allocs | Samples |
|-------|-------------------------------------------|------------|-----------|--------|---------|
| demo1 | evaluation typed | 191.000 ns | 160 bytes | 3 | 10000 |
| demo1 | evaluation untyped | 1.029 μs | 1.52 KiB | 32 | 10000 |
| demo1 | evaluation simple varinfo dict | 709.000 ns | 704 bytes | 26 | 10000 |
| demo1 | evaluation simple varinfo nt | 43.000 ns | 0 bytes | 0 | 10000 |
| demo1 | evaluation simple varinfo dict from nt | 49.000 ns | 0 bytes | 0 | 10000 |
| demo1 | evaluation simple varinfo componentarrays | 42.000 ns | 0 bytes | 0 | 10000 |
| demo2 | evaluation typed | 273.000 ns | 160 bytes | 3 | 10000 |
| demo2 | evaluation untyped | 2.570 μs | 3.47 KiB | 67 | 10000 |
| demo2 | evaluation simple varinfo dict | 2.169 μs | 1.42 KiB | 60 | 10000 |
| demo2 | evaluation simple varinfo nt | 136.000 ns | 0 bytes | 0 | 10000 |
| demo2 | evaluation simple varinfo dict from nt | 122.000 ns | 0 bytes | 0 | 10000 |
| demo2 | evaluation simple varinfo componentarrays | 137.000 ns | 0 bytes | 0 | 10000 |
Benchmark results saved to: results/benchmarks_2025-02-02_04-36-46
We can just print the generated REPORT.md in comments!
Do you want me to create a web interface for DynamicPPL benchmarks where we can compare multiple benchmark reports, or simply view all the other benchmarks in one place?
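For reference, a stripped-down sketch of how such a Markdown table can be put together from BenchmarkTools results; the helper name, column choice, and use of BenchmarkTools' internal pretty-printing helpers are my own assumptions, not necessarily what the script on this branch does:

```julia
using BenchmarkTools
using Printf

# Turn (model name, evaluation type, Trial) entries into a Markdown table.
function markdown_table(rows)
    lines = [
        "| Model | Evaluation Type | Time | Memory | Allocs |",
        "|-------|-----------------|------|--------|--------|",
    ]
    for (model_name, eval_type, trial) in rows
        est = median(trial)
        push!(lines, @sprintf(
            "| %s | %s | %s | %s | %d |",
            model_name, eval_type,
            BenchmarkTools.prettytime(est.time),      # internal helper
            BenchmarkTools.prettymemory(est.memory),  # internal helper
            est.allocs))
    end
    return join(lines, "\n")
end

# Usage sketch with a dummy benchmark:
trial = @benchmark sum(rand(1000))
println(markdown_table([("demo1", "evaluation typed", trial)]))
```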
Sorry for the slow response, I've been a bit on-and-off work this week. I'll have a look at the code. Would you also be up for talking about this over Zoom? Could be easier.

My first thought is that the table looks good, and we could be close to having the first version of this done by just making those tables be autoposted on PRs. I do wonder about the accumulation of these REPORT.md files; it's nice to be able to see old results for comparison, but we might soon end up with dozens and dozens of them in the repo. Maybe there could be one file in the repo for the latest results on that branch, and you could see how the benchmarks develop by checking the git history of that file? I might check what @willtebbutt has done for this in Mooncake.
Maybe at some point, but for now I think we can focus on getting a first version of this in, where it starts posting comments on PRs and helps us catch any horrible regressions, and worry about fancier setups later.
No worries at all! I’ve also been a bit slow, between exams and some hackathons recently.
Sure! Just let me know when you're available. I’m free anytime after 1:30 PM UTC on regular days, and anytime on Friday, Saturday, and Sunday.
Okay, so I will set up a benchmarking CI for PRs. How about generating one REPORT.md for each version of DPPL? Or maybe appending the reports for each version to a single REPORT.md.
A drive-by comment: I don't think the models currently tested are that useful. These days, benchmarks should be performed with TuringBenchmarking.jl so you can track the gradient timings properly 👍
Agreed that using TuringBenchmarking.jl would be good. Some further thoughts:
I agree with @mhauru's suggestions. @penelopeysm showed some nice examples in #806 (comment). I'd suggest that we turn those into a CI workflow. It is also a good idea to keep these benchmarks useful for DynamicPPL developers rather than aimed at a general audience.
@shravanngoswamii and I just had a call to discuss this. He helped me understand how the current code works, and we decided on the following action items:
I'll take the last item of that list, @shravanngoswamii will take on the others, and I'm available to help whenever needed. The goal is to have a small set of quick, crude benchmarks that you can run locally to get output as plain text (or maybe JSON if we feel like it), and that runs automatically on GHA and posts comments with a results table on PRs. We can then later add more features if/when we want them, such as
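On the "plain text (or maybe JSON)" output point: BenchmarkTools already ships a JSON-based save/load, so a minimal local round-trip could look like the sketch below (the suite contents and file name are arbitrary placeholders):

```julia
using BenchmarkTools

# A throwaway suite, standing in for the real DynamicPPL benchmark suite.
suite = BenchmarkGroup()
suite["demo"] = @benchmarkable sum(rand(1000))

results = run(suite; verbose=false)

# BenchmarkTools serialises results to JSON and reads them back.
BenchmarkTools.save("benchmark_results.json", results)
loaded = BenchmarkTools.load("benchmark_results.json")[1]
println(median(loaded["demo"]))
```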
…into tor/benchmark-update
@mhauru I don't know if the current approach I used is correct or not. Just have a look at it and let me know whatever changes are required!
Thanks @shravanngoswamii, the overall structure and approach here looks good. I had one bug fix to propose, and then some style points and simplifications.
I'll also start making a list of models to test. Would you like for me to push changes to the models to this same PR, or make a PR into this PR?
    context = DefaultContext()

    # Create the chosen varinfo.
    vi = nothing
Since all the branches of the if-statement set `vi`, maybe this could be replaced with something like

    vi = if varinfo_choice == :untyped
        vi = VarInfo()
        model(vi)
        vi
    elseif varinfo_choice == :typed
        VarInfo(model)
    elseif [blahblahblah]
    end

Note that in Julia, if-statements always evaluate to a value, namely the value of the last statement in the branch of the if-block that got evaluated.
This is just a minor style point; the current code works fine.
    using DynamicPPL
    using DynamicPPLBenchmarks
    using BenchmarkTools
    using TuringBenchmarking
    using Distributions
    using PrettyTables
We are trying to move away from unqualified `using X` statements in TuringLang (see TuringLang/Turing.jl#2288). Could these be replaced with either `using X: X`, which then forces you to qualify uses of the module later as `X.foo`, or with `using X: foo` if only one or two names need to be imported from `X`?
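For concreteness, a small sketch of the two suggested styles, using PrettyTables as the example package (the calls are illustrative only):

```julia
# Option 1: bind only the module name, and qualify each use.
using PrettyTables: PrettyTables
PrettyTables.pretty_table(rand(3, 3))

# Option 2: import just the names that are actually needed.
using PrettyTables: pretty_table
pretty_table(rand(3, 3))
```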
    using DynamicPPL
    using BenchmarkTools
Same comment as before about imports.
        :forwarddiff => :forwarddiff,
        :reversediff => :reversediff,
        :zygote => :zygote
    )
This seems unnecessary and unused.
    # Define available VarInfo types.
    # Each entry is (Name, function to produce the VarInfo)
    available_varinfo_types = Dict(
        :untyped => ("UntypedVarInfo", VarInfo),
        :typed => ("TypedVarInfo", m -> VarInfo(m)),
        :simple_namedtuple => ("SimpleVarInfo (NamedTuple)", m -> SimpleVarInfo{Float64}(m())),
        :simple_dict => ("SimpleVarInfo (Dict)", m -> begin
            retvals = m()
            varnames = map(keys(retvals)) do k
                VarName{k}()
            end
            SimpleVarInfo{Float64}(Dict(zip(varnames, values(retvals))))
        end),
    )
This seems unnecessary and unused.
    # Convert results to a 2D array for PrettyTables
    function to_matrix(tuples::Vector{<:NTuple{5,Any}})
        n = length(tuples)
        data = Array{Any}(undef, n, 5)
        for i in 1:n
            for j in 1:5
                data[i, j] = tuples[i][j]
            end
        end
        return data
    end

    table_matrix = to_matrix(results_table)
I think this could be simplified into

    table_matrix = hcat(Iterators.map(collect, zip(results_table...))...)

You could also skip the `Iterators.map(collect, blah)` part if, in the earlier loop, you made the elements of `results_table` be vectors rather than tuples, although I appreciate the neatness of having them be tuples. Or you could have `results_table` be an `Array{Any, 2}(undef, length(chosen_combinations), 5)` from the start. There are a few ways to simplify this, and I might not have thought of the simplest one; feel free to pick your favourite.
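To see that the suggested one-liner does the same thing as to_matrix, here is a quick check on made-up data (the tuples below are fake stand-ins for benchmark rows):

```julia
# Two fake 5-tuples standing in for benchmark result rows.
results_table = [
    ("demo1", "evaluation typed", 191.0, 160, 3),
    ("demo2", "evaluation typed", 273.0, 160, 3),
]

# zip(results_table...) iterates column-wise over the tuples, collect turns
# each column into a vector, and hcat glues the columns into a 2x5 matrix.
table_matrix = hcat(Iterators.map(collect, zip(results_table...))...)
@assert size(table_matrix) == (2, 5)
@assert table_matrix[1, 1] == "demo1" && table_matrix[2, 3] == 273.0
```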
    # Add the evaluation benchmark.
    suite["evaluation"] = @benchmarkable $model($vi, $context)
I think this is unnecessary, because the `make_turing_suite` suite already includes a benchmark of just the plain model evaluation. The results can be found under `results["AD_Benchmarking"]["evaluation"]["standard"]`, whereas the AD backend results are at `results["AD_Benchmarking"]["gradient"]["standard"]`.
    eval_time = median(results["evaluation"]).time
    ad_eval_time = median(results["AD_Benchmarking"]["evaluation"]["standard"]).time
results["AD_Benchmarking"]["evaluation"]["standard"]
is actually getting the time for just the plain model evaluation without gradient, so very similar to the results["evaluation"]
one. I think you want results["AD_Benchmarking"]["gradient"]["standard"]
and results["AD_Benchmarking"]["evaluation"]["standard"]
. See also a comment I left in DynamicPPLBenchmarks.jl that relates to this.
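To make the distinction concrete, here is a hedged sketch of reading both timings out of a suite built with TuringBenchmarking's make_turing_suite. The nesting under "AD_Benchmarking" mirrors the code in this PR; the toy model is mine, and depending on the TuringBenchmarking version the "gradient" group may be further keyed by AD backend:

```julia
using BenchmarkTools
using Distributions
using DynamicPPL
using TuringBenchmarking: make_turing_suite

# A throwaway model, purely for illustration.
@model function demo_grad()
    x ~ Normal(0, 1)
end

suite = BenchmarkGroup()
suite["AD_Benchmarking"] = make_turing_suite(demo_grad())
results = run(suite; verbose=false)

# Plain model evaluation, no gradient involved:
eval_time = median(results["AD_Benchmarking"]["evaluation"]["standard"]).time
# Evaluation plus the gradient through the AD backend:
ad_grad_time = median(results["AD_Benchmarking"]["gradient"]["standard"]).time
```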
Produces results such as those that can be seen here: #309 (comment)