
Stagger benchmarks to smooth background noise #595

Closed
jbrockmendel opened this issue Dec 11, 2017 · 15 comments
Labels
enhancement Triaged as an enhancement request

Comments

@jbrockmendel
Contributor

When running asv continuous to compare commits A and B, all the benchmarks from A run followed by all the benchmarks from B, e.g. "A.foo, A.bar, A.baz, B.foo, B.bar, B.baz". Would it be feasible to instead run "A.foo, B.foo, A.bar, B.bar, A.baz, B.baz"?

(In fact, because each benchmark is run multiple times, ideally the staggering would be even finer-grained.)

The thought here is that background processes make noise auto-correlated, so running comparisons back-to-back may give more informative ratios. (Based on amateur speculation.)
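
A minimal sketch of the interleaved ordering I have in mind (my own illustration, not asv's scheduler):

```python
def staggered_schedule(commits, benchmarks):
    """Interleave benchmarks across commits (A.foo, B.foo, A.bar, B.bar, ...)
    instead of the blocked order (A.foo, A.bar, ..., B.foo, B.bar, ...).
    Each pair of measurements being compared then runs back-to-back, so
    slowly varying background noise hits both sides of the ratio similarly.
    """
    return [(commit, bench) for bench in benchmarks for commit in commits]

staggered_schedule(["A", "B"], ["foo", "bar", "baz"])
# -> [('A', 'foo'), ('B', 'foo'), ('A', 'bar'), ('B', 'bar'), ('A', 'baz'), ('B', 'baz')]
```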

@pv
Collaborator

pv commented Dec 11, 2017

See here https://github.com/pv/asv/commits/many-proc

Full staggering may not be so sensible, because you need to uninstall/install the project in between.

The general issue is also probably not so much background processes as hardware causes of CPU performance fluctuation. These occur on laptops, but I guess desktop CPUs have similar behavior due to thermal throttling etc.

@jbrockmendel
Contributor Author

The hardware issue sounds tough, but also sounds like you've put a lot of time and thought into this, which is reassuring.

What am I looking at with many-proc? Is the idea to save+aggregate results across multiple runs? That seems like a reasonable approach (and possibly easier to implement than staggering?).

Is the "not so sensible" an indication that I should give up on this? The status quo of running asv continuous repeatedly and hunting for ratios that stay consistent across runs is... not ideal.

@pv
Collaborator

pv commented Dec 11, 2017

Staggering can be implemented (in the same way as multiple runs + aggregating results), but I expect it will be slower because between each benchmark you need to uninstall/reinstall the project to the environment.
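
To make the cost concrete (a back-of-the-envelope sketch, not asv internals):

```python
def install_count(n_benchmarks, n_commits=2, staggered=False):
    # Blocked order: install each commit's build once, run all its benchmarks.
    # Fully staggered: uninstall/reinstall before every benchmark, i.e. one
    # install per (commit, benchmark) pair.
    return n_commits * n_benchmarks if staggered else n_commits

install_count(50)                  # 2 installs for a 50-benchmark comparison
install_count(50, staggered=True)  # 100 installs for the same comparison
```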

@jbrockmendel
Contributor Author

it will be slower because between each benchmark you need to uninstall/reinstall the project to the environment.

Can you expand on that a bit? Could the parent process spawn two env-specific processes which in turn spawn processes for each benchmark without switching in between?

Even if not, slower and more accurate is a tradeoff I'd be happy with. I implemented asv for dateutil but the PR stalled because ATM the results are just way too noisy.

@pv
Collaborator

pv commented Jul 26, 2018

@jbrockmendel: you can perhaps try asv master branch now that the multi-process benchmarking is in there. E.g. asv run ... -a processes=4 to override benchmark attributes from command line.

@jbrockmendel
Contributor Author

I'll give it a try. How do I enable the feature where it stores results to calculate statistics over multiple runs? If we can get that working then stability becomes a problem we can throw hardware-hours at.

@pv
Collaborator

pv commented Jul 28, 2018

It's automatically on (processes=2 by default), adjustable on the command line as above.

There's no option to combine results from multiple asv continuous runs with one another, however.

Ultimately, if you want good benchmarking accuracy, you probably need to at least disable CPU frequency scaling for one CPU and then use taskset (on Linux) to run the benchmark processes on that CPU (and maybe isolate it from general scheduling).
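
For example (a sketch of the kind of setup meant; the CPU number and exact tool availability depend on your machine and distribution):

```shell
# Fix CPU 3 to the "performance" governor so its frequency stops fluctuating,
# then run asv pinned to that CPU only.
sudo cpupower -c 3 frequency-set --governor performance
taskset -c 3 asv continuous -E virtualenv -f 1.1 master HEAD
```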

None of this is particularly necessary if you just accumulate historical data, as done here and by most projects using asv: https://pv.github.io/numpy-bench/
Sure, the results are somewhat noisy, but unlike with asv continuous (which was not a main intended use case for asv in the beginning), small noise in the time series won't matter much.
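
That historical workflow is roughly (these are asv's own subcommands; the commit range benchmarked is up to you):

```shell
asv run ALL      # benchmark commits over the project's history
asv publish      # build the static HTML report from the accumulated results
asv preview      # serve the report locally for inspection
```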

@jbrockmendel
Contributor Author

time asv continuous -E virtualenv -f 1.1 master $HEAD -b groupby.Categories
[...]
real	0m44.505s
user	0m41.800s
sys	0m10.168s

time asv continuous -E virtualenv -f 1.1 master $HEAD -b groupby.Categories -a processes=4
[...]
real	1m13.417s
user	1m9.164s
sys	0m18.696s

Is this the expected behavior? I thought it would go the other way.

@jbrockmendel
Contributor Author

There's no option to combine results from multiple asv continuous runs to each other, however.

Darn. Would a PR implementing this be accepted? (or feasible?)

None of this is particularly necessary if you just accumulate historical data,
Ultimately, if you want good benchmarking accuracy, you probably need to at least disable CPU frequency tuning for one CPU and then use taskset [...]

I have taken to using taskset, am not familiar with the tuning bit.

It sounds like our [pandas, prospectively dateutil] use case is not exactly the intended one, but I'm optimistic/hopeful we can make it work. Generally when a PR is made in a perf-sensitive part of the code the maintainer asks for asv comparison. A single run of asv continuous [...] says that un-changed parts of the code have gotten much faster/slower; subsequent runs often flip those results and/or give a new set of benchmarks that have changed.

Cranking the sample size up to 11 seems like the least labor-intensive way to address this. Am I wrong?

@pv
Collaborator

pv commented Jul 28, 2018

Is this the expected behavior? I thought it would go the other way.

I'm not sure what you mean --- the default is processes=2, and increasing the number to 4 makes it take longer --- as expected?

Darn. Would a PR implementing this be accepted? (or feasible?)

It's feasible, just needs some plumbing in the right places.

But I'm not sure if it will ultimately solve the problem with the accuracy of the results --- you may be able to somewhat reduce the number of false positives, but not fully... E.g., if there is performance variation on a time scale of ~10 s (such as with laptop CPU thermal control), then you still need some luck for all your benchmarks to sample the full variation.

It sounds like our [pandas, prospectively dateutil] use case is not exactly the intended one, but I'm optimistic/hopeful we can make it work. Generally when a PR is made in a perf-sensitive part of the code the maintainer asks for asv comparison. A single run of asv continuous [...] says that un-changed parts of the code have gotten much faster/slower; subsequent runs often flip those results and/or give a new set of benchmarks that have changed.

Cranking the sample size up to 11 seems like the least labor-intensive way to address this. Am I wrong?

Sure, it should be possible to make it work by measuring longer (i.e., adjusting repeat and processes). However, I'm not sure it will be feasible to e.g. run the whole pandas benchmark suite on an unconfigured laptop CPU without false positives.

This is not specific to asv, except in the sense that asv runs many timing benchmarks, and the chance of false positives is multiplied by the number of benchmarks run.
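
That multiplication effect can be quantified with a standard independence approximation (my illustration, not anything asv computes):

```python
def p_any_false_positive(p_single, n_benchmarks):
    # If each of N benchmark comparisons independently reports a spurious
    # regression with probability p, the chance that at least one does is
    # 1 - (1 - p)**N, which approaches 1 as the suite grows.
    return 1 - (1 - p_single) ** n_benchmarks

p_any_false_positive(0.01, 100)  # ~0.63: even a 1% per-benchmark rate makes
                                 # noise almost certain in a large suite
```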

However, I expect the situation is already better with asv 0.3.x, which does the statistics properly, than with asv 0.2.x.

@pv
Collaborator

pv commented Jul 28, 2018

cf gh-689

@jbrockmendel
Contributor Author

#689 looks like it could be a big help, thanks.

I'm not sure what you mean --- the default is processes=2, and increasing the number to 4 makes it take longer --- as expected?

I'm back to being confused. I expected more processes to mean faster execution; why is that the wrong intuition?

@pv
Collaborator

pv commented Jul 29, 2018

The processes are run sequentially, not at the same time. If they would be run at the same time, this would change the load on the machine and affect results.
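
A sketch of that sequential scheme (my illustration of the principle, not asv's actual sampling loop):

```python
import statistics
import time

def measure(func, processes=2, repeat=3):
    """Run `processes` sampling passes one after another, never concurrently,
    so the machine load stays constant while timing. Total wall time therefore
    grows roughly linearly with `processes`.
    """
    samples = []
    for _ in range(processes):       # sequential passes, not parallel workers
        for _ in range(repeat):
            t0 = time.perf_counter()
            func()
            samples.append(time.perf_counter() - t0)
    return statistics.median(samples), len(samples)
```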

@pv
Collaborator

pv commented Jul 29, 2018

#689 looks like it could be a big help, thanks.

You can try it out in principle. Also, it's unclear to me whether the issues you mention were with asv 0.2.x or whether they persist on the current master branch.

@pv
Collaborator

pv commented Aug 15, 2018

gh-697

@pv pv closed this as completed in #697 Aug 18, 2018