
Stagger benchmarks to smooth background noise #595

Closed
jbrockmendel opened this issue Dec 11, 2017 · 15 comments
Labels
enhancement Triaged as an enhancement request

Comments

@jbrockmendel
Contributor

When running asv continuous to compare commits A and B, all the benchmarks from A run followed by all the benchmarks from B, e.g. "A.foo, A.bar, A.baz, B.foo, B.bar, B.baz". Would it be feasible to instead run "A.foo, B.foo, A.bar, B.bar, A.baz, B.baz"?

(In fact, because each benchmark is run multiple times, ideally the staggering would be even finer-grained.)

The thought here is that background processes make noise auto-correlated, so running comparisons back-to-back may give more informative ratios. (Based on amateur speculation.)
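
A minimal sketch of the interleaved ordering I have in mind (my own illustration, not asv's scheduler):

```python
def staggered_schedule(commits, benchmarks):
    """Interleave benchmarks across commits (A.foo, B.foo, A.bar, B.bar, ...)
    instead of the blocked order (A.foo, A.bar, ..., B.foo, B.bar, ...).
    Each pair of measurements being compared then runs back-to-back, so
    slowly varying background noise hits both sides of the ratio similarly.
    """
    return [(commit, bench) for bench in benchmarks for commit in commits]

staggered_schedule(["A", "B"], ["foo", "bar", "baz"])
# -> [('A', 'foo'), ('B', 'foo'), ('A', 'bar'), ('B', 'bar'), ('A', 'baz'), ('B', 'baz')]
```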

@pv
Collaborator

pv commented Dec 11, 2017

See here https://github.com/pv/asv/commits/many-proc

Full staggering may not be so sensible, because you need to uninstall/install the project in between.

The general issue is also probably not so much background processes as hardware causes of CPU performance fluctuation. These occur on laptops, but I guess desktop CPUs have similar behavior due to thermal throttling etc.

@jbrockmendel
Contributor Author

The hardware issue sounds tough, but also sounds like you've put a lot of time and thought into this, which is reassuring.

What am I looking at with many-proc? Is the idea to save+aggregate results across multiple runs? That seems like a reasonable approach (and possibly easier to implement than staggering?).

Is the "not so sensible" an indication that I should give up on this? The status quo of running asv continuous repeatedly and hunting for ratios that stay consistent across runs is... not ideal.

@pv
Collaborator

pv commented Dec 11, 2017

Staggering can be implemented (in the same way as multiple runs + aggregating results), but I expect it will be slower because between each benchmark you need to uninstall/reinstall the project to the environment.
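
To make the cost concrete (a back-of-the-envelope sketch, not asv internals):

```python
def install_count(n_benchmarks, n_commits=2, staggered=False):
    # Blocked order: install each commit's build once, run all its benchmarks.
    # Fully staggered: uninstall/reinstall before every benchmark, i.e. one
    # install per (commit, benchmark) pair.
    return n_commits * n_benchmarks if staggered else n_commits

install_count(50)                  # 2 installs for a 50-benchmark comparison
install_count(50, staggered=True)  # 100 installs for the same comparison
```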

@jbrockmendel
Contributor Author

it will be slower because between each benchmark you need to uninstall/reinstall the project to the environment.

Can you expand on that a bit? Could the parent process spawn two env-specific processes which in turn spawn processes for each benchmark without switching in between?

Even if not, slower and more accurate is a tradeoff I'd be happy with. I implemented asv for dateutil but the PR stalled because ATM the results are just way too noisy.

@pv
Collaborator

pv commented Jul 26, 2018

@jbrockmendel: you can perhaps try asv master branch now that the multi-process benchmarking is in there. E.g. asv run ... -a processes=4 to override benchmark attributes from command line.

@jbrockmendel
Contributor Author

I'll give it a try. How do I enable the feature where it stores results to calculate statistics over multiple runs? If we can get that working then stability becomes a problem we can throw hardware-hours at.

@pv
Collaborator

pv commented Jul 28, 2018

It's automatically on (processes=2 by default), adjustable on the command line as above.

There's no option to combine results from multiple asv continuous runs with one another, however.

Ultimately, if you want good benchmarking accuracy, you probably need to at least disable CPU frequency scaling for one CPU and then use taskset (on Linux) to run the benchmark processes on that CPU (and maybe isolate it from general scheduling).
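
For example (a sketch of the kind of setup meant; the CPU number and exact tool availability depend on your machine and distribution):

```shell
# Fix CPU 3 to the "performance" governor so its frequency stops fluctuating,
# then run asv pinned to that CPU only.
sudo cpupower -c 3 frequency-set --governor performance
taskset -c 3 asv continuous -E virtualenv -f 1.1 master HEAD
```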

None of this is particularly necessary if you just accumulate historical data, as done here and by most projects using asv: https://pv.github.io/numpy-bench/
Sure, the results are somewhat noisy, but unlike with asv continuous (which was not a main intended use case for asv in the beginning), small noise in the time series won't matter much.
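
That historical workflow is roughly (these are asv's own subcommands; the commit range benchmarked is up to you):

```shell
asv run ALL      # benchmark commits over the project's history
asv publish      # build the static HTML report from the accumulated results
asv preview      # serve the report locally for inspection
```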

@jbrockmendel
Contributor Author

time asv continuous -E virtualenv -f 1.1 master $HEAD -b groupby.Categories
[...]
real	0m44.505s
user	0m41.800s
sys	0m10.168s

time asv continuous -E virtualenv -f 1.1 master $HEAD -b groupby.Categories -a processes=4
[...]
real	1m13.417s
user	1m9.164s
sys	0m18.696s

Is this the expected behavior? I thought it would go the other way.

@jbrockmendel
Contributor Author

There's no option to combine results from multiple asv continuous runs to each other, however.

Darn. Would a PR implementing this be accepted? (or feasible?)

None of this is particularly necessary if you just accumulate historical data,
Ultimately, if you want good benchmarking accuracy, you probably need to at least disable CPU frequency tuning for one CPU and then use taskset [...]

I have taken to using taskset, am not familiar with the tuning bit.

It sounds like our [pandas, prospectively dateutil] use case is not exactly the intended one, but I'm optimistic/hopeful we can make it work. Generally when a PR is made in a perf-sensitive part of the code the maintainer asks for asv comparison. A single run of asv continuous [...] says that un-changed parts of the code have gotten much faster/slower; subsequent runs often flip those results and/or give a new set of benchmarks that have changed.

Cranking the sample size up to 11 seems like the least labor-intensive way to address this. Am I wrong?

@pv
Collaborator

pv commented Jul 28, 2018

Is this the expected behavior? I thought it would go the other way.

I'm not sure what you mean --- the default is processes=2, and increasing the number to 4 makes it take longer --- as expected?

Darn. Would a PR implementing this be accepted? (or feasible?)

It's feasible, just needs some plumbing in the right places.

But I'm not sure if it will ultimately solve the problem with the accuracy of the results --- you may be able to somewhat reduce the number of false positives, but not fully... E.g., if there is performance variation on a time scale of ~10 s (such as with laptop CPU thermal control), then you still need some luck for all your benchmarks to sample the full variation.

It sounds like our [pandas, prospectively dateutil] use case is not exactly the intended one, but I'm optimistic/hopeful we can make it work. Generally when a PR is made in a perf-sensitive part of the code the maintainer asks for asv comparison. A single run of asv continuous [...] says that un-changed parts of the code have gotten much faster/slower; subsequent runs often flip those results and/or give a new set of benchmarks that have changed.

Cranking the sample size up to 11 seems like the least labor-intensive way to address this. Am I wrong?

Sure, it should be possible to make it work by measuring longer (i.e., adjusting repeat and processes). However, I'm not sure it will be feasible to e.g. run the whole pandas benchmark suite on an unconfigured laptop CPU without false positives.

This is not specific to asv, except in the sense that asv runs many timing benchmarks, and the chance of false positives is multiplied by the number of benchmarks run.
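
That multiplication effect can be quantified with a standard independence approximation (my illustration, not anything asv computes):

```python
def p_any_false_positive(p_single, n_benchmarks):
    # If each of N benchmark comparisons independently reports a spurious
    # regression with probability p, the chance that at least one does is
    # 1 - (1 - p)**N, which approaches 1 as the suite grows.
    return 1 - (1 - p_single) ** n_benchmarks

p_any_false_positive(0.01, 100)  # ~0.63: even a 1% per-benchmark rate makes
                                 # noise almost certain in a large suite
```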

However, I expect the situation is already better with asv 0.3.x, which does the statistics properly, than with asv 0.2.x.

@pv
Collaborator

pv commented Jul 28, 2018

cf gh-689

@jbrockmendel
Contributor Author

#689 looks like it could be a big help, thanks.

I'm not sure what you mean --- the default is processes=2, and increasing the number to 4 makes it take longer --- as expected?

I'm back to being confused. I expected more processes to mean faster execution; why is that the wrong intuition?

@pv
Collaborator

pv commented Jul 29, 2018

The processes are run sequentially, not at the same time. If they would be run at the same time, this would change the load on the machine and affect results.
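
A sketch of that sequential scheme (my illustration of the principle, not asv's actual sampling loop):

```python
import statistics
import time

def measure(func, processes=2, repeat=3):
    """Run `processes` sampling passes one after another, never concurrently,
    so the machine load stays constant while timing. Total wall time therefore
    grows roughly linearly with `processes`.
    """
    samples = []
    for _ in range(processes):       # sequential passes, not parallel workers
        for _ in range(repeat):
            t0 = time.perf_counter()
            func()
            samples.append(time.perf_counter() - t0)
    return statistics.median(samples), len(samples)
```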

@pv
Collaborator

pv commented Jul 29, 2018

#689 looks like it could be a big help, thanks.

You can try it out in principle. Also, it's unclear to me whether the issues you mention were with asv 0.2.x or whether they persist on the current master branch.

@pv
Collaborator

pv commented Aug 15, 2018

gh-697

@pv pv closed this as completed in #697 Aug 18, 2018