New benchmarks #220
I believe that there are some settings you can use with criterion that will cause the task to error out, and thus CI to fail, if a run regresses by too much on a benchmark. For this to work, you need to have some "baseline" data checked in somewhere in the repo, of course. I have only used criterion a tiny bit, so I don't want to say for sure.

Minor reminder: black_box is somewhat dodgy, because fundamentally there's no way for it to work. So instead it kinda mostly works, but using it too much or too little can quickly throw off the results.
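For reference, a rough sketch of what such a setup could look like (assuming criterion 0.3; the thresholds and the exp2f example are placeholders, not something anyone has actually run here). Baselines are recorded with cargo bench -- --save-baseline NAME and compared against with cargo bench -- --baseline NAME; actually failing CI on a reported regression would still need a small wrapper script around criterion's output:

use criterion::{criterion_group, criterion_main, black_box, Criterion};

// Tighter statistics so that small run-to-run noise is not reported as a change.
fn config() -> Criterion {
    Criterion::default()
        .significance_level(0.01) // stricter test before declaring a difference
        .noise_threshold(0.05)    // ignore changes smaller than 5%
}

fn bench_exp2f(c: &mut Criterion) {
    // black_box on the input prevents constant folding; black_box on the
    // output keeps the call from being optimized away entirely.
    c.bench_function("exp2f", |b| {
        b.iter(|| black_box(libm::exp2f(black_box(1.5f32))))
    });
}

criterion_group! {
    name = benches;
    config = config();
    targets = bench_exp2f
}
criterion_main!(benches);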
Things discovered:
--- a/crates/libm-bench/benches/bench.rs
+++ b/crates/libm-bench/benches/bench.rs
@@ -11,13 +11,13 @@ macro_rules! unary {
pub fn [<$func>](bh: &mut Bencher) {
let mut rng = rand::thread_rng();
let x = rng.gen::<f64>();
- bh.iter(|| test::black_box(libm::[<$func>](x)))
+ bh.iter(|| test::black_box(libm::[<$func>](test::black_box(x))))
}
#[bench]
pub fn [<$func f>](bh: &mut Bencher) {
let mut rng = rand::thread_rng();
let x = rng.gen::<f32>();
- bh.iter(|| test::black_box(libm::[<$func f>](x)))
+ bh.iter(|| test::black_box(libm::[<$func f>](test::black_box(x))))
}
}

And I got real results for exp2f: 1ns regressed to 4ns.

I wrote a simple bench using criterion, and it gave me … Testing exp2f on a constant with no black_box gave me …
The math functions in this library are not constant time; that is, they take a different amount of time to execute depending on the input value passed to them, because they do a different amount of work. For example, calling …

The current benchmarks generate a random value each time, making comparing the measurements of two benchmark runs meaningless, e.g., the first time …

That is, the current benchmark results are meaningless for evaluating whether one algorithm is faster than another. It can go either way, depending on the randomly generated inputs.

If you want to compare two algorithms, the first thing you need is a way to test that the algorithms are correct. The current tests only test that the results "almost" match those of …

Then you need to actually generate benchmarks for relevant inputs requiring different amounts of work, and only then can you compare two algorithms and see if one is truly better than the other for all inputs, only some, etc.
They are not constant time in terms of crypto but overall they do a constant amount of work in the worst case.
… gives …

Which is fair, isn't it?
Yes they can; however, I generate a random value for each call, which gives us an average cost of a call over the distribution of f32.
Alright, I got it. In the current benchmarks we bench an average time, whereas you would like to see some special cases. How about moving to criterion, benching the average in the first iteration, and then adding special cases one by one, and only if we need them?
We don't benchmark average times. The current benchmarks measure the time it takes for two different implementations to compute two different things. Trying to conclude anything about averages from that is flawed. If you want to make any claims about average performance, you either need an actual proof, or to measure the performance of all possible inputs, and average that.
What's the point of doing this? What would comparing two algorithms for two different random inputs tell you about the algorithms? At best you can compare the same algorithm against itself, and conclude that your algorithm is orders of magnitude slower or faster depending on the inputs, e.g., libstd pow vs libstd pow.
You benchmark the performance over one single random input:
It can be the best case, it can be the worst case, it can be somewhere in the middle.
That's right. That's what I am talking about.
An output from …

Go and calculate the probability of all these 341M iterations being the worst cases for the old implementation and the best cases for the new implementation. Here are nice plots for the avg benches:

Old: [criterion report plot]

New: [criterion report plot]
Understanding this report: The plot on the left displays the average time per iteration for this benchmark. The shaded region shows the estimated probability of an iteration taking a certain amount of time, while the line shows the mean. Click on the plot for a larger view showing the outliers. The plot on the right shows the linear regression calculated from the measurements. Each point represents a sample, though here it shows the total time for the sample rather than time per iteration. The line is the line of best fit for these measurements. See the documentation for more details on the additional statistics.

Let's get back to the subject.

A lot. More than comparing two different implementations for only two different inputs.
@kpp could you articulate what that is, with words? You are generating a random input once, and then measuring a math function for that single input many times. The function is pure, does no I/O, does not access the memory hierarchy, etc. The only thing all those beautiful graphs are showing is how much noise your measurement system has (the fastest execution takes ~15ns, the slowest ~3Xns, that's some noise!). I at least do not care about that.

I care about what the execution time of the function is, and the closest value one can obtain to that from criterion, which is a very bad tool for this (you probably want to use the TSC, which criterion doesn't use), is the time of the fastest execution. So the only information your N-page criterion report is giving you is a single number: the fastest execution of algorithm A0 for input I0. Everything else could be relevant for other algorithms that do memory I/O, file I/O, network I/O, etc., but isn't relevant here. (EDIT: or at least I don't know why we would care about how slowly a CPU under load can execute a single function without characterizing the load, where the answer to "how slow" is probably infinitely slow if the OS never schedules it.)

Now, because you pick the input at random, when you rerun for algorithm A1, you get the fastest execution of algorithm A1 for the different input I1. What does that tell you? Absolutely nothing: you have two measurements for two different algorithms, but because these are for two different inputs, you cannot conclude anything about the relative performance of algorithms A0 and A1. Were you to use the same input I for measuring both algorithms, you would get two results, and at least be able to conclude, for that single input, which of the two algorithms is faster.
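To make that last point concrete, here is a rough sketch of a same-input comparison (assuming criterion 0.3's benchmark_group/BenchmarkId API; the choice of pow, the input values, and std's powf as the second implementation are placeholders for "algorithm A0" and "algorithm A1"):

use criterion::{criterion_group, criterion_main, black_box, BenchmarkId, Criterion};

fn bench_pow_same_input(c: &mut Criterion) {
    // One fixed input, shared by both implementations, so the two measurements
    // are at least comparable for this particular input.
    let x: f64 = 0.123456789;
    let y: f64 = 4.25;

    let mut group = c.benchmark_group("pow, same input");
    group.bench_with_input(BenchmarkId::new("libm", x), &x, |b, &x| {
        b.iter(|| black_box(libm::pow(black_box(x), black_box(y))))
    });
    group.bench_with_input(BenchmarkId::new("std", x), &x, |b, &x| {
        b.iter(|| black_box(black_box(x).powf(black_box(y))))
    });
    group.finish();
}

criterion_group!(benches, bench_pow_same_input);
criterion_main!(benches);

Running something like this for a handful of hand-picked inputs (0, subnormals, NaN, INF, a known slow argument, ...) would already say more than two random inputs.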
Yes. Let's say we got a function F with special cases at … So we got four bench results: …

A new impl will give us: …

That is a significant improvement for B1, B2, B4 (6ns -> 1ns) but a little bit of a regression for B3 (4ns -> 5ns). Can we accept a PR with such changes without an avg bench, having no idea what a is?
We can use the same seed in benchmarks =)
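For what it's worth, a minimal sketch of what that could look like with rand's SeedableRng (StdRng, the seed value, and the helper name are arbitrary choices here):

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

// A fixed seed makes the "random" inputs identical across benchmark runs,
// so two runs (or two implementations) at least see the same input sequence.
fn deterministic_inputs(n: usize) -> Vec<f32> {
    let mut rng = StdRng::seed_from_u64(0xC0FFEE);
    (0..n).map(|_| rng.gen::<f32>()).collect()
}

Those inputs could then be looped over inside the benchmark body, or used as the parameter list of a parameterized benchmark.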
Sure, but the examples you were using above (e.g. here) did not do that.
If someone posts a benchmark comparing two implementations, I am going to ask two questions: …
So ideally we would have, for each function, a set of inputs that would allow us to get a good picture of the performance characteristics of the function and that we can use to compare different implementations. Such a set can grow over time.

Also ideally, we would have a good way to measure how long it takes for a function to actually run. Instead of criterion, on x86 at least, we can probably just use the TSC to measure how long it takes for a single function to run a low number of times N for the same input. This only gives us the average over N executions, so we can try to repeat this a couple of times to obtain the smallest value of this average. Maybe we can make N as low as 1, I don't know; it might depend on the API being tested and the input being passed.
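A rough sketch of that TSC idea on x86_64 (no serializing instructions, no core pinning, std::hint::black_box instead of the nightly one, and the function and values are placeholders, so this only illustrates the smallest-average-of-N approach, not a finished tool):

#[cfg(target_arch = "x86_64")]
fn tsc_cycles_per_call(x: f32, n: u64, repeats: usize) -> f64 {
    use core::arch::x86_64::_rdtsc;
    use std::hint::black_box;

    let mut best = f64::INFINITY;
    for _ in 0..repeats {
        let start = unsafe { _rdtsc() };
        for _ in 0..n {
            black_box(libm::exp2f(black_box(x)));
        }
        let end = unsafe { _rdtsc() };
        // Average cycles per call for this batch; keep the smallest average seen.
        let avg = (end - start) as f64 / n as f64;
        if avg < best {
            best = avg;
        }
    }
    best
}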
Yes, another benchmark. With a → INF you will get a performance boost for 3 special cases only and a regression for all inputs excluding 0, NaN, and INF; an avg bench will show that on average you get a regression from 4ns to 5ns, which is meaningful. Right? With a → 0 we will get an avg boost from 6ns to 1ns, which is great.
Yes. That's what I suggested in #220 (comment). We can have both: an avg bench with a const seed and a const number of rand inputs, and a set of special cases that can grow over time. @gnzlbg, your thoughts?
I'm not sure. Imagine the case of … If the function takes more or less the same time when computing any input in …

Does this make sense?
Definitely.
I assumed all functions in libm should take more or less the same time over some range (preferably (-INF, INF)). Now I have to figure out whether that's true or not. Does that sound like a plan?
@gnzlbg
Sure. I recommend searching for loop keywords like "while" in the library to see where the loops are. Also take into account periodic functions, e.g., sin, that take an argument and map it into a region (e.g. [0, 2pi]) and then evaluate it there. It probably makes little sense to test a → inf for these, because you would only be testing [0, 2pi] anyway.
IIRC running tests for all bit patterns for all unary f32 functions took like 30-60 seconds on my machine. Doing 100 iterations would put that in the 1-2h ballpark, but one never needs to run these benchmarks for all functions (e.g. typically a PR discusses a single function). So for a single function, benchmarking all inputs might be doable.

Doing this would give us, e.g., the fastest and slowest inputs for a function, which is something that we could then add to the manually specified benchmarks. This would already make it worth doing IMO.

Computing the average over all inputs is definitely a metric that can be used to conclude whether an implementation is faster or slower than another on average. Whether that's useful information, I don't know. It depends on the function.
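A crude sketch of such an exhaustive sweep (per-input timing with Instant is noisy at this scale and the repeat count is arbitrary, so the result is only a list of candidate fastest/slowest inputs to re-measure properly afterwards; sinf is used as a stand-in):

use std::hint::black_box;
use std::time::Instant;

// Sweep every f32 bit pattern and remember the inputs with the smallest and
// largest measured time. Expect this to run for several minutes.
fn extreme_inputs() -> (f32, f32) {
    const REPEATS: u32 = 8;
    let (mut fastest, mut slowest) = (0.0f32, 0.0f32);
    let (mut min_t, mut max_t) = (u128::max_value(), 0u128);
    for i in 0..=u32::max_value() {
        let x = f32::from_bits(i);
        let start = Instant::now();
        for _ in 0..REPEATS {
            black_box(libm::sinf(black_box(x)));
        }
        let t = start.elapsed().as_nanos();
        if t < min_t { min_t = t; fastest = x; }
        if t > max_t { max_t = t; slowest = x; }
    }
    (fastest, slowest)
}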
It is interesting to plot …
@burrbull for all x, or only for x in [0, 2pi] (and inf, nan, etc.)?
For both. |
UPD description. |
Hey guys... we used …

According to https://rust-random.github.io/book/guide-start.html, …
use criterion::{criterion_group, criterion_main, BatchSize, Criterion, ParameterizedBenchmark};
use rand::Rng;

fn bench_sin(c: &mut Criterion) {
c.bench(
"sin",
ParameterizedBenchmark::new("const",
|b, i| b.iter(|| libm::sin(*i)),
vec![std::f64::NAN, std::f64::INFINITY, std::f64::NEG_INFINITY])
);
c.bench_function("sin [0,1)", |b| {
b.iter_batched(
|| rand::thread_rng().gen::<f64>(),
|v| libm::sin(v),
BatchSize::SmallInput
)
});
c.bench_function("sin [1,2)", |b| {
b.iter_batched(
|| rand::thread_rng().gen::<f64>() + 1.,
|v| libm::sin(v),
BatchSize::SmallInput
)
});
c.bench_function("sin [2,3)", |b| {
b.iter_batched(
|| rand::thread_rng().gen::<f64>() + 2.,
|v| libm::sin(v),
BatchSize::SmallInput
)
});
c.bench_function("sin [3,4)", |b| {
b.iter_batched(
|| rand::thread_rng().gen::<f64>() + 3.,
|v| libm::sin(v),
BatchSize::SmallInput
)
});
c.bench_function("sin [4,5)", |b| {
b.iter_batched(
|| rand::thread_rng().gen::<f64>() + 4.,
|v| libm::sin(v),
BatchSize::SmallInput
)
});
c.bench_function("sin [5,6)", |b| {
b.iter_batched(
|| rand::thread_rng().gen::<f64>() + 5.,
|v| libm::sin(v),
BatchSize::SmallInput
)
});
c.bench_function("sin [6,7)", |b| {
b.iter_batched(
|| rand::thread_rng().gen::<f64>() + 6.,
|v| libm::sin(v),
BatchSize::SmallInput
)
});
c.bench_function("sin [0, 2*PI)", |b| {
b.iter_batched(
|| rand::thread_rng().gen::<f64>() * 2. * std::f64::consts::PI,
|v| libm::sin(v),
BatchSize::SmallInput
)
});
}

criterion_group!(benches, bench_sin);
criterion_main!(benches);
Why …?

Can …?
I don't know =(
Wild guess: it messes up the branch prediction more often. |
I would probably just run the benchmark against all bit patterns to begin with:

for i in 0..=u32::max_value() {
    let x = f32::from_bits(i);
    // black_box keeps the pure, otherwise-unused call from being optimized away.
    std::hint::black_box(libm::sinf(std::hint::black_box(x)));
}
Here are some benches of libm found on the internets: |
For now there are some PRs blocked because of a lack of good benches (#169 and friends).
We can probably just use the TSC to measure how long it takes for a single function to run for a low N number of times for the same input?

No need to. There is https://bheisler.github.io/criterion.rs/book/faq.html#how-should-i-benchmark-small-functions

I don't know how to detect bench regression in CI. Will maintainers check regression on their local machines or is there a better way?

There is: https://bheisler.github.io/criterion.rs/book/faq.html#how-should-i-run-criterionrs-benchmarks-in-a-ci-pipeline