15ns run time on empty benchmark #77
Tested on Linux; it seems like the issue is with Zig's standard timer itself. Here's what I am doing:

Result:
The measurement code is at Line 91 in 1f0249e
Looking at this again, the switch should ideally be excluded from what's measured, but there will always be the overhead of calling to and returning from a function pointer, since we can't inline the function body into that. It would seem like a feature request rather than a bug fix to go to special efforts to account for the overhead of a single function call, which is likely to be completely dominated by the run time of most benchmarks people would want to run. Micro-benchmarking properly is a niche and difficult topic - again, take a look at the lengths Sushi Roll goes to for that.
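To illustrate the function pointer point, here is a minimal hypothetical runner in C (not zBench's actual runner); because `fn` is only known through a pointer at run time, its call and return are always inside the timed region:

```c
#include <stdio.h>
#include <time.h>

typedef void (*bench_fn)(void);

/* Hypothetical empty benchmark, as in the issue title. */
static void empty_benchmark(void) {}

/* Times one call through a function pointer; the call itself is part of the measurement. */
static long long time_one_call(bench_fn fn) {
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    /* volatile prevents the compiler from devirtualizing and inlining the call */
    volatile bench_fn fn = empty_benchmark;
    printf("empty benchmark: %lldns\n", time_one_call(fn));
    return 0;
}
```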
To be a bit more constructive: maybe do what you're wanting to benchmark in a loop inside your benchmark function, and then you get an average time over many executions of it, which will make the function call overhead affect the result a lot less. Then you'll have the overhead of looping back each time, so you might want to do some loop unrolling to mitigate that, though the compiler might do that for you. You probably want to examine the assembly with godbolt.org to make sure it's actually benchmarking what you intend to benchmark.
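A minimal sketch of that looping/averaging idea, in C; `busy_work` is a hypothetical stand-in for whatever you actually want to benchmark, and the same structure applies inside a Zig benchmark function:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the code under test. */
static uint64_t busy_work(uint64_t x) {
    return x * 2654435761u + 12345u;
}

int main(void) {
    const long long iterations = 1000000;
    struct timespec t0, t1;
    volatile uint64_t sink = 0; /* side effect keeps the loop from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long long i = 0; i < iterations; i++) {
        sink = busy_work(sink + (uint64_t)i);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long long total_ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                       + (t1.tv_nsec - t0.tv_nsec);

    /* One timer start/stop pair amortized over many iterations. */
    printf("avg per iteration: %lldns\n", total_ns / iterations);
    return 0;
}
```

With one start/stop pair amortized over a million iterations, the fixed measurement overhead contributes a negligible fraction of the per-iteration figure.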
@bens I don't think the call overheads are an issue here; they seem to be pretty much optimized away by the compiler. The clock, however, seems to be a problem. I haven't looked into it much, but even when I just start the timer and read the time immediately, it reports around 20ns, while similar clocks written in C report around 1-2ns.
I reported it as a bug because 15-20ns while doing nothing seems a bit more than it should be. I am not trying to benchmark in terms of CPU cycles, but I think the baseline should be a bit lower. Zig uses the boottime clock on Linux; maybe, like you said, the OS is context switching, but I get around 20ns consistently. Maybe it's better to use a monotonic clock instead of boottime for benchmarks.
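To check whether the clock source itself matters, a quick sketch along these lines (my own, not zBench code) compares the back-to-back read cost of CLOCK_BOOTTIME, CLOCK_MONOTONIC, and CLOCK_MONOTONIC_RAW:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

/* Average cost (ns) of two back-to-back reads of the given clock. */
static long long avg_delta_ns(clockid_t clk, int iterations) {
    struct timespec a, b;
    long long total = 0;
    for (int i = 0; i < iterations; i++) {
        clock_gettime(clk, &a);
        clock_gettime(clk, &b);
        total += (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    }
    return total / iterations;
}

int main(void) {
    const int n = 100000;
    printf("BOOTTIME:      %lldns\n", avg_delta_ns(CLOCK_BOOTTIME, n));
    printf("MONOTONIC:     %lldns\n", avg_delta_ns(CLOCK_MONOTONIC, n));
    printf("MONOTONIC_RAW: %lldns\n", avg_delta_ns(CLOCK_MONOTONIC_RAW, n));
    return 0;
}
```

If all three report similar overhead, the cost is in the clock_gettime path itself rather than in the particular clock id.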
I think this is why we switched to using the

As for the empty benchmark, can we detect that? Would be nice to have a warning of some kind there. Besides, @AshishBhattarai, I would be interested in the C code you were mentioning; could you share that?

edit: I just checked; Zig
@FObersteiner @bens Yeah, it seems like this is a limitation of the timers. Could we maybe switch to hardware performance counters? https://github.com/wcohen/libpfm4 - not sure if this is within the scope of zBench. But it would be cool if the benchmark report had a baseline on it.
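Not zBench code and not libpfm4 itself, but for context: libpfm4 is typically used to translate event names into attributes for the Linux perf_event_open syscall, and a minimal cycle counter along the lines of the perf_event_open(2) man page example looks roughly like this:

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no wrapper, so invoke the syscall directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd == -1) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under test would go here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles)) {
        perror("read");
    }
    printf("cycles: %lu\n", (unsigned long)cycles);

    close(fd);
    return 0;
}
```

Note that access may be restricted by the kernel.perf_event_paranoid setting, which is one reason this may be outside zBench's scope as a portable default.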
Yes, it doesn't seem to be to do with the Zig Timer implementation. I wrote a little C program to call

It's a lot higher than what you're seeing, no idea why. I can post the source for each, but they're not interesting.
I think this could be helpful, but not as a replacement. If a program uses fewer CPU cycles or has fewer cache misses, that doesn't necessarily mean it takes less time to run if it's spending more time switched out, for example because it's blocked on IO. I think wall-clock time is often the most important measure, even if it's noisy and difficult to measure cleanly.
That's two orders of magnitude... is this on a per-run basis? I.e., is this an offset on each iteration of a benchmark?
Just in case I'm doing something crazy, I'll put up the source. I don't often write C so there could well be dumb things wrong.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

uint64_t check_resolution(void) {
    struct timespec ts;
    clock_getres(CLOCK_BOOTTIME, &ts);
    return ((uint64_t)ts.tv_sec * 1000000000u) + (uint64_t)ts.tv_nsec;
}

uint64_t measure(void) {
    const uint64_t iterations = 100000;
    struct timespec ts1, ts2;
    uint64_t total_ns = 0;
    uint64_t i = 0;
    while (i < iterations) {
        clock_gettime(CLOCK_BOOTTIME, &ts1);
        clock_gettime(CLOCK_BOOTTIME, &ts2);
        // The idea here is to avoid dealing with nanoseconds
        // wrapping around on a second boundary.
        if (ts1.tv_nsec <= ts2.tv_nsec) {
            total_ns += (uint64_t)(ts2.tv_nsec - ts1.tv_nsec);
            i++;
        }
    }
    return total_ns / iterations;
}

int main(void) {
    // %d with a uint64_t argument is undefined behaviour; use PRIu64 instead.
    printf("resolution: %" PRIu64 "ns\n", check_resolution());
    printf("average: %" PRIu64 "ns\n", measure());
    return 0;
}
Interesting - the result seems to depend on the number of iterations and the compiler.

Not sure I totally understand what's happening here; it's not just an offset per call to
It doesn't seem to make much difference on my system, either with gcc-13.2.0 or zig cc (zig 0.12.0).
This is with the CPU governor on "performance"; if I set it to "on demand", it doesn't change a lot. But the idea of getting a baseline is looking very sensible if measuring nothing takes more than a microsecond on some systems.
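A baseline could be as simple as timing the empty measurement harness up front and subtracting it from every reported result; a rough sketch of that idea (not how zBench reports results today):

```c
#include <stdio.h>
#include <time.h>

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Average cost of an empty measurement: start/stop with nothing in between. */
static long long measure_baseline(int iterations) {
    long long total = 0;
    for (int i = 0; i < iterations; i++) {
        long long start = now_ns();
        long long stop = now_ns();
        total += stop - start;
    }
    return total / iterations;
}

int main(void) {
    long long baseline = measure_baseline(100000);

    long long start = now_ns();
    /* ... benchmark body would go here ... */
    long long raw = now_ns() - start;

    /* Clamp at zero so noise never produces a negative result. */
    long long corrected = raw > baseline ? raw - baseline : 0;
    printf("baseline: %lldns, raw: %lldns, corrected: %lldns\n",
           baseline, raw, corrected);
    return 0;
}
```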
So I played around with the Timer instance a bit more, with something like

var t = try std.time.Timer.start();
const n_runs = [6]usize{ 10, 100, 1_000, 10_000, 100_000, 1_000_000 };
inline for (n_runs) |n| {
    var noop_times: [n]u64 = undefined;
    var i: usize = 0;
    while (i < n) : (i += 1) {
        noop_times[i] = t.read();
        t.reset();
    }
    var sum: u64 = 0;
    for (noop_times) |_t| {
        sum += _t;
    }
    std.debug.print("n: {d}, sum: {d}, avg: {d}\n", .{ n, sum, sum / n });
}

and get mixed results; for example:
Curiously, n=10 seems to run "quicker" than n=100 most of the time. At higher n, results seem more consistent and converge towards ~30ns on my system, but still with some variability.
How sure can you be that the actual (true) run time is

In general, I think we should consider that this is only an issue if you have something that runs really fast. If you're in the microseconds or even milliseconds range, the relative uncertainty gets really small. But still, since we want to make fast stuff with Zig, the nanosecond range should be covered 🚀

edit 1: turns out you can plot stuff with Zig ;-) Still a bit rough, but... x-axis: timer no-op run #, y-axis: reported time in ns. Pretty spiky; it seems like a lot is going on that disturbs a "stable" delta-t.

edit 2: the code for the plot above is on my codeberg. I've added some statistics; the (1-fold) standard deviation is almost as high as the average delta-t for each no-op. So definitely a lot of noise.
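For reference, the mean and standard deviation over such no-op samples can be computed like this (a generic sketch, not the code from the codeberg repo linked above):

```c
/* link with -lm for sqrt */
#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    enum { N = 100000 };
    static long long samples[N];
    struct timespec a, b;

    /* Collect N "empty" measurements: two consecutive clock reads. */
    for (int i = 0; i < N; i++) {
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);
        samples[i] = (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    }

    double mean = 0.0;
    for (int i = 0; i < N; i++) mean += (double)samples[i];
    mean /= N;

    double var = 0.0;
    for (int i = 0; i < N; i++) {
        double d = (double)samples[i] - mean;
        var += d * d;
    }
    var /= N; /* population variance; use N - 1 for the sample variance */

    printf("mean: %.1fns, stddev: %.1fns\n", mean, sqrt(var));
    return 0;
}
```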
See my "closing remarks" on #78 - I think we can close this with a note in the readme; a 15 ns no-op time (at least; probably more) is a given, and not an issue of zBench itself.
Added a note to the readme, closed the issue.
Zig Version
0.12.0
Steps to Reproduce and Observed Behavior
Run an empty benchmark
Expected Behavior
0ns, or something to measure code that can execute faster than 15ns