15ns run time on empty benchmark #77

Closed
AshishBhattarai opened this issue May 9, 2024 · 15 comments
Labels: bug, performance

@AshishBhattarai

Zig Version

0.12.0

Steps to Reproduce and Observed Behavior

Run an empty benchmark

Expected Behavior

0ns, or something to measure code that can execute faster than 15ns

AshishBhattarai added the bug label on May 9, 2024
@AshishBhattarai (Author) commented May 9, 2024

Tested on Linux; the issue seems to be with Zig's standard timer itself.
Maybe consider adding a baseline time to the report?

Here is what I am doing:

const std = @import("std");
const zbench = @import("zbench");

const Runn = struct {
    pub fn run(_: Runn, _: std.mem.Allocator) void {}
};

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    var bench = zbench.Benchmark.init(allocator, .{});

    const runn: Runn = .{};

    try bench.addParam("run empty", &runn, .{});
    defer bench.deinit();

    try bench.run(std.io.getStdOut().writer());
}

Result

benchmark              runs     total time     time/run (avg ± σ)     (min ... max)                p75        p99        p995      
-----------------------------------------------------------------------------------------------------------------------------
run empty              65535    1.013ms        15ns ± 5ns             (10ns ... 31ns)              20ns       21ns       21ns  

@bens (Collaborator) commented May 9, 2024

The measurement code is at

var t = try std.time.Timer.start();

Looking at this again, the switch should ideally be excluded from what's measured, but there will always be the overhead of calling to and returning from a function pointer, since we can't inline the function body into that runImpl code. The benchmarking process can also be context-switched away between the timer being started and stopped on most operating systems, so it's not reasonable to expect the result to always be a flat 0ns. If your interest is in benchmarking things where these nanoseconds really matter, maybe something like Sushi Roll would be more appropriate? There will always be enough noise on ordinary user systems to make measurements this precise very difficult.

Going to special effort to account for the overhead of a single function call seems like a feature rather than a bug fix, and that overhead is likely to be completely dominated by the run time of most benchmarks people would want to write. Micro-benchmarking properly is a niche and difficult topic - again, take a look at the lengths Sushi Roll goes to for that.

@bens (Collaborator) commented May 9, 2024

To be a bit more constructive: maybe do what you want to benchmark in a loop inside your benchmark function; then you get an average time over many executions, which makes the function-call overhead affect the result a lot less. You'll then have the overhead of looping back each time, so you might want to do some loop unrolling to mitigate that, though the compiler might do it for you. You probably want to examine the assembly with godbolt.org to make sure it's actually benchmarking what you intend to benchmark.
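A minimal sketch of this loop-amortization idea, assuming zbench's fn (std.mem.Allocator) void benchmark shape; tinyOp and the iteration count are made up for illustration:

const std = @import("std");

// Hypothetical tiny operation that runs faster than the ~15ns overhead.
inline fn tinyOp(x: u64) u64 {
    return x *% 2654435761;
}

// Run the operation many times inside one benchmark invocation, then divide
// the reported time/run by `iterations` afterwards to estimate the per-op cost.
pub fn benchTinyOpLoop(_: std.mem.Allocator) void {
    const iterations: usize = 10_000;
    var acc: u64 = 0xdeadbeef;
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        acc = tinyOp(acc);
    }
    // Keep the result observable so the optimizer cannot delete the loop.
    std.mem.doNotOptimizeAway(acc);
}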

@AshishBhattarai (Author) commented May 9, 2024

@bens I don't think the call overheads are an issue here; they seem to be pretty well optimized by the compiler. The clock, however, seems to be a problem. I haven't looked into it much, but even when I just start the timer and read the time immediately, it reports around 20ns, while similar clocks written in C report around 1-2ns.
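For reference, a minimal sketch of the measurement being described here (start the timer, read it immediately, and print the delta):

const std = @import("std");

pub fn main() !void {
    // No work happens between start() and read(); whatever this prints is
    // pure timer overhead plus clock resolution on the current system.
    var t = try std.time.Timer.start();
    const elapsed = t.read();
    std.debug.print("empty read: {d}ns\n", .{elapsed});
}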

@AshishBhattarai (Author) commented May 9, 2024

I reported it as a bug because 15-20ns while doing nothing seems higher than it should be. I am not trying to benchmark in terms of CPU cycles, but I think the baseline should be a bit lower.

Zig uses the boottime clock on Linux. Maybe, like you said, the OS is context switching, but I get around 20ns consistently, which leads me to assume that something is wrong with the timer.

Maybe it's better to use a monotonic clock instead of boottime for benchmarks.

edit:
Just to clarify, nothing is wrong with the timer; this is as precise as it can get.

@FObersteiner (Collaborator) commented May 9, 2024

Maybe it's better to use a monotonic clock instead of boottime for benchmarks.

I think this is why we switched to using the Timer that Zig offers, to get a monotonic clock. But keep in mind, monotonic does not mean more accurate. That still depends on what the OS gives you. Also, OS context switching happens on a different order of magnitude (milliseconds) to my knowledge, so I suppose this is not the issue here.

As for the empty benchmark, can we detect that? Would be nice to have a warning of some kind there.

Besides, @AshishBhattarai I would be interested in the C code you were mentioning; could you share that?


edit: I just checked; Zig Timer uses CLOCK.BOOTTIME on Linux, which is monotonic afaik. But again, we're OS-dependent here ;-) Things might be different on Windows for instance.

@AshishBhattarai (Author) commented May 9, 2024

@FObersteiner @bens Yeah, it seems like this is a limitation of the timers. Could we maybe switch to hardware performance counters (https://github.com/wcohen/libpfm4)? Not sure if this is within the scope of zBench.

But it would be cool if the benchmark report had a baseline on it.
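A hedged sketch of what such a baseline could look like, measured independently of zBench (the sample count and reporting format are made up):

const std = @import("std");

// Estimate the no-op/timer overhead by timing an empty region many times;
// printing this next to the benchmark results lets readers subtract it
// from very small measurements themselves.
fn estimateBaselineNs(samples: usize) !u64 {
    var t = try std.time.Timer.start();
    var total: u64 = 0;
    var i: usize = 0;
    while (i < samples) : (i += 1) {
        t.reset();
        total += t.read(); // nothing happens between reset and read
    }
    return total / samples;
}

pub fn main() !void {
    const baseline = try estimateBaselineNs(100_000);
    std.debug.print("timer baseline: ~{d}ns per measurement\n", .{baseline});
}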

@bens (Collaborator) commented May 9, 2024

Yes, it doesn't seem to be related to the Zig Timer implementation; I wrote a little C program that calls clock_gettime(CLOCK_BOOTTIME, &ts1) twice and got the same result as in Zig.

$ gcc -o overhead overhead.c -lc -O3
$ ./overhead 
resolution: 1ns
average: 1497ns
$ zig build -Doptimize=ReleaseFast test_examples
$ ./zig-out/bin/overhead
std.time.Timer.read overhead: 1466.061ns
std.time.Timer.lap  overhead: 1465.965ns

It's a lot higher than you're seeing, no idea why. I can post the source for each but they're not interesting.

Could we maybe switch to hardware performance counters?

I think this could be helpful but not a replacement. If a program uses fewer CPU cycles or has fewer cache misses, that doesn't necessarily mean it takes less time to run if it's spending more time switched out because it's blocked on IO for example. I think wall clock time is often most important, even if it's noisy and difficult to get a clean measurement.

@FObersteiner (Collaborator)
It's a lot higher than you're seeing, no idea why

That's two orders of magnitude... Is this on a per-run basis? I.e., is this an offset on each iteration of a benchmark?

@bens (Collaborator) commented May 9, 2024

Just in case I'm doing something crazy, I'll put up the source. I don't often write C so there could well be dumb things wrong.

#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

uint64_t check_resolution(void) {
  struct timespec ts;
  clock_getres(CLOCK_BOOTTIME, &ts);
  return ((uint64_t)ts.tv_sec * 1000000000) + (uint64_t)ts.tv_nsec;
}

uint64_t measure(void) {
  const uint64_t iterations = 100000;
  struct timespec ts1, ts2;
  uint64_t total_ns = 0;
  uint64_t i = 0;
  while (i < iterations) {
    clock_gettime(CLOCK_BOOTTIME, &ts1);
    clock_gettime(CLOCK_BOOTTIME, &ts2);
    // The idea here is to avoid dealing with nanoseconds
    // wrapping around on a second boundary.
    if (ts1.tv_nsec <= ts2.tv_nsec) {
      total_ns += (uint64_t)(ts2.tv_nsec - ts1.tv_nsec);
      i++;
    }
  }
  return total_ns / iterations;
}

int main(void) {
  // PRIu64 avoids undefined behavior from printing uint64_t with %d.
  printf("resolution: %" PRIu64 "ns\n", check_resolution());
  printf("average: %" PRIu64 "ns\n", measure());
  return 0;
}

@FObersteiner (Collaborator)
Just in case I'm doing something crazy, I'll put up the source.

Interesting, the result seems to depend on the number of iterations and the compiler.

  • At 10 iterations, I start with a noisy 50ns (sometimes more, sometimes less) with zig cc (clang; Zig 0.13-dev); if I go up to 1e7 iterations, I get 11ns very consistently
  • With gcc (11.4), I start at around 30ns and also get down to 11ns

Not sure I totally understand what's happening here; it's not just a fixed offset per call to clock_gettime. At least it seems to need some spin-up time to reach a constant baseline.

@bens (Collaborator) commented May 9, 2024

It doesn't seem to make much difference on my system, either with gcc-13.2.0 or zig cc (zig 0.12.0).

$ ./overhead 
average(10):     1578ns
average(100):    1563ns
average(1000):   1572ns
average(10000):  1510ns
average(100000): 1495ns

This is with the CPU governor set to "performance"; if I set it to "ondemand", it doesn't change a lot. But the idea of getting a baseline is looking very sensible if measuring nothing takes more than a microsecond on some systems.
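If it helps with reproducing this, here is a small sketch (Linux-only; the sysfs path uses cpu0 as a representative core) for checking the active governor before benchmarking:

const std = @import("std");

pub fn main() !void {
    var buf: [64]u8 = undefined;
    // Reads e.g. "performance" or "ondemand"; the path exists only on
    // Linux systems with cpufreq enabled.
    const governor = try std.fs.cwd().readFile(
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
        &buf,
    );
    std.debug.print("cpu0 governor: {s}\n", .{governor});
}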

@FObersteiner (Collaborator) commented May 11, 2024

So I played around with the Timer a bit more, with something like:

    var t = try std.time.Timer.start();
    const n_runs = [6]usize{ 10, 100, 1_000, 10_000, 100_000, 1_000_000 };
    inline for (n_runs) |n| {
        var noop_times: [n]u64 = undefined;
        var i: usize = 0;
        while (i < n) : (i += 1) {
            noop_times[i] = t.read();
            t.reset();
        }
        var sum: u64 = 0;
        for (noop_times) |_t| {
            sum += _t;
        }
        std.debug.print("n: {d}, sum: {d}, avg: {d}\n", .{ n, sum, sum / n });
    }

and get mixed results; for example:

n: 10, sum: 832, avg: 83
n: 100, sum: 16069, avg: 160
n: 1000, sum: 38054, avg: 38
n: 10000, sum: 424879, avg: 42
n: 100000, sum: 3401798, avg: 34
n: 1000000, sum: 32128462, avg: 32

Curiously, n=10 seems to run "quicker" than n=100 most of the time. At higher n, results seem more consistent and converge towards ~30ns on my system, but still with some variability.
So there definitely is some kind of baseline; however, I'm not sure how stable it is. This is hard to test, I guess; if you do something like

[start timer] -> [get timer value; t_baseline] -> [reset timer] 
-> [run function to be benchmarked] -> [get timer value; t_benchmark]

how sure can you be that the actual (true) run time is t_benchmark - t_baseline?

In general, I think we should consider that this is only an issue if you have something that runs really fast. If you're in the microseconds or even milliseconds range, the relative uncertainty gets really small. But still, since we want to make fast stuff with Zig, the nanosecond range should be covered 🚀


edit 1: turns out you can plot stuff with Zig ;-) Still a bit rough but...

[plot omitted: timer no-op times per run]

x-axis: timer no-op run #, y-axis: reported time in ns. Pretty spiky, seems like a lot is going on that disturbs a "stable" delta-t.


edit 2: the code for the plot above is on my Codeberg. I've added some statistics; the (1σ) standard deviation is almost as high as the average delta-t for each no-op. So there is definitely a lot of noise.

@FObersteiner (Collaborator) commented Aug 16, 2024

See my "closing remarks" on #78 - I think we can close this with a note in the readme; a 15ns no-op time (at least; probably more) is a given and not an issue with zBench itself.

@hendriknielaender (Owner)
Added a note to the readme and closed the issue.
