15ns run time on empty benchmark #77

Closed
AshishBhattarai opened this issue May 9, 2024 · 15 comments
Labels: bug, performance

@AshishBhattarai

Zig Version

0.12.0

Steps to Reproduce and Observed Behavior

Run an empty benchmark

Expected Behavior

0ns, or something to measure code that can execute faster than 15ns

AshishBhattarai added the bug label on May 9, 2024
@AshishBhattarai (Author) commented May 9, 2024

Tested on Linux; the issue seems to be with Zig's standard timer itself.
Maybe consider adding a baseline time to the report?

Here is what I am doing:

const std = @import("std");
const zbench = @import("zbench");

const Runn = struct {
    pub fn run(_: Runn, _: std.mem.Allocator) void {}
};

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    var bench = zbench.Benchmark.init(allocator, .{});

    const runn: Runn = .{};

    try bench.addParam("run empty", &runn, .{});
    defer bench.deinit();

    try bench.run(std.io.getStdOut().writer());
}

Result

benchmark              runs     total time     time/run (avg ± σ)     (min ... max)                p75        p99        p995      
-----------------------------------------------------------------------------------------------------------------------------
run empty              65535    1.013ms        15ns ± 5ns             (10ns ... 31ns)              20ns       21ns       21ns  

@bens (Collaborator) commented May 9, 2024

The measurement code is at

var t = try std.time.Timer.start();

Looking at this again, the switch should ideally be excluded from what's measured, but there will always be the overhead of calling to and returning from a function pointer, since we can't inline the function body into that runImpl code. The benchmarking process can also be context-switched away between the timer being started and stopped on most operating systems, so it's not reasonable to expect the result to always be a flat 0ns. If your interest is in benchmarking things where these nanoseconds really matter, maybe something like Sushi Roll would be more appropriate? There will always be enough noise on ordinary user systems to make measurements this precise very difficult.

Going to special effort to account for the overhead of a single function call seems like a feature rather than a bug fix, and that overhead is likely to be completely dominated by the run time of most benchmarks people would want to write. Micro-benchmarking properly is a niche and difficult topic - again, take a look at the lengths Sushi Roll goes to for that.

@bens (Collaborator) commented May 9, 2024

To be a bit more constructive: maybe do what you want to benchmark in a loop inside your benchmark function; then you get an average time over many executions, which makes the function-call overhead affect the result a lot less. You'll then have the overhead of looping back each time, so you might want to do some loop unrolling to mitigate that, though the compiler might do it for you. You probably want to examine the assembly with godbolt.org to make sure it's actually benchmarking what you intend to benchmark.
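A minimal sketch of this loop-amortization idea, assuming zbench's fn (std.mem.Allocator) void benchmark shape; tinyOp and the iteration count are made up for illustration:

const std = @import("std");

// Hypothetical tiny operation that runs faster than the ~15ns overhead.
inline fn tinyOp(x: u64) u64 {
    return x *% 2654435761;
}

// Run the operation many times inside one benchmark invocation, then divide
// the reported time/run by `iterations` afterwards to estimate the per-op cost.
pub fn benchTinyOpLoop(_: std.mem.Allocator) void {
    const iterations: usize = 10_000;
    var acc: u64 = 0xdeadbeef;
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        acc = tinyOp(acc);
    }
    // Keep the result observable so the optimizer cannot delete the loop.
    std.mem.doNotOptimizeAway(acc);
}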

@AshishBhattarai (Author) commented May 9, 2024

@bens I don't think the call overheads are an issue here; they seem to be pretty well optimized by the compiler. The clock, however, seems to be a problem. I haven't looked into it much, but even when I just start the timer and read the time immediately, it reports around 20ns, while similar clocks written in C report around 1-2ns.
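For reference, a minimal sketch of the measurement being described here (start the timer, read it immediately, and print the delta):

const std = @import("std");

pub fn main() !void {
    // No work happens between start() and read(); whatever this prints is
    // pure timer overhead plus clock resolution on the current system.
    var t = try std.time.Timer.start();
    const elapsed = t.read();
    std.debug.print("empty read: {d}ns\n", .{elapsed});
}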

@AshishBhattarai (Author) commented May 9, 2024

I reported it as a bug because 15-20ns while doing nothing seems higher than it should be. I am not trying to benchmark in terms of CPU cycles, but I think the baseline should be a bit lower.

Zig uses the boottime clock on Linux. Maybe, like you said, the OS is context switching, but I get around 20ns consistently, which leads me to assume that something is wrong with the timer.

Maybe it's better to use a monotonic clock instead of boottime for benchmarks.

edit:
Just to clarify, nothing is wrong with the timer; this is as precise as it can get.

@FObersteiner (Collaborator) commented May 9, 2024

Maybe it's better to use a monotonic clock instead of boottime for benchmarks.

I think this is why we switched to using the Timer that Zig offers, to get a monotonic clock. But keep in mind, monotonic does not mean more accurate. That still depends on what the OS gives you. Also, OS context switching happens on a different order of magnitude (milliseconds) to my knowledge, so I suppose this is not the issue here.

As for the empty benchmark, can we detect that? Would be nice to have a warning of some kind there.

Besides, @AshishBhattarai I would be interested in the C code you were mentioning; could you share that?


edit: I just checked; Zig Timer uses CLOCK.BOOTTIME on Linux, which is monotonic afaik. But again, we're OS-dependent here ;-) Things might be different on Windows for instance.

@AshishBhattarai (Author) commented May 9, 2024

@FObersteiner @bens Yeah, it seems like this is a limitation of the timers. Could we maybe switch to hardware performance counters (https://github.com/wcohen/libpfm4)? Not sure if this is within the scope of zBench.

But it would be cool if the benchmark report had a baseline on it.
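A hedged sketch of what such a baseline could look like, measured independently of zBench (the sample count and reporting format are made up):

const std = @import("std");

// Estimate the no-op/timer overhead by timing an empty region many times;
// printing this next to the benchmark results lets readers subtract it
// from very small measurements themselves.
fn estimateBaselineNs(samples: usize) !u64 {
    var t = try std.time.Timer.start();
    var total: u64 = 0;
    var i: usize = 0;
    while (i < samples) : (i += 1) {
        t.reset();
        total += t.read(); // nothing happens between reset and read
    }
    return total / samples;
}

pub fn main() !void {
    const baseline = try estimateBaselineNs(100_000);
    std.debug.print("timer baseline: ~{d}ns per measurement\n", .{baseline});
}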

@bens (Collaborator) commented May 9, 2024

Yes, it doesn't seem to be related to the Zig Timer implementation; I wrote a little C program that calls clock_gettime(CLOCK_BOOTTIME, &ts1) twice and got the same result as in Zig.

$ gcc -o overhead overhead.c -lc -O3
$ ./overhead 
resolution: 1ns
average: 1497ns
$ zig build -Doptimize=ReleaseFast test_examples
$ ./zig-out/bin/overhead
std.time.Timer.read overhead: 1466.061ns
std.time.Timer.lap  overhead: 1465.965ns

It's a lot higher than you're seeing, no idea why. I can post the source for each but they're not interesting.

Could we maybe switch to hardware performance counters?

I think this could be helpful but not a replacement. If a program uses fewer CPU cycles or has fewer cache misses, that doesn't necessarily mean it takes less time to run if it's spending more time switched out because it's blocked on IO for example. I think wall clock time is often most important, even if it's noisy and difficult to get a clean measurement.

@FObersteiner (Collaborator)
It's a lot higher than you're seeing, no idea why

That's two orders of magnitude... Is this on a per-run basis? I.e., is this an offset on each iteration of a benchmark?

@bens (Collaborator) commented May 9, 2024

Just in case I'm doing something crazy, I'll put up the source. I don't often write C so there could well be dumb things wrong.

#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

uint64_t check_resolution(void) {
  struct timespec ts;
  clock_getres(CLOCK_BOOTTIME, &ts);
  return ((uint64_t)ts.tv_sec * 1000000000) + (uint64_t)ts.tv_nsec;
}

uint64_t measure(void) {
  const uint64_t iterations = 100000;
  struct timespec ts1, ts2;
  uint64_t total_ns = 0;
  uint64_t i = 0;
  while (i < iterations) {
    clock_gettime(CLOCK_BOOTTIME, &ts1);
    clock_gettime(CLOCK_BOOTTIME, &ts2);
    // The idea here is to avoid dealing with nanoseconds
    // wrapping around on a second boundary.
    if (ts1.tv_nsec <= ts2.tv_nsec) {
      total_ns += (uint64_t)(ts2.tv_nsec - ts1.tv_nsec);
      i++;
    }
  }
  return total_ns / iterations;
}

int main(void) {
  // PRIu64 avoids undefined behavior from printing uint64_t with %d.
  printf("resolution: %" PRIu64 "ns\n", check_resolution());
  printf("average: %" PRIu64 "ns\n", measure());
  return 0;
}

@FObersteiner (Collaborator)
Just in case I'm doing something crazy, I'll put up the source.

Interesting, the result seems to depend on the number of iterations and the compiler.

  • At 10 iterations, I start with a noisy 50ns (sometimes more, sometimes less) with zig cc (clang; Zig 0.13-dev); if I go up to 1e7 iterations, I get 11ns very consistently
  • With gcc (11.4), I start at around 30ns and also get down to 11ns

Not sure I totally understand what's happening here; it's not just a fixed offset per call to clock_gettime. At least it seems to need some spin-up time to reach a constant baseline.

@bens (Collaborator) commented May 9, 2024

It doesn't seem to make much difference on my system, either with gcc-13.2.0 or zig cc (zig 0.12.0).

$ ./overhead 
average(10):     1578ns
average(100):    1563ns
average(1000):   1572ns
average(10000):  1510ns
average(100000): 1495ns

This is with the CPU governor set to "performance"; if I set it to "ondemand", it doesn't change a lot. But the idea of getting a baseline is looking very sensible if measuring nothing takes more than a microsecond on some systems.
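If it helps with reproducing this, here is a small sketch (Linux-only; the sysfs path uses cpu0 as a representative core) for checking the active governor before benchmarking:

const std = @import("std");

pub fn main() !void {
    var buf: [64]u8 = undefined;
    // Reads e.g. "performance" or "ondemand"; the path exists only on
    // Linux systems with cpufreq enabled.
    const governor = try std.fs.cwd().readFile(
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
        &buf,
    );
    std.debug.print("cpu0 governor: {s}\n", .{governor});
}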

@FObersteiner (Collaborator) commented May 11, 2024

So I played around with the Timer a bit more, with something like:

    var t = try std.time.Timer.start();
    const n_runs = [6]usize{ 10, 100, 1_000, 10_000, 100_000, 1_000_000 };
    inline for (n_runs) |n| {
        var noop_times: [n]u64 = undefined;
        var i: usize = 0;
        while (i < n) : (i += 1) {
            noop_times[i] = t.read();
            t.reset();
        }
        var sum: u64 = 0;
        for (noop_times) |_t| {
            sum += _t;
        }
        std.debug.print("n: {d}, sum: {d}, avg: {d}\n", .{ n, sum, sum / n });
    }

and get mixed results; for example:

n: 10, sum: 832, avg: 83
n: 100, sum: 16069, avg: 160
n: 1000, sum: 38054, avg: 38
n: 10000, sum: 424879, avg: 42
n: 100000, sum: 3401798, avg: 34
n: 1000000, sum: 32128462, avg: 32

Curiously, n=10 seems to run "quicker" than n=100 most of the time. At higher n, results seem more consistent and converge towards ~30ns on my system, but still with some variability.
So there definitely is some kind of baseline; however, I'm not sure how stable it is. This is hard to test, I guess; if you do something like

[start timer] -> [get timer value; t_baseline] -> [reset timer] 
-> [run function to be benchmarked] -> [get timer value; t_benchmark]

how sure can you be that the actual (true) run time is t_benchmark - t_baseline?

In general, I think we should consider that this is only an issue if you have something that runs really fast. If you're in the microseconds or even milliseconds range, the relative uncertainty gets really small. But still, since we want to make fast stuff with Zig, the nanosecond range should be covered 🚀


edit 1: turns out you can plot stuff with Zig ;-) Still a bit rough but...

[plot omitted: timer no-op times per run]

x-axis: timer no-op run #, y-axis: reported time in ns. Pretty spiky, seems like a lot is going on that disturbs a "stable" delta-t.


edit 2: the code for the plot above is on my Codeberg. I've added some statistics; the (1σ) standard deviation is almost as high as the average delta-t for each no-op. So there is definitely a lot of noise.

@FObersteiner (Collaborator) commented Aug 16, 2024

See my "closing remarks" on #78 - I think we can close this with a note in the readme; a 15ns no-op time (at least; probably more) is a given and not an issue with zBench itself.

@hendriknielaender (Owner)
Added a note to the readme and closed the issue.
