halide_benchmark.h does not work with GPU filters. #2278
Comments
FWIW, extant uses of this with GPU code seem to insert manual device_sync() calls into the lambda (e.g. test/performance/wrap.cpp)... which is suboptimal-but-maybe-adequate as long as iterations=1.
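For reference, a minimal sketch of that workaround, assuming a hypothetical AOT-compiled pipeline named `filter` and the `benchmark(samples, iterations, op)` overload from halide_benchmark.h:

```cpp
#include "HalideBuffer.h"
#include "halide_benchmark.h"

// Hypothetical AOT-compiled pipeline that leaves its result in a GPU
// device buffer.
extern "C" int filter(halide_buffer_t *in, halide_buffer_t *out);

double time_filter(Halide::Runtime::Buffer<float> &in,
                   Halide::Runtime::Buffer<float> &out) {
    return Halide::Tools::benchmark(10 /*samples*/, 1 /*iterations*/, [&]() {
        filter(in, out);
        out.device_sync();  // block until the queued GPU work has finished
    });
}
```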
Correctly benchmarking GPU stuff is subtle enough that, even knowing the issues, I still get it wrong some of the time. Also, we encourage use of this header for writing CPU benchmarks; if something extra has to be done to get it to report useful information for GPUs, that hinders our efforts to optimize GPU performance. (Worse, it still appears to report timings; they just don't reflect the time to run, because only the queuing of the requests has been timed. And if there is no external synchronization, one can end up with bugs, e.g. not doing a copy_to_host before a correctness check after the benchmark. This is mainly an issue because someone knowledgeable will assume the synchronization happened inside the benchmark, as it has to in order to report useful timings.)
Only running a single iteration is problematic, as it does not give statistical significance and likely overstates the queuing overhead. Running a bunch of iterations and synchronizing at the end may give a more representative result, and I often want to see the trend as iterations increase. To do it right, one really needs to time multiple runs with different synchronization for each; this would report broken-out timings for, e.g., the time to issue the request, the times for any copies to and from the device, and the time to actually execute the kernel. The simple thing to do is to add a buffer parameter that provides the thing to sync on. The harness would do two sets of the requested number of iterations: the first set would sync on the buffer after each iteration; the second set would queue all N iterations and sync after all of them have run. Both times would be reported, as sketched below. (This presumes the filter can be rerun successively on the same inputs and output, though I believe that is assumed for CPU operation as well, unless the lambda takes care to do something.)
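A rough sketch of that two-set harness, built on the existing benchmark() overload (the names benchmark_gpu and sync_buf are my invention, not an existing API):

```cpp
#include <utility>
#include "HalideBuffer.h"
#include "halide_benchmark.h"

// Run the op two ways: syncing on sync_buf after every iteration, and
// syncing only once after all iterations are queued. Report both.
template<typename F>
std::pair<double, double> benchmark_gpu(int samples, int iterations,
                                        Halide::Runtime::Buffer<> &sync_buf,
                                        F op) {
    // Per-iteration sync: every iteration pays launch + execution time.
    double synced = Halide::Tools::benchmark(samples, iterations, [&]() {
        op();
        sync_buf.device_sync();
    });
    // Queue all N iterations, then sync once: amortizes queuing overhead.
    double queued = Halide::Tools::benchmark(samples, 1, [&]() {
        for (int i = 0; i < iterations; i++) op();
        sync_buf.device_sync();
    });
    return {synced, queued / iterations};  // both are per-iteration times
}
```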
At the very least, we should add a caveat to the header saying not to use
it for GPU timing.
A handful of use cases already explicitly call device_sync() as part of the lambda; I'm presuming here that what you are requesting is the flexibility to call device_sync() in a per-benchmark-specific way. It would be simple enough to revise the API to be something like this:
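```cpp
#include <cstdint>
#include <functional>

// Hypothetical sketch of such an overload: an optional device_sync
// callback, invoked after each set of iterations so that the queued GPU
// work is included in the measured time.
double benchmark(uint64_t samples, uint64_t iterations,
                 std::function<void()> op,
                 std::function<void()> device_sync = nullptr);
```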
where device_sync (if specified) will be called after every set of iterations; thus we could do something like:
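```cpp
// Hypothetical usage of the overload sketched above; filter, in, and out
// stand in for an AOT pipeline and its buffers.
double t = benchmark(10 /*samples*/, 100 /*iterations*/,
                     [&]() { filter(in, out); },
                     [&]() { out.device_sync(); });
```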
...that still makes the situation a bit ticklish and error-prone, though, as the caller has to set things up correctly and it's easy to get wrong. FWIW, I'm not wild about the design of this benchmarking utility in the first place, as choosing useful values for samples and iterations is itself a bit ticklish; I'm not sure whether there was a deliberate design decision behind it or whether it was just a quick-n-dirty utility whose usage has expanded over time. A purpose-built benchmarking utility that knows about Halide stages would likely be a lot more useful, especially for GPU code. I'll work on a proposal for such a thing.
Even though it's not ideal, I think this API would be really useful. I'm currently hacking a version of this up for a paper submission, and I think it's useful to be able to answer some performance questions about GPU stages, such as separating the fixed launch/queuing overhead from the per-iteration kernel execution time. I think a simple two-parameter regression over iteration counts would tell us this with this API. In addition, I imagine that an autotuning loop using the GPU, based on RunGen, is going to be necessary to extend the autoscheduler to work well for GPUs.
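Concretely, a minimal sketch of that regression, assuming a hypothetical run_batch(n) that queues n runs of the filter and syncs once at the end: fitting t(n) = a + b*n gives the fixed overhead as a and the per-iteration kernel time as b.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical: wall-clock time for n queued iterations plus one sync.
double run_batch(int n);

int main() {
    std::vector<double> xs, ys;
    for (int n = 1; n <= 64; n *= 2) {
        xs.push_back(n);
        ys.push_back(run_batch(n));
    }
    // Ordinary least squares for t = a + b*n.
    double sx = 0, sy = 0, sxx = 0, sxy = 0, m = (double)xs.size();
    for (size_t i = 0; i < xs.size(); i++) {
        sx += xs[i]; sy += ys[i];
        sxx += xs[i] * xs[i]; sxy += xs[i] * ys[i];
    }
    double b = (m * sxy - sx * sy) / (m * sxx - sx * sx);
    double a = (sy - b * sx) / m;
    printf("fixed overhead ~= %g s, kernel time ~= %g s/iter\n", a, b);
    return 0;
}
```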
All of those pipelines have been large enough that adding a device_sync inside the lambda is fine, and I believe that's what RunGen does.
The benchmark function in halide_benchmark.h does not do any synchronization for GPU filters; thus, for filters that return a GPU device buffer, it times only how long it takes to queue the kernel requests. The design of the harness means one cannot insert explicit synchronization in the right place.
Possible resolutions include: documenting the behavior and leaving it as is; adding an extra argument that is a buffer to call halide_device_sync on before ending the timing loop (if non-null), as sketched below; or adding a general function-object argument to do arbitrary synchronization inside the timing loop.
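A minimal sketch of the buffer-argument option (benchmark_with_sync is a hypothetical name, not part of the header):

```cpp
#include <chrono>
#include "HalideRuntime.h"

// Hypothetical: time `op` over `iterations` runs; if sync_buf is non-null,
// wait for its queued device work to finish before stopping the clock.
template<typename F>
double benchmark_with_sync(int iterations, F op, halide_buffer_t *sync_buf) {
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; i++) op();
    if (sync_buf) halide_device_sync(nullptr, sync_buf);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / iterations;
}
```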