halide_benchmark.h does not work with GPU filters. #2278
Comments
FWIW, extant uses of this with GPU code seem to insert manual device_sync() calls into the lambda (e.g. test/performance/wrap.cpp)... which is suboptimal-but-maybe-adequate as long as iterations=1.
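For reference, a minimal sketch of that workaround, assuming a hypothetical AOT-compiled pipeline named `filter` and the `benchmark(samples, iterations, op)` overload from halide_benchmark.h:

```cpp
#include "HalideBuffer.h"
#include "halide_benchmark.h"

// Hypothetical AOT-compiled pipeline that leaves its result in a GPU
// device buffer.
extern "C" int filter(halide_buffer_t *in, halide_buffer_t *out);

double time_filter(Halide::Runtime::Buffer<float> &in,
                   Halide::Runtime::Buffer<float> &out) {
    return Halide::Tools::benchmark(10 /*samples*/, 1 /*iterations*/, [&]() {
        filter(in, out);
        out.device_sync();  // block until the queued GPU work has finished
    });
}
```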
Correctly benchmarking GPU stuff is subtle enough that, even knowing the issues, I still get it wrong some of the time. Also, we encourage use of this header for writing CPU benchmarks; if something extra has to be done to get it to report useful information for GPUs, that hinders our efforts to optimize GPU performance. (Worse, it still appears to report timings; they just don't reflect the time to run, because only the queuing of the requests has been timed. And if there is no external synchronization, one can end up with bugs, e.g. not doing a copy_to_host before a correctness check after the benchmark. This is mainly an issue because someone knowledgeable will assume the synchronization happened inside the benchmark, as it has to in order to report useful timings.)
Only running a single iteration is problematic, as it does not give statistical significance and likely overstates the queuing overhead. Running a bunch of iterations and synchronizing at the end may give a more representative result, and I often want to see the trend as iterations increase. To do it right, one really needs to time multiple runs with different synchronization for each; this would report broken-out timings for, e.g., the time to issue the request, the times for any copies to and from the device, and the time to actually execute the kernel. The simple thing to do is to add a buffer parameter that provides the thing to sync on. The harness would do two sets of the requested number of iterations: the first set would sync on the buffer after each iteration; the second set would queue all N iterations and sync after all of them have run. Both times would be reported, as sketched below. (This presumes the filter can be rerun successively on the same inputs and output, though I believe that is assumed for CPU operation as well, unless the lambda takes care to do something.)
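A rough sketch of that two-set harness, built on the existing benchmark() overload (the names benchmark_gpu and sync_buf are my invention, not an existing API):

```cpp
#include <utility>
#include "HalideBuffer.h"
#include "halide_benchmark.h"

// Run the op two ways: syncing on sync_buf after every iteration, and
// syncing only once after all iterations are queued. Report both.
template<typename F>
std::pair<double, double> benchmark_gpu(int samples, int iterations,
                                        Halide::Runtime::Buffer<> &sync_buf,
                                        F op) {
    // Per-iteration sync: every iteration pays launch + execution time.
    double synced = Halide::Tools::benchmark(samples, iterations, [&]() {
        op();
        sync_buf.device_sync();
    });
    // Queue all N iterations, then sync once: amortizes queuing overhead.
    double queued = Halide::Tools::benchmark(samples, 1, [&]() {
        for (int i = 0; i < iterations; i++) op();
        sync_buf.device_sync();
    });
    return {synced, queued / iterations};  // both are per-iteration times
}
```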
At the very least, we should add a caveat to the header saying not to use
it for GPU timing.
A handful of use cases already explicitly call device_sync() as part of the lambda; I'm presuming here that what you are requesting is the flexibility to call device_sync() in a per-benchmark-specific way. It would be simple enough to revise the API to be something like this:
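```cpp
#include <cstdint>
#include <functional>

// Hypothetical sketch of such an overload: an optional device_sync
// callback, invoked after each set of iterations so that the queued GPU
// work is included in the measured time.
double benchmark(uint64_t samples, uint64_t iterations,
                 std::function<void()> op,
                 std::function<void()> device_sync = nullptr);
```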
where device_sync (if specified) will be called after every set of iterations; thus we could do something like:
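```cpp
// Hypothetical usage of the overload sketched above; filter, in, and out
// stand in for an AOT pipeline and its buffers.
double t = benchmark(10 /*samples*/, 100 /*iterations*/,
                     [&]() { filter(in, out); },
                     [&]() { out.device_sync(); });
```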
...that still makes the situation a bit ticklish and error-prone, though, as the caller has to set things up correctly and it's easy to get wrong. FWIW, I'm not wild about the design of this benchmarking utility in the first place, as choosing useful values for samples and iterations is itself a bit ticklish; I'm not sure whether there was a deliberate design decision behind it or whether it was just a quick-n-dirty utility whose usage has expanded over time. A purpose-built benchmarking utility that knows about Halide stages would likely be a lot more useful, especially for GPU code. I'll work on a proposal for such a thing.
Even though it's not ideal, I think this API would be really useful. I'm currently hacking a version of this up for a paper submission, and I think it's useful to be able to answer some performance questions about GPU stages, such as separating the fixed launch/queuing overhead from the per-iteration kernel execution time. I think a simple two-parameter regression over iteration counts would tell us this with this API. In addition, I imagine that an autotuning loop using the GPU, based on RunGen, is going to be necessary to extend the autoscheduler to work well for GPUs.
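Concretely, a minimal sketch of that regression, assuming a hypothetical run_batch(n) that queues n runs of the filter and syncs once at the end: fitting t(n) = a + b*n gives the fixed overhead as a and the per-iteration kernel time as b.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical: wall-clock time for n queued iterations plus one sync.
double run_batch(int n);

int main() {
    std::vector<double> xs, ys;
    for (int n = 1; n <= 64; n *= 2) {
        xs.push_back(n);
        ys.push_back(run_batch(n));
    }
    // Ordinary least squares for t = a + b*n.
    double sx = 0, sy = 0, sxx = 0, sxy = 0, m = (double)xs.size();
    for (size_t i = 0; i < xs.size(); i++) {
        sx += xs[i]; sy += ys[i];
        sxx += xs[i] * xs[i]; sxy += xs[i] * ys[i];
    }
    double b = (m * sxy - sx * sy) / (m * sxx - sx * sx);
    double a = (sy - b * sx) / m;
    printf("fixed overhead ~= %g s, kernel time ~= %g s/iter\n", a, b);
    return 0;
}
```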
All of those pipelines have been large enough that adding a device_sync inside the lambda is fine, and I believe that's what RunGen does.
The benchmark function in halide_benchmark.h does not do any synchronization for GPU filters; thus, for filters that return a GPU device buffer, it times only how long it takes to queue the kernel requests. The design of the harness means one cannot insert explicit synchronization in the right place.
Possible resolutions include: documenting the behavior and leaving it as is; adding an extra argument that is a buffer to call halide_device_sync on before ending the timing loop (if non-null), as sketched below; or adding a general function-object argument to do arbitrary synchronization inside the timing loop.
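A minimal sketch of the buffer-argument option (benchmark_with_sync is a hypothetical name, not part of the header):

```cpp
#include <chrono>
#include "HalideRuntime.h"

// Hypothetical: time `op` over `iterations` runs; if sync_buf is non-null,
// wait for its queued device work to finish before stopping the clock.
template<typename F>
double benchmark_with_sync(int iterations, F op, halide_buffer_t *sync_buf) {
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; i++) op();
    if (sync_buf) halide_device_sync(nullptr, sync_buf);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / iterations;
}
```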