
halide_benchmark.h does not work with GPU filters. #2278

Open
zvookin opened this issue Aug 18, 2017 · 5 comments
@zvookin (Member) commented Aug 18, 2017

The benchmark function in halide_benchmark.h does not do any synchronization for GPU filters, so for filters that return a GPU device buffer it only times how long it takes to enqueue the kernel launches. The design of the harness means one cannot insert explicit synchronization in the right place.

Possible resolutions include: documenting the behavior and leaving it as is; adding an extra buffer argument on which to call halide_device_sync (if non-null) before ending the timing loop; or adding a general function object argument to perform arbitrary synchronization inside the timing loop.
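
For illustration, a minimal sketch of the pitfall (assuming Halide::Tools::benchmark from tools/halide_benchmark.h, a GPU-scheduled pipeline p, and an output buffer gpu_out; the names here are illustrative, not from the issue):

  // realize() returns once the kernels are enqueued, so the clock can
  // stop while the GPU is still running: this times queueing, not the
  // filter itself.
  double t = Halide::Tools::benchmark(10, 10, [&]() {
      p.realize(gpu_out);
  });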

@steven-johnson (Contributor)

FWIW, extant uses of this with GPU code seem to insert manual device_sync() calls into the lambda (e.g. test/performance/wrap.cpp)... which is suboptimal-but-maybe-adequate as long as iterations=1.
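
That workaround looks roughly like this (a sketch of the pattern described above, not copied from test/performance/wrap.cpp; p and gpu_out as in the earlier sketch, with gpu_out holding GPU state):

  // device_sync() inside the lambda makes the timed region cover kernel
  // completion; with iterations > 1, every iteration pays the sync cost.
  double t = Halide::Tools::benchmark(10, 1, [&]() {
      p.realize(gpu_out);
      gpu_out.device_sync();
  });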

@zvookin (Member, Author) commented Aug 21, 2017 via email

@steven-johnson (Contributor)

A handful of use cases already explicitly call device_sync() as part of the lambda; I'm presuming that what you are requesting is the flexibility to call device_sync() in a per-benchmark-specific way.

It would be simple enough to revise the API to be something like

  double benchmark(int samples, int iterations, F op, G device_sync = nullptr);

where device_sync (if specified) will be called after every set of iterations; thus we could do something like

  benchmark(10, 10, [&] { pipeline(in, out); }, [&]{ out.device_sync(); });

...that still makes the situation a bit ticklish and error-prone, though, as the caller has to set things up correctly, and it's easy to do wrong.
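
For concreteness, here is a hypothetical sketch of what that overload could look like, modeled on the sample/iteration loop style of halide_benchmark.h rather than its actual implementation:

  #include <algorithm>
  #include <chrono>
  #include <limits>

  template<typename F, typename G>
  double benchmark(int samples, int iterations, F op, G device_sync) {
      double best = std::numeric_limits<double>::infinity();
      for (int s = 0; s < samples; s++) {
          auto t0 = std::chrono::high_resolution_clock::now();
          for (int i = 0; i < iterations; i++) {
              op();
          }
          device_sync();  // one sync per sample, inside the timed region
          auto t1 = std::chrono::high_resolution_clock::now();
          double per_iter =
              std::chrono::duration<double>(t1 - t0).count() / iterations;
          best = std::min(best, per_iter);
      }
      return best;  // best per-iteration time across all samples
  }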

FWIW: I'm not wild about the design of this benchmarking utility in the first place, as choosing useful values for samples and iterations is a bit ticklish; I'm not sure of the original design rationale, or whether it was just a quick-n-dirty utility whose usage has expanded over time.

A purpose-built benchmarking utility that knows about Halide stages would likely be a lot more useful, especially for GPU code. I'll work on a proposal for such a thing.

@shoaibkamil (Contributor)

Even though it's not ideal, I think this API would be really useful. I'm currently hacking up a version of this for a paper submission, and I think it's useful to be able to answer some performance questions about GPU stages:

  1. How long does it take to compute the stage?
  2. How long does the sync take?

I think a simple two-parameter regression using this API would tell us both; see the sketch below.
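
With one sync per sample, a sample's total time is roughly total(n) = n * kernel_time + sync_time, so timing at two iteration counts pins down both unknowns. A sketch building on the hypothetical benchmark() overload above (op and sync are assumed callables):

  // benchmark() returns per-iteration time, so total(n) = n * t(n).
  double t1 = benchmark(10, 1, op, sync);  // ~ kernel_time + sync_time
  double t8 = benchmark(10, 8, op, sync);  // ~ kernel_time + sync_time / 8
  double kernel_time = (8 * t8 - 1 * t1) / (8 - 1);
  double sync_time = 1 * t1 - kernel_time;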

In addition, I imagine that an autotuning loop using the GPU, based on RunGen, is going to be necessary to extend the autoscheduler to work well for GPUs.

@abadams (Member) commented May 20, 2019

All of those pipelines have been large enough that adding a device_sync inside the lambda is fine, and I believe that's what RunGen does.
