Halide and Adams 2019 autoscheduler performance drastically decreases with environment variable KMP_AFFINITY set to granularity=fine,scatter #8538

Open
ivangarcia44 opened this issue Dec 23, 2024 · 2 comments

@ivangarcia44

We are comparing the performance of Halide with Adams 2019 on various sizes of matrix multiplication against another technology.

As part of that comparison we set the following two environment variables:

  • export KMP_AFFINITY=granularity=fine,scatter
  • export OMP_NUM_THREADS=6

Halide's runtime performance drops by roughly 5x when KMP_AFFINITY is set as above, compared with leaving it empty. The OMP_NUM_THREADS environment variable has little effect. The other technology's runtime performance is largely unaffected by either environment variable.

Is it known why the KMP_AFFINITY setting above affects Halide's runtime performance? What would the recommended setting be? Please let me know if there is a link to the recommended environment variable settings for getting the best performance from Halide and Adams 2019.

My machine is an AMD EPYC 74F3 24-Core Processor (x86_64) with 10 CPUs.

Thanks,
Ivan

@abadams
Member

abadams commented Dec 23, 2024

Are you using a custom thread pool? Or are you reusing your openmp threads for Halide's threads somehow? As far as I can tell, KMP_AFFINITY should only affect code using openmp. Maybe all of Halide's threads are getting pinned to the same core as the main thread. I advise doing your Halide tests in a separate process without KMP_AFFINITY set.
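
One way to test the pinning hypothesis is to ask the kernel which CPUs the calling thread is allowed to run on. The following is a minimal sketch, assuming Linux with glibc; the helper names `allowed_cpu_count` and `online_cpu_count` are hypothetical and not part of Halide or OpenMP. On Linux, threads created later inherit the affinity mask of the thread that creates them, so a mask of size 1 on the main thread would be consistent with every worker thread landing on the same core.

```cpp
#include <cassert>
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT (Linux/glibc)
#include <unistd.h>  // sysconf

// Number of CPUs the calling thread is allowed to run on, or -1 on error.
// Hypothetical helper for diagnosis; not part of Halide.
int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}

// Number of CPUs currently online, as reported by the system.
long online_cpu_count() {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

Calling `allowed_cpu_count()` from the main thread just before running the Halide pipeline, once with KMP_AFFINITY set and once without, and comparing the result against `online_cpu_count()` would show whether the affinity setting has narrowed the mask. A container or cpuset can also restrict the mask, so a small value is a hint rather than proof.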

But matrix multiplication is really not a good use case for Adams 2019. You can write down a good schedule for a matrix multiply directly, but it's somewhat fiddly (see test/performance/matrix_multiplication.cpp). Adams 2019 is designed for imaging pipelines, and would have to get extraordinarily lucky to find that matrix multiply schedule. It won't even attempt the rfactor, so any split-k schedules are out, and if you don't add the wrapper Func yourself, it's going to be forced to do a whole separate pass just to zero-initialize the output. It also doesn't use Func::in() so there can't be any staging of inputs, which is sometimes helpful. Scheduling a matrix multiply is unlike scheduling most other code (e.g. register pressure is the key concern for the inner loop, tiled storage actually makes sense for the memory hierarchy, etc).

If I were autoscheduling CPU matrix multiplies in Halide I'd just use the schedule from that test and add autotuning over the split factors (mostly tile_y and tile_k).
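
The autotuning idea can be sketched without Halide: time the same blocked multiply for each candidate pair of tile sizes and keep the fastest. The code below is a plain C++ stand-in, not the schedule from test/performance/matrix_multiplication.cpp; the names `tiled_matmul`, `autotune_tiles`, and `TileChoice` are hypothetical, and the single timed run per candidate is an assumption made for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// C = A * B for row-major n x n matrices, blocked over rows (tile_y)
// and over the reduction dimension (tile_k). Illustration only.
void tiled_matmul(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, int n, int tile_y, int tile_k) {
    std::fill(c.begin(), c.end(), 0.0f);
    for (int y0 = 0; y0 < n; y0 += tile_y) {
        const int y1 = std::min(y0 + tile_y, n);
        for (int k0 = 0; k0 < n; k0 += tile_k) {
            const int k1 = std::min(k0 + tile_k, n);
            for (int y = y0; y < y1; ++y) {
                for (int k = k0; k < k1; ++k) {
                    const float a_yk = a[y * n + k];
                    for (int x = 0; x < n; ++x) {
                        c[y * n + x] += a_yk * b[k * n + x];
                    }
                }
            }
        }
    }
}

struct TileChoice {
    int tile_y;
    int tile_k;
    double seconds;
};

// Exhaustive search over candidate tile sizes for both dimensions.
// Each candidate is timed once; a real tuner would repeat and take the minimum.
TileChoice autotune_tiles(int n, const std::vector<int>& candidates) {
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<float> a(count, 1.0f), b(count, 2.0f), c(count, 0.0f);
    TileChoice best{0, 0, std::numeric_limits<double>::infinity()};
    for (int ty : candidates) {
        for (int tk : candidates) {
            const auto t0 = std::chrono::steady_clock::now();
            tiled_matmul(a, b, c, n, ty, tk);
            const auto t1 = std::chrono::steady_clock::now();
            const double s = std::chrono::duration<double>(t1 - t0).count();
            if (s < best.seconds) {
                best = TileChoice{ty, tk, s};
            }
        }
    }
    return best;
}
```

With Halide, the same loop would presumably rebuild and time the pipeline for each pair of split factors rather than call a hand-written kernel; the search structure stays the same.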

@ivangarcia44
Author

Sorry for the late reply. I am not using a custom thread pool. In the experiments where we found this, we were executing raw, independent Halide code on a Debian Linux AMD machine:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
CPU(s): 10
On-line CPU(s) list: 0-9
Vendor ID: AuthenticAMD
Model name: AMD EPYC 74F3 24-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 10

The experiments were done in isolation. Two other colleagues repeated them on their machines and got the same result.

We are now running the Halide experiments without setting the KMP_AFFINITY environment variable. Thank you for pointing out what to expect for matrix multiplication. We are focusing instead on computer vision and image processing applications (e.g., CNNs, edge detection), since that is what Halide was designed for, as you pointed out.

Thank you for the quick response and information on this!
