-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate possible speedups through Numba #364
Comments
This is a self ping to try to follow up on this in the coming months. |
Looking at a few things in the NumPy backend with @seibert it was seen that since most of the backend is just wrappers directly to NumPy calls we unprobable to get speedup there. However, for things like and there could be speedup but SciPy functions are not yet supported by Numba (they might be in the future). For the moment there is https://github.com/person142/numba_special, but it might just be worth waiting 6 months and seeing if Numba gets time to work on this. |
yes this is what I suspected. One things that might be worth investigating is whether the slow interpolation codes (which are pure python) could actually become faster than the "fast tensorized" versions once passed through numba. I.e. there are a few nested loops that from a discussion with @jpivarski seem to be a good candidate. If that is the case, it's not clear how to proceed. We have settled on the array-computation underpinnings for backend portability (switching from e.g. numpy to tensorflow or pytorch) .. so even if we get a speed up from numba, we'd need to think about how to integrate this into pyhf if it requires breaking the "numpy"-ness/tensorlib character of the codebase |
Yeah, I can look into this. The number of places where we can get speedup from Numba is quite limited, as all of tensorlib, optimizer = get_backend() in all of them. |
Using this script I get
Which isn't too surprising given that I'm at the moment only able to so the overhead here probably just isn't worth it given that I'm doing a small 2-bin example. |
As not even
then I suggest that we just follow this Numba Issue and try again in about 6 months or whenever the Issue gets resolved. |
Random comment from the peanut gallery, but In [1]: import numba
In [2]: import scipy.special as sc
In [3]: import numba_special
In [4]: import numpy as np
In [5]: @numba.njit
...: def poisson_logpdf(n, lam):
...: # More stuff should go here...
...: return n * np.log(lam) - lam - sc.gammaln(n + 1.0)
...:
In [6]: poisson_logpdf(1.0, 1.0)
Out[6]: -1.0 Likely what you are seeing is that it specifically only supports It would be good to add those things (and it's not hard to do), but I haven't had time to do it yet. Feel free to open an issue. (When |
A long time ago, I implemented Here's a link: https://github.com/opendatagroup/augustus/blob/master/trunk/augustus-pmml-library/augustus/core/NumpyInterface.py#L128-L154 It's unfortunately a little obscured by using The thing you have to do to make Numba efficient is to stay in the compiled code block through the big calculation. In your case, I don't know if it's more compute-limited or data throughput-limited. (I would say compute-limited because it's toy MCs?) But either way, you might have to implement your own Or use C code linked through |
thanks @jpivarski -- do you have any type of guidance on what to do wrt vectorized code? A lot of performance critical code in the python ecosystem is using numpy to get C-level performance, and now it seems numba might be able to speed up beyond what you might have achieved with numpy. De-vectorizing all that existing code seems infeasible. Is there some way that this could be automated? |
If you already have a vectorized codebase, then you may be limited by flushing CPU caches where Numpy can't fuse operations. Like y = g(f(x)) where I have heard (*rumor! I've never confirmed it!) that Numpy now fuses some operations like To manually fuse them, you can write @numba.vectorize
def fog(x_i): # function argument takes one element of x
y_i = ...
return y_i # and returns one element of y and if you need @numba.vectorize([numba.float64(numba.float64)]) Now the operation of If we think of the process as operating on a big table (rows are data points, columns are named, typed attributes), there are two ways to de-vectorize: (1) by chunking rows into row groups and (2) by fusing column-at-a-time operations into several-columns-at-a-time operations. Both of these are used to fit into memory constraints: moving data into some memory unit is expensive, so you want to finish everything you're going to do to it before letting it out of the memory unit to get new data in. Pure row-wise processing could either be thought of as "every row is its own row group" or "all operations are a single, fused operation." Columnar processing is relaxing these two constraints as much as memory will allow. In GPU programming, row groups are called "blocks" (you set the block size of every kernel you run) and fusing is putting more steps in a single kernel, rather than defining separate kernels and running them sequentially. In GPU programming, limiting block size (1) is used more than fusing operations (2) because launching a new kernel from C++ is relatively inexpensive. In Python/Numpy, calling a compiled function from Python can be quite expensive, so fusing operations is more useful when you're allowed to do it—when you can have dependencies beyond just Numpy. (Also, in GPU programming, you have complete control over your "cache," the GPU's global memory, but in CPU programming, you don't.) In all of the above, I've assumed that you're data throughput-limited, not compute-limited. Of course, you should profile your code and watch your memory use to find out where the bottlenecks are and see where you're limited by memory movement. (I wonder if you can view hardware counters from Python, to see if you rack up a lot of cache page misses at some step...) |
We have done cProfile at some point, and noted that almost all of the time is spent inside numpy functions for example -- but it's not clear if this included memory movements or not... |
Scumba now exists under the Numba project. |
some studies here: #1173 |
For some belated follow up on this, @HDembinski has recently made a
|
Description
As @lukasheinrich has pointed out in team discussions, it would be interesting to see if it would be possible to get speedup from Numba — from the Numba "5 minute guide" the answer seems to be "probably yes". If pyhf can easily work with Numba then tests should be done to study the speedup.
For some references, there was a DIANA/HEP meeting on Numba.
The text was updated successfully, but these errors were encountered: