Question on performance #3403
Edit: the numba function always returns the last element (why?). The following works, and takes 160 ms, whether accessing the awkward array or the contained numpy:

```python
@numba.njit(nogil=True, cache=True)
def mymax(x):
    m = -np.inf
    for v in x:
        m = max(v, m)
    if not np.isfinite(m):
        return np.nan
    return m
```
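A minimal usage sketch of the above, assuming Awkward's Numba integration is available; the array construction here is an assumption, not from the thread:

```python
import numpy as np
import awkward as ak

data = np.random.rand(10_000)
arr = ak.Array(data)  # a flat Awkward array wrapping the NumPy buffer

# Both calls run the same compiled loop; the thread reports ~160 ms either
# way on the issue's 130M-point data.
mymax(arr)   # iterate the Awkward array directly inside Numba
mymax(data)  # or the contained NumPy array
```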
Hi @martindurant,
I noticed this difference in performance between high-level ak arrays and numpy arrays as well (also without arrow). For vector I added a PR that automatically uses [...]. I'm still in the process of understanding why there's such a difference between applying a kernel to a high-level ak array and directly to the leaf array(s). It's high on my priority list and something I want to investigate right after the VirtualArray PR. I should mention, though, that awkward array does much more metadata processing than numpy (attrs, parameters, ...). Also, the dispatch happens from [...]:

```python
import numpy as np
import awkward as ak

# numpy
x = np.ones((10_000,))

%timeit np.max(x)
# 1.55 μs ± 3.38 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# ----------
# awkward

# ak.max - highlevel function
%timeit ak.max(x)
# 107 μs ± 285 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# ak.max - only: reducer + layout traversal
ak_max = ak._reducers.Max(None)
layout = ak.to_layout(x)

%timeit ak._do.reduce(layout, ak_max, axis=None, mask=True, keepdims=False, behavior=None)
# 77 μs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# ak.max - only: kernel application
ak_max = ak._reducers.Max(None)
layout = ak.to_layout(x)
parents = ak.index.Index64.zeros(layout.length, layout.backend.index_nplike)
starts = ak.index.Index64.zeros(1, layout.backend.index_nplike)
shifts = None

%timeit ak_max.apply(layout, parents, starts, shifts, 1)
# 44.2 μs ± 1e+03 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

So what can we learn from that?
I'm convinced we can reduce the time spent in Python; this is what I want to investigate soon. However, I think we won't ever reach numpy speed, given that awkward's kernels are not as tuned as the numpy ones (and of course because they need to be able to deal with ragged arrays). I also don't think we can, in general, get rid of the time spent allocating all the needed intermediate arrays (parents, starts, the output for the kernel to write into). The Python overhead is a constant, independent of the array size, so the larger the arrays, the more negligible it becomes. Runtime differences for very large arrays are primarily attributable to how fast the kernels are and how long it takes to create all the intermediates in the awkward case.
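To make the intermediate-allocation point concrete, a rough sketch (my own illustration, not a measurement from this thread) of the buffers allocated before the kernel even runs, for the 10k-element axis=None case above:

```python
import numpy as np

n = 10_000
%timeit np.zeros(n, dtype=np.int64)    # `parents`: one int64 per input element
%timeit np.zeros(1, dtype=np.int64)    # `starts`: a single slot for axis=None
%timeit np.empty(1, dtype=np.float64)  # output buffer the kernel writes into
```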
Given that the function call overhead in Python is ~0.1 µs (getting better with newer Python versions...), we probably can't hope to get close to 1 µs in anything we do (how does numpy??) and should probably increase the data size to something more realistic. ak.max(axis=None) should not depend on the raggedness, since it operates only on the leaf. It might need to care about None-ness, though. Do such specialisations not exist?
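The ~0.1 µs call-overhead figure is easy to check in the same %timeit style used above (a sketch, not from the thread):

```python
def noop():
    pass

%timeit noop()
# roughly 50-100 ns per call, depending on the Python version
```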
```python
import numpy as np
import awkward as ak

x = np.ones(1_000_000)

# using `np.zeros` for `ak.index.Index64.zeros`:
%timeit ak.max(x)
# 3.16 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# using `np.empty & [...]=0` for `ak.index.Index64.zeros`:
%timeit ak.max(x)
# 1.69 ms ± 28.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
Edit: this is wrong.
I have to give it a closer look next week, but we might not do such shortcuts/optimizations right now.
I would have expected the opposite for zeros versus empty.
I don't even know where an index enters into this - but let's pick it up next week.
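For context on the zeros-versus-empty surprise: `np.zeros` can obtain already-zeroed pages from the OS (calloc-style) and defer the page faults to first use, while `np.empty` plus an explicit fill touches every page immediately, so a benchmark that allocates but never writes can look deceptively fast. A sketch of the two strategies being compared (function names are mine):

```python
import numpy as np

def index_zeros_calloc(n):
    # large allocations may come back lazily zeroed by the OS
    return np.zeros(n, dtype=np.int64)

def index_zeros_fill(n):
    # malloc + explicit fill: every page is touched up front
    out = np.empty(n, dtype=np.int64)
    out[...] = 0
    return out

%timeit index_zeros_calloc(1_000_000)
%timeit index_zeros_fill(1_000_000)
```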
My current understanding is that at small array sizes there's a non-negligible overhead from Python, but at larger sizes it's dominated by intermediates and slower kernels compared to NumPy.
I ran a test whereby I replaced [...]. @martindurant you're correct that we do not perform such specialisations.
Hi @agoose77, not sure if you have a typo, but [...]
I'm comparing 308E-6 vs 410E-3, a ~1E3 order-of-magnitude difference. If I run this function and loop over the number of kernel iterations, we see the following graph:

[graph: runtime vs. number of kernel iterations]

It looks like the intrinsic overhead here is about 220 ms; as you point out, memory allocation of [...]. My point (echoing @martindurant) is that if we specialise [...]. Furthermore, our actual "kernel" execution speed is roughly 130 ms in this example, which is nearly double NumPy's 76 ms. With the no-parents kernel, we could get closer to this by using SIMD intrinsics.
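To illustrate the specialisation being discussed, a sketch in Numba (not Awkward's actual C++ kernels) contrasting the general parents-based reduction with a no-parents axis=None one:

```python
import numpy as np
import numba

@numba.njit
def max_with_parents(data, parents, num_groups):
    # general ragged form: one running maximum per output group,
    # routed through the `parents` index
    out = np.full(num_groups, -np.inf)
    for i in range(len(data)):
        p = parents[i]
        if data[i] > out[p]:
            out[p] = data[i]
    return out

@numba.njit
def max_no_parents(data):
    # axis=None specialisation: no parents/starts arrays, a single running
    # maximum, which a compiler can vectorise far more easily
    m = -np.inf
    for v in data:
        if v > m:
            m = v
    return m
```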
But do we need so many of them for the axis=0 case? For an array with type [...]
I don't know, but I suspect that allocations vs. execution is probably even more unfavourable on GPUs.
I have found, very roughly, that numba achieves the former number, at least for the naive algorithm I would write for max() and similar. That's not bad! I wonder if there are scenarios where numba beats the C++ kernels. At least they are a whole lot easier to write (for me).
@martindurant it is definitely an interesting idea to use Numba to write kernels, and we have periodically thought about that UX.
Are you actually interested in [...]?
I think we're on the same page here: [...]
Obviously, I am interested in all! I'm just trying to understand what all these arrays are doing.
We can also fast-path [...]
Ah, my reply came across a bit terse -- I was thinking more about the ease of optimising axis=-1 vs. axis=0.
I would tentatively suggest that axis=-1 and axis=None will be the most common cases.
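For concreteness, the two cases being called common:

```python
import awkward as ak

ragged = ak.Array([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])

ak.max(ragged, axis=-1)    # per-innermost-list maxima: [2.0, 3.0, 6.0]
ak.max(ragged, axis=None)  # one scalar over the whole leaf: 6.0
```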
Version of Awkward Array
2.7.4
Description and code to reproduce
numpy 1.26.4
pyarrow 19.0.0
The origin of the data I will use here is not really important, but for reference, it is:
1.9 GB of points in Feather v2 format.
130M points. The "geometry" column has "x" and "y" fields, both float64.
Issue 1
(the lesser issue)
Depending on how I convert the data, I get different layouts: [...]
Here, the second variant is what you should get - we know there are no NULLs. If you don't select "x", you see UnmaskedArrays even for the first route.
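A hedged sketch of how the two layouts can be inspected (the filename is a placeholder; the issue's actual conversion code is not shown above):

```python
import awkward as ak
import pyarrow.feather as feather

table = feather.read_table("points.feather")  # placeholder path
arr = ak.from_arrow(table)

arr.layout                   # route 1: option-type / UnmaskedArray wrappers
arr["geometry", "x"].layout  # selecting the leaf may drop the UnmaskedArray
```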
Issue 2
Doing some timings: [...]
And with a handwritten simple numba kernel: [...]
we get: [...]
So, my question is: how can we avoid the >600ms for this operation while maintaining the awkward API? Am I seeing some kind of weird caching from the many original chunks of the arrow data?
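One workaround in the spirit of the discussion above, a sketch that bypasses the high-level dispatch and reduces the leaf buffer directly (`points` is a placeholder for the issue's array):

```python
import numpy as np
import awkward as ak

layout = ak.to_layout(points["geometry", "x"])
while hasattr(layout, "content"):  # peel UnmaskedArray/indexed wrappers
    layout = layout.content
np.max(np.asarray(layout.data))    # plain NumPy reduction on the flat float64 leaf
```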