
Question on performance #3403

Open
martindurant opened this issue Feb 21, 2025 · 17 comments
Labels
performance Works, but not fast enough or uses too much memory

Comments

@martindurant
Contributor

Version of Awkward Array

2.7.4

Description and code to reproduce

numpy 1.26.4
pyarrow 19.0.0

The origin of the data I will use here is not really important, but for reference, it is:
1.9 GB of points in Feather v2 format.

table = pyarrow.feather.read_table("microsoft-buildings-point.arrow")

130M points. The "geometry" column has x, y fields, both float64.

Issue 1

(the lesser issue)

Depending on how I convert the data, I get different layouts:

>>> ak.from_arrow(table)["geometry", "x"].layout
<IndexedOptionArray len='129735970'>
    <index><Index dtype='int64' len='129735970'>
        [        0         1         2 ... 129735967 129735968 129735969]
    </Index></index>
    <content><NumpyArray dtype='float64' len='129735970'>
        [ -84.95972352  -84.95973298  -84.9599375  ... -111.04598275
         -111.047405   -111.0478207 ]
    </NumpyArray></content>
</IndexedOptionArray>

>>> ak.from_arrow(table["geometry"])["x"].layout
<UnmaskedArray len='129735970'>
    <content><NumpyArray dtype='float64' len='129735970'>
        [ -84.95972352  -84.95973298  -84.9599375  ... -111.04598275
         -111.047405   -111.0478207 ]
    </NumpyArray></content>
</UnmaskedArray>

Here, the second variant is what you should get, since we know there are no NULLs. If you don't select "x", you see UnmaskedArrays even for the first route.
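For reference, one way to confirm on the Arrow side that the option type really is redundant (a minimal sketch; it only uses public pyarrow APIs and the table loaded above):

geom_type = table.schema.field("geometry").type  # pyarrow StructType
print(geom_type.field("x").nullable)             # schema-level nullability flag
print(table["geometry"].null_count)              # actual number of NULLs in the column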

Issue 2

Doing some timings:

>>> x = ak.from_arrow(table["geometry"])["x"] # the unmasked variant
>>> np.max(x)
656ms
>>> ak.max(x)
666ms, OK, so dispatch does what we expect
>>> %timeit np.max(x.layout.content.data)
18ms, well that is just a bit faster
>>> %timeit np.nanmax(x.layout.content.data)
20ms, in case of NaN (since we should have no NULLs)
>>> np.nanmax(np.where(True, x.layout.content.data, np.nan))
176ms, maybe this is what awkward actually does?

And with a handwritten simple numba kernel:

@numba.njit(nogil=True, cache=True)
def mymax(x):
    max = -np.inf
    for v in x:
        if np.isfinite(v) and v > max:
            max = v
    return v

we get

>>> mymax(x)
40.3ms
>>> mymax(x.layout.content.data)
20.2ms

So, my question is: how can we avoid the >600ms for this operation while maintaining the awkward API? Am I seeing some kind of weird caching from the many original chunks of the arrow data?

@martindurant added the performance label on Feb 21, 2025
@martindurant
Contributor Author

Edit: the numba function above always returns the last element (it returns v, the loop variable, rather than the running max). The following works, and takes 160 ms whether it is given the awkward array or the contained numpy array:

@numba.njit(nogil=True, cache=True)
def mymax(x):
    m = -np.inf
    for v in x:
        m = max(v, m)
    if not np.isfinite(m):
        return np.nan
    return m

@pfackeldey
Collaborator

pfackeldey commented Feb 22, 2025

Hi @martindurant ,

So, my question is: how can we avoid the >600ms for this operation while maintaining the awkward API? Am I seeing some kind of weird caching from the many original chunks of the arrow data?

I noticed this difference in performance between high-level ak arrays and numpy arrays as well (also without arrow). For vector I added a PR that automatically uses ak.transform to "push down" all behavior methods & properties so that they act on the "leaf" numpy arrays instead of the high-level ak arrays; some functions were sped up by more than 20x by that. My suspicion is that it is related to layout traversals, but I can't say for sure yet.
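
For illustration, a minimal sketch of the push-down idea (not the actual vector PR; double_leaves is just an example transformation):

import awkward as ak

def double_leaves(layout, **kwargs):
    # Rewrite only the leaf NumpyArray nodes; ak.transform rebuilds the
    # surrounding list/option structure around the returned content.
    if layout.is_numpy:
        return ak.contents.NumpyArray(layout.data * 2.0)

arr = ak.Array([[1.0, 2.0], [], [3.0]])
print(ak.to_list(ak.transform(double_leaves, arr)))  # [[2.0, 4.0], [], [6.0]]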

I'm still in the process of understanding why there's such a difference between applying a kernel to a high-level ak array and applying it directly to the leaf array(s). It's high on my priority list and something I want to investigate right after the VirtualArray PR.

I should mention, though, that awkward array does much more metadata processing than numpy (attrs, parameters, ...). Also, the dispatch happens from np.* to ak.*, and the ak.* functions are generally not expected to be as efficient and tuned as the corresponding numpy ones. Let's break it down for ak.max:

import numpy as np
import awkward as ak

# numpy
x = np.ones((10_000,))

%timeit np.max(x)
# 1.55 μs ± 3.38 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# ----------

# awkward 

# ak.max - highlevel function
%timeit ak.max(x)
# 107 μs ± 285 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# ak.max - only: reducer + layout traversal
ak_max = ak._reducers.Max(None)
layout = ak.to_layout(x)

%timeit ak._do.reduce(layout, ak_max, axis=None, mask=True, keepdims=False, behavior=None)
# 77 μs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# ak.max - only: kernel application
ak_max = ak._reducers.Max(None)
layout = ak.to_layout(x)

parents = ak.index.Index64.zeros(layout.length, layout.backend.index_nplike)
starts = ak.index.Index64.zeros(1, layout.backend.index_nplike)
shifts = None

%timeit ak_max.apply(layout, parents, starts, shifts, 1)
# 44.2 μs ± 1e+03 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So what can we learn from that?

  • The awkward kernel is slower than the numpy one: 1.55 μs vs 44.2 μs (OK, there's still a little bit of Python involved in the ak_max.apply call; maybe I need to break it down further?).
  • The layout traversal almost doubles the runtime from 44.2 μs to 77 μs (recursive layout traversal in Python). This is surprising to me, because it's just a flat 1D array. However, during this one-level traversal awkward still has to deal with all the metadata, so maybe it's purely that?
  • The remaining time (77 μs to 107 μs) is spent purely in metadata processing, high-level array (un)wrapping, dispatching, etc. (all Python).

I'm convinced we can reduce the time spent in Python; this is what I want to investigate soon. However, I think we won't ever reach numpy speed, given that awkward's kernels are not as tuned as the numpy ones (and of course because they need to be able to deal with ragged arrays). Also, I don't think we can in general get rid of the time spent allocating all the needed intermediate arrays (parents, starts, the output for the kernel to write into).

The Python overhead is constant and independent of the array size, so the larger the arrays, the more negligible it becomes. Runtime differences for very large arrays are primarily attributable to how fast the kernels are and how long it takes to create all the intermediates in the awkward case.

@martindurant
Contributor Author

Given that the function call overhead in Python is ~0.1 μs (getting better with Python versions...), we probably can't hope to get close to 1 μs in anything we do (how does numpy??) and should probably increase the data size to something more realistic.

ak.max(axis=None) should not depend on the raggedness, since it operates only on the leaf. It might need to care about None-ness, though. Do such specialisations not exist?
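
Something like this is the kind of shortcut I mean (just a sketch, not a proposal for the actual implementation; it leans on ak.flatten and plain numpy):

import numpy as np
import awkward as ak

def max_axis_none(array):
    # Flatten across all levels (this also drops missing values), then hand
    # the flat buffer straight to NumPy; no parents/starts needed.
    flat = ak.to_numpy(ak.flatten(array, axis=None))
    return np.max(flat) if len(flat) else None

print(max_axis_none(ak.Array([[1.0, 5.0], [], [3.0, None]])))  # 5.0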

@pfackeldey
Collaborator

pfackeldey commented Feb 22, 2025

One thing I just found out...:

If I replace the ak.index.Index64.zeros implementation with an empty allocation followed by setting all values to 0 with x[...] = 0, things speed up significantly:

import numpy as np
import awkward as ak

x = np.ones(1_000_000)

# using `np.zeros` for `ak.index.Index64.zeros`:
%timeit ak.max(x)
# 3.16 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# using `np.empty & [...]=0` for `ak.index.Index64.zeros`:
%timeit ak.max(x)
# 1.69 ms ± 28.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I wasn't aware of that difference in time related to array creation routines (and how expensive these intermediates are).

edit: this is wrong

@pfackeldey
Collaborator

Given that the function call overhead in Python is ~0.1 μs (getting better with Python versions...), we probably can't hope to get close to 1 μs in anything we do (how does numpy??) and should probably increase the data size to something more realistic.

ak.max(axis=None) should not depend on the raggedness, since it operates only on the leaf. It might need to care about None-ness, though. Do such specialisations not exist?

I have to give it a closer look next week, but we might not do such shortcuts/optimizations right now.

@martindurant
Contributor Author

I would have expected the opposite for zeros versus empty

In [62]: %timeit np.zeros(1_000_000)
541 μs ± 57.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [63]: %timeit x = np.empty(1_000_000); x[:] = 0
1.02 ms ± 65.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I don't even know where an index enters into this - but let's pick it up next week.

@pfackeldey
Collaborator

My current understanding is that at small array sizes there's a non-negligible overhead from Python, but at larger sizes it's dominated by intermediates and slower kernels compared to NumPy.
We might be able to reduce some overhead, and maybe find some shortcuts to avoid intermediates in certain scenarios.

@agoose77
Collaborator

I ran a test whereby I replaced ak._do.reduce with ak._do.reduce = lambda *a, **b: ak.to_layout(1). This has a %timeit overhead of 308±7 μs on my machine, vs 410±11 ms for the original operation; the overhead is negligible at this scale.
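
Roughly, the measurement looked like this (a sketch; x is the large array from the original post, and the stub is the same lambda as above):

import awkward as ak

_original_reduce = ak._do.reduce
ak._do.reduce = lambda *a, **b: ak.to_layout(1)

%timeit ak.max(x)  # only the dispatch, traversal, and (un)wrapping layers remain

ak._do.reduce = _original_reduce  # restore the real reducer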

@martindurant you're correct that we do not perform axis=None specialisations right now. We're at a point where I think that would be sensible. See also #1250 (less perf, more correctness).

@pfackeldey
Collaborator

pfackeldey commented Feb 24, 2025

Hi @agoose77,
I think the performance of reducers is mainly limited by the creation of intermediate arrays, in particular this one: https://github.com/scikit-hep/awkward/blob/main/src/awkward/_do.py#L253. This overhead scales with the array size.

Not sure if you have a typo, but 308±7 μs vs 410±11 ms is not a negligible overhead I'd say. It's 3 orders of magnitude different?

@agoose77
Collaborator

agoose77 commented Feb 24, 2025

Not sure if you have a typo, but 308±7 μs vs 410±11 ms is not a negligible overhead I'd say. It's 3 orders of magnitude different?

I'm comparing 308E-6 s vs 410E-3 s, a factor of ~1E3 (three orders of magnitude).

If I run the following function and vary the number of kernel iterations:

%%timeit self=ak._reducers.Max(None); array=ak.from_arrow(table['geometry']).x.layout.content

starts = ak.index.Index64.zeros(1, array.backend.index_nplike)
parents = ak.index.Index64.zeros(array.length, array.backend.index_nplike)
shifts = None
outlength = 1

# View array data in kernel-supported dtype
kernel_array_data = array.data.view(self._dtype_for_kernel(array.dtype))
result = array.backend.nplike.empty(
    self._length_for_kernel(array.dtype.type, outlength),
    dtype=kernel_array_data.dtype,
)
# Wrap the raw kernel invocation so it can be repeated
def kernel():
    array.backend.maybe_kernel_error(
        array.backend[
            "awkward_reduce_max",
            result.dtype.type,
            kernel_array_data.dtype.type,
            parents.dtype.type,
        ](
            result,
            kernel_array_data,
            parents.data,
            parents.length,
            outlength,
            self._identity_for(result.dtype),
        )
    )

for i in range(1):  # vary this count to scan the number of kernel calls
    kernel()

out = ak.contents.NumpyArray(result.view(array.dtype), backend=array.backend)

I get this graph:

[plot: %%timeit runtime vs. number of kernel iterations]

It looks like the intrinsic overhead here is about 220 ms; as you point out, memory allocation of parents (and assignment to zero) is a significant overhead of this function.

My point (echoing @martindurant) is that if we specialise axis=None, then we don't need parents!

Furthermore, our actual "kernel" execution time is roughly 130 ms in this example, which is nearly double NumPy's 76 ms. With a no-parents kernel, we could get closer to that by using SIMD intrinsics.

@martindurant
Contributor Author

if we specialise axis=None, then we don't need parents

But do we need so many of them for the axis=0 case? For an array with type 3 * var * int64, @pfackeldey showed that max(axis=0) required:

  • 2 calls to zeros (1 with size=1)
  • 11 calls to empty (4 with size=1)

I don't know, but I suspect that the balance of allocations vs. execution is probably even more unfavourable on GPUs.
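
For anyone who wants to see those counts for themselves, something like this is a rough way to check (a sketch: it simply wraps NumPy's allocators around one reduction, so the exact numbers will depend on the layout and the awkward version):

import numpy as np
import awkward as ak
from unittest import mock

arr = ak.Array([[1, 2, 3], [], [4, 5]])  # type: 3 * var * int64

# Count how many allocation calls one reduction makes on the CPU backend.
with mock.patch("numpy.zeros", wraps=np.zeros) as zeros, \
     mock.patch("numpy.empty", wraps=np.empty) as empty:
    ak.max(arr, axis=0)

print("zeros calls:", zeros.call_count, "empty calls:", empty.call_count)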

Furthermore our actual "kernel" execution speed is roughly 130 ms in this example, which is nearly double NumPy's 76ms.

I have found, very roughly, that numba achieves the former number, at least for the naive algorithm I would write for max() and similar. That's not bad! I wonder if there are scenarios where numba beats the C++ kernels. At least they are a whole lot easier to write (for me).

@agoose77
Collaborator

@martindurant it is definitely an interesting idea to use Numba to write kernels, and we have periodically thought about that UX.

But do we need so many of them for the axis=0 case?

Are you actually interested in axis=0, or axis=-1?

@pfackeldey
Collaborator

I think we're on the same page here:
The expensive overhead is this one specific parents allocation. All the others are cheap: they're either small or empty allocations. We can shortcut that explicitly for axis=None (and mask=False), but not for other combinations (as far as I can see).

@martindurant
Contributor Author

Are you actually interested in axis=0, or axis=-1?

Obviously, I am interested in all! I'm just trying to understand what all these arrays are doing.

@agoose77
Collaborator

We can also fast-path axis=N for all-regular arrays, but I don't think that's worth maintaining given that we normally have at least one var axis.

@agoose77
Collaborator

Are you actually interested in axis=0, or axis=-1?

Obviously, I am interested in all! I'm just trying to understand what all these arrays are doing.

Ah, my reply came across a bit terse; I was thinking more about the ease of optimising axis=-1 vs the 0 axis.

@martindurant
Contributor Author

I would tentatively suggest that axis=-1 and axis=None will be the most common cases.
