
Improve performance of GCXS dot ndarray #643

Merged: 7 commits into pydata:main, Feb 20, 2024

Conversation

jcapriot (Contributor)

Improves the memory-access pattern for GCXS arrays dot ndarrays.

I was running a few benchmarks of the GCXS class against scipy's csr/csc arrays and noticed that GCXS was significantly slower, so I was curious where the biggest difference was, since their structures shouldn't be fundamentally different. Looking at the csr_dot_ndarray and csc_dot_ndarray code, I realized that the memory-access pattern of the arrays was inefficient.
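For context, the favorable access pattern can be sketched in pure Python/NumPy. This is a simplified stand-in for the numba kernels, not the actual sparse implementation; `csr_dot_dense` and its argument names are illustrative. The point is that each nonzero A[i, k] scales a contiguous row of B into a contiguous row of the output, so the innermost work streams through memory rather than striding down columns:

```python
import numpy as np

def csr_dot_dense(indptr, indices, data, B):
    """Row-major-friendly sketch of CSR @ dense.

    For each nonzero A[i, k] (k = indices[jj], value = data[jj]),
    accumulate data[jj] * B[k, :] into out[i, :]. Both slices are
    contiguous rows in C order, so reads and writes are sequential.
    """
    n_rows = len(indptr) - 1
    out = np.zeros((n_rows, B.shape[1]), dtype=np.result_type(data, B))
    for i in range(n_rows):
        for jj in range(indptr[i], indptr[i + 1]):
            out[i, :] += data[jj] * B[indices[jj], :]
    return out
```

A column-strided variant (accumulating down out[:, j] for each column j in the inner loop) touches memory with a large stride on every step, which is the kind of inefficiency described above.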

Here are the relevant timings:

import scipy.sparse as sp
import sparse
import numpy as np
state = np.random.default_rng(seed=1337)
A_gc0s = sparse.random((9000, 10000), format='gcxs', random_state=state).change_compressed_axes((0,))
A_gc1s = A_gc0s.change_compressed_axes((1,))
A_csr = A_gc0s.to_scipy_sparse()
A_csc = A_gc1s.to_scipy_sparse()

n_vecs = 20
n_broadcasted = 5

v = state.random((A_gc0s.shape[1], n_vecs))
u = state.random((n_broadcasted, A_gc0s.shape[1], n_vecs))
x = state.random((n_vecs, A_gc0s.shape[0]))

SciPy timing:

%timeit A_csr @ v
18.1 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit A_csc @ v
19 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x @ A_csr
21.9 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_csc
21 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Sparse timing:

# trigger numba compilation before timing
A_gc0s @ v
A_gc1s @ v
x @ A_gc0s
x @ A_gc1s
A_gc0s @ u
A_gc1s @ u;

Timing on main:

%timeit A_gc0s @ v
58.4 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc1s @ v
179 ms ± 7.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc0s @ u
336 ms ± 44.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit A_gc1s @ u
1.06 s ± 80 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit x @ A_gc0s
126 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_gc1s
55.7 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing on PR:

%timeit A_gc0s @ v
19.8 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc1s @ v
23.4 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc0s @ u
65 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc1s @ u
77 ms ± 5.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_gc0s
32.8 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_gc1s
57.6 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The left-ndarray dot GCXS case is still a little slower than I'd like, but I couldn't quite see where to dive in, as it looks like it's doing what it should for the matvec operation (essentially computing (GCXS.T @ x.T).T).
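The transpose identity mentioned above can be checked with plain NumPy. This is a toy demonstration of the math, not the library code:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))   # stand-in for the sparse matrix
x = rng.random((3, 5))   # dense left operand

# Left multiplication can be rewritten as a transposed right
# multiplication: x @ A == (A.T @ x.T).T, so a library that only
# has a fast "sparse @ dense" kernel can still serve "dense @ sparse".
left = x @ A
via_transpose = (A.T @ x.T).T
assert np.allclose(left, via_transpose)
```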

One other thing: for 1D ndarrays, numba doesn't seem to optimize these kernels as well as it could (judging by benchmarks against scipy csr/csc times a 1D array). There should probably be a branch for 1D ndarrays that uses a version without the loop over the second dimension.
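A minimal sketch of what such a 1D branch might look like, again in plain Python/NumPy rather than the actual numba kernel; `csr_dot_vec` is a hypothetical name:

```python
import numpy as np

def csr_dot_vec(indptr, indices, data, v):
    """Hypothetical 1-D specialization of CSR @ vector.

    Each output entry is a scalar dot product over one row's nonzeros.
    Dropping the inner loop over a second dimension gives the JIT a
    simple scalar reduction, which it can typically optimize better
    than the 2-D kernel applied to an (n, 1) array.
    """
    n_rows = len(indptr) - 1
    out = np.zeros(n_rows, dtype=np.result_type(data, v))
    for i in range(n_rows):
        acc = 0.0
        for jj in range(indptr[i], indptr[i + 1]):
            acc += data[jj] * v[indices[jj]]
        out[i] = acc
    return out
```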

hameerabbasi
hameerabbasi previously approved these changes Feb 17, 2024

Thank you for the changes! Would you be willing to write benchmarks? (Examples are in the benchmarks directory; see https://asv.readthedocs.io/en/stable/writing_benchmarks.html.)

@jcapriot (Contributor Author)

> Thank you for the changes! Would you be willing to write benchmarks? (Examples are in the benchmarks directory; see https://asv.readthedocs.io/en/stable/writing_benchmarks.html.)

Sure!

codecov bot commented Feb 17, 2024

Codecov Report

Merging #643 (c06b832) into main (a1d2081) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #643      +/-   ##
==========================================
- Coverage   90.22%   90.21%   -0.02%     
==========================================
  Files          20       20              
  Lines        3674     3670       -4     
==========================================
- Hits         3315     3311       -4     
  Misses        359      359              

@hameerabbasi hameerabbasi enabled auto-merge (squash) February 20, 2024 07:53
@hameerabbasi hameerabbasi merged commit cb6b604 into pydata:main Feb 20, 2024
12 checks passed
@jcapriot jcapriot deleted the gcsx_ndarray_dot branch February 20, 2024 17:26