
Improve performance of GCXS dot ndarray #643

Merged: 7 commits into pydata:main, Feb 20, 2024

Conversation

jcapriot (Contributor)

Improves the memory-access pattern for GCXS arrays dot ndarrays.

I was running a few benchmarks of the GCXS class against scipy's csr/csc arrays and noticed that GCXS was significantly slower, so I was curious where the biggest difference was, since their structures shouldn't be fundamentally different. Looking at the csr_dot_ndarray and csc_dot_ndarray code, I realized that the memory-access pattern of the arrays was inefficient.
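For context, the favorable access pattern can be sketched in pure Python/NumPy. This is a simplified stand-in for the numba kernels, not the actual sparse implementation; `csr_dot_dense` and its argument names are illustrative. The point is that each nonzero A[i, k] scales a contiguous row of B into a contiguous row of the output, so the innermost work streams through memory rather than striding down columns:

```python
import numpy as np

def csr_dot_dense(indptr, indices, data, B):
    """Row-major-friendly sketch of CSR @ dense.

    For each nonzero A[i, k] (k = indices[jj], value = data[jj]),
    accumulate data[jj] * B[k, :] into out[i, :]. Both slices are
    contiguous rows in C order, so reads and writes are sequential.
    """
    n_rows = len(indptr) - 1
    out = np.zeros((n_rows, B.shape[1]), dtype=np.result_type(data, B))
    for i in range(n_rows):
        for jj in range(indptr[i], indptr[i + 1]):
            out[i, :] += data[jj] * B[indices[jj], :]
    return out
```

A column-strided variant (accumulating down out[:, j] for each column j in the inner loop) touches memory with a large stride on every step, which is the kind of inefficiency described above.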

Here are the relevant timings:

import scipy.sparse as sp
import sparse
import numpy as np
state = np.random.default_rng(seed=1337)
A_gc0s = sparse.random((9000, 10000), format='gcxs', random_state=state).change_compressed_axes((0,))
A_gc1s = A_gc0s.change_compressed_axes((1,))
A_csr = A_gc0s.to_scipy_sparse()
A_csc = A_gc1s.to_scipy_sparse()

n_vecs = 20
n_broadcasted = 5

v = state.random((A_gc0s.shape[1], n_vecs))
u = state.random((n_broadcasted, A_gc0s.shape[1], n_vecs))
x = state.random((n_vecs, A_gc0s.shape[0]))

SciPy timing:

%timeit A_csr @ v
18.1 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit A_csc @ v
19 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x @ A_csr
21.9 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_csc
21 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Sparse timing:

# trigger numba compilation before timing
A_gc0s @ v
A_gc1s @ v
x @ A_gc0s
x @ A_gc1s
A_gc0s @ u
A_gc1s @ u;

Timing on main:

%timeit A_gc0s @ v
58.4 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc1s @ v
179 ms ± 7.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc0s @ u
336 ms ± 44.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit A_gc1s @ u
1.06 s ± 80 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit x @ A_gc0s
126 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_gc1s
55.7 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing on PR:

%timeit A_gc0s @ v
19.8 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc1s @ v
23.4 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc0s @ u
65 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc1s @ u
77 ms ± 5.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_gc0s
32.8 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_gc1s
57.6 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The left-ndarray dot GCXS case is still a little slower than I'd like, but I couldn't quite see where to dive in, as it looks like it's doing what it should for the matvec operation (essentially computing (GCXS.T @ x.T).T).
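The transpose identity mentioned above can be checked with plain NumPy. This is a toy demonstration of the math, not the library code:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))   # stand-in for the sparse matrix
x = rng.random((3, 5))   # dense left operand

# Left multiplication can be rewritten as a transposed right
# multiplication: x @ A == (A.T @ x.T).T, so a library that only
# has a fast "sparse @ dense" kernel can still serve "dense @ sparse".
left = x @ A
via_transpose = (A.T @ x.T).T
assert np.allclose(left, via_transpose)
```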

One other thing: for 1D ndarrays, numba doesn't seem to optimize these kernels as well as it could (judging by benchmarks against scipy csr/csc times a 1D array). There should probably be a branch for 1D ndarrays that uses a version without the loop over the second dimension.
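A minimal sketch of what such a 1D branch might look like, again in plain Python/NumPy rather than the actual numba kernel; `csr_dot_vec` is a hypothetical name:

```python
import numpy as np

def csr_dot_vec(indptr, indices, data, v):
    """Hypothetical 1-D specialization of CSR @ vector.

    Each output entry is a scalar dot product over one row's nonzeros.
    Dropping the inner loop over a second dimension gives the JIT a
    simple scalar reduction, which it can typically optimize better
    than the 2-D kernel applied to an (n, 1) array.
    """
    n_rows = len(indptr) - 1
    out = np.zeros(n_rows, dtype=np.result_type(data, v))
    for i in range(n_rows):
        acc = 0.0
        for jj in range(indptr[i], indptr[i + 1]):
            acc += data[jj] * v[indices[jj]]
        out[i] = acc
    return out
```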

hameerabbasi
hameerabbasi previously approved these changes Feb 17, 2024

Thank you for the changes! Would you be willing to write benchmarks? (Examples are in the benchmarks directory; see https://asv.readthedocs.io/en/stable/writing_benchmarks.html.)

@jcapriot (Contributor Author)

> Thank you for the changes! Would you be willing to write benchmarks? (Examples are in the benchmarks directory; see https://asv.readthedocs.io/en/stable/writing_benchmarks.html.)

Sure!

codecov bot commented Feb 17, 2024

Codecov Report

Merging #643 (c06b832) into main (a1d2081) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #643      +/-   ##
==========================================
- Coverage   90.22%   90.21%   -0.02%     
==========================================
  Files          20       20              
  Lines        3674     3670       -4     
==========================================
- Hits         3315     3311       -4     
  Misses        359      359              

@hameerabbasi hameerabbasi enabled auto-merge (squash) February 20, 2024 07:53
@hameerabbasi hameerabbasi merged commit cb6b604 into pydata:main Feb 20, 2024
12 checks passed
@jcapriot jcapriot deleted the gcsx_ndarray_dot branch February 20, 2024 17:26