Speedup CAReduce C-implementation with loop reordering #971

ricardoV94 · 2024-08-12T23:04:09Z

Description

This PR adds runtime loop re-ordering for CAReduce operations in the C-backend, to alleviate the bad performance reported in #935

Using the new benchmark test, I confirm substantial speedup on summation along leading axis (30-110x) depending on C-contiguity or not. A full reduction on a non C-contiguous array is also 7.5x faster than before

On the down side, there are some slowdowns when summing the last axis, but those are smaller in magnitude (2-3.3x). This is probably harder for the compiler to optimize due to the many to one location when iterating over the array (not trivially parallelizable).

The old speedup had probably to do with defining a temporary allocation variable and only writing it to the output after consuming all the reduced axis at each step. I didn't want to add a special case for this (only works when the reduced axes are the ones with smallest strides).

If people are concerned I can investigate adding a special case. However I'm already happy with reducing the worst case scenarios, which were really lame.

Name (time in ms)                                                 Before                 After        Speedup    Slowdown
-------------------------------------------------------------------------------------------------------------------------
test_careduce_benchmark[c_contiguous=True-axis=0]           1,205.7843 (82.38)      11.3370 (1.41)    109x
test_careduce_benchmark[c_contiguous=True-axis=1]              29.2075 (2.00)        8.5521 (1.06)    3.4x
test_careduce_benchmark[c_contiguous=True-axis=2]              14.6362 (1.0)        48.1791 (5.98)                3.3x
test_careduce_benchmark[c_contiguous=True-axis=(0, 1)]        104.3763 (7.13)        9.4987 (1.18)    11x
test_careduce_benchmark[c_contiguous=True-axis=(0, 2)]         21.9930 (1.50)       48.9057 (6.07)                2.2x
test_careduce_benchmark[c_contiguous=True-axis=(1, 2)]         15.1861 (1.04)       48.8651 (6.07)                3.2x
test_careduce_benchmark[c_contiguous=True-axis=None]           15.1642 (1.04)       14.2544 (1.77)
                                                                                                      
test_careduce_benchmark[c_contiguous=False-axis=0]             14.7053 (1.00)       46.6626 (5.79)                3.2x
test_careduce_benchmark[c_contiguous=False-axis=1]          1,060.9745 (72.49)     109.3699 (13.58)   9.7x
test_careduce_benchmark[c_contiguous=False-axis=2]            107.0985 (7.32)       93.6451 (11.62)   
test_careduce_benchmark[c_contiguous=False-axis=(0, 1)]     1,332.4760 (91.04)      46.5326 (5.78)    28.6x
test_careduce_benchmark[c_contiguous=False-axis=(0, 2)]        28.8102 (1.97)       49.1205 (6.10)                1.7x
test_careduce_benchmark[c_contiguous=False-axis=(1, 2)]       103.2976 (7.06)        8.0562 (1.0)     12.9x
test_careduce_benchmark[c_contiguous=False-axis=None]         107.4091 (7.34)       14.3483 (1.78)    7.5x
-------------------------------------------------------------------------------------------------------------------------

Related Issue

Related to Reductions along leading axes can be incredibly slow in C and Numba backends #935

Checklist

Checked that the pre-commit linting/style checks pass
Included tests that prove the fix is effective or that the new feature works
Added necessary documentation (docstrings and/or example notebooks)
If you are a pro: each commit corresponds to a relevant logical change

Type of change

codecov · 2024-08-13T17:31:14Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.70%. Comparing base (29183c7) to head (061ed3a).
Report is 9 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #971      +/-   ##
==========================================
+ Coverage   81.69%   81.70%   +0.01%     
==========================================
  Files         182      182              
  Lines       47585    47590       +5     
  Branches    11584    11587       +3     
==========================================
+ Hits        38875    38884       +9     
  Misses       6518     6518              
+ Partials     2192     2188       -4

Files	Coverage Δ
pytensor/tensor/elemwise.py	`88.49% <100.00%> (+0.11%)`	⬆️
pytensor/tensor/elemwise_cgen.py	`96.80% <100.00%> (+1.40%)`	⬆️
pytensor/tensor/math.py	`91.28% <100.00%> (+<0.01%)`	⬆️

... and 2 files with indirect coverage changes

ricardoV94 · 2024-08-14T11:09:27Z

pytensor/tensor/elemwise_cgen.py

+                for(int i = 0; i < n; i++)
+                {
+                    npy_float64 &acc_i = acc_iter[i];
+                    acc_i = 0;
+                }


This works because we allow allocate a new contiguous array for the output

ricardoV94 added C-backend performance labels Aug 12, 2024

Cleanup elemwise_cgen.py

6e6a17a

ricardoV94 force-pushed the improve_careduce_speed branch 2 times, most recently from bf9f6e6 to 054a5e4 Compare August 13, 2024 08:35

Add benchmark test for CAReduce

1a944b7

ricardoV94 force-pushed the improve_careduce_speed branch 2 times, most recently from ab1fbdd to 756170b Compare August 13, 2024 11:47

ricardoV94 changed the title ~~Improve CAReduce C-implementation with loop reordering~~ Speedup CAReduce C-implementation with loop reordering Aug 13, 2024

ricardoV94 force-pushed the improve_careduce_speed branch from 79d5600 to cfc8c77 Compare August 13, 2024 16:29

ricardoV94 marked this pull request as ready for review August 13, 2024 16:34

ricardoV94 requested a review from aseyboldt August 13, 2024 16:34

CAReduce loop reordering C-impl

061ed3a

ricardoV94 force-pushed the improve_careduce_speed branch from cfc8c77 to 061ed3a Compare August 13, 2024 16:59

ricardoV94 commented Aug 14, 2024

View reviewed changes

twiecki approved these changes Aug 15, 2024

View reviewed changes

ricardoV94 merged commit e752fc3 into pymc-devs:main Aug 21, 2024
60 checks passed

ricardoV94 added the maintenance label Aug 21, 2024

ricardoV94 mentioned this pull request Aug 24, 2024

Reductions along leading axes can be incredibly slow in C and Numba backends #935

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speedup CAReduce C-implementation with loop reordering #971

Speedup CAReduce C-implementation with loop reordering #971

ricardoV94 commented Aug 12, 2024 •

edited

Loading

codecov bot commented Aug 13, 2024 •

edited

Loading

ricardoV94 Aug 14, 2024

Speedup CAReduce C-implementation with loop reordering #971

Speedup CAReduce C-implementation with loop reordering #971

Conversation

ricardoV94 commented Aug 12, 2024 • edited Loading

Description

Related Issue

Checklist

Type of change

codecov bot commented Aug 13, 2024 • edited Loading

Codecov Report

ricardoV94 Aug 14, 2024

Choose a reason for hiding this comment

ricardoV94 commented Aug 12, 2024 •

edited

Loading

codecov bot commented Aug 13, 2024 •

edited

Loading