Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup CAReduce C-implementation with loop reordering #971

Merged
merged 3 commits into from
Aug 21, 2024

Conversation

ricardoV94
Copy link
Member

@ricardoV94 ricardoV94 commented Aug 12, 2024

Description

This PR adds runtime loop re-ordering for CAReduce operations in the C-backend, to alleviate the bad performance reported in #935

Using the new benchmark test, I confirm substantial speedup on summation along leading axis (30-110x) depending on C-contiguity or not. A full reduction on a non C-contiguous array is also 7.5x faster than before

On the down side, there are some slowdowns when summing the last axis, but those are smaller in magnitude (2-3.3x). This is probably harder for the compiler to optimize due to the many to one location when iterating over the array (not trivially parallelizable).

The old speedup had probably to do with defining a temporary allocation variable and only writing it to the output after consuming all the reduced axis at each step. I didn't want to add a special case for this (only works when the reduced axes are the ones with smallest strides).

If people are concerned I can investigate adding a special case. However I'm already happy with reducing the worst case scenarios, which were really lame.

Name (time in ms)                                                 Before                 After        Speedup    Slowdown
-------------------------------------------------------------------------------------------------------------------------
test_careduce_benchmark[c_contiguous=True-axis=0]           1,205.7843 (82.38)      11.3370 (1.41)    109x
test_careduce_benchmark[c_contiguous=True-axis=1]              29.2075 (2.00)        8.5521 (1.06)    3.4x
test_careduce_benchmark[c_contiguous=True-axis=2]              14.6362 (1.0)        48.1791 (5.98)                3.3x
test_careduce_benchmark[c_contiguous=True-axis=(0, 1)]        104.3763 (7.13)        9.4987 (1.18)    11x
test_careduce_benchmark[c_contiguous=True-axis=(0, 2)]         21.9930 (1.50)       48.9057 (6.07)                2.2x
test_careduce_benchmark[c_contiguous=True-axis=(1, 2)]         15.1861 (1.04)       48.8651 (6.07)                3.2x
test_careduce_benchmark[c_contiguous=True-axis=None]           15.1642 (1.04)       14.2544 (1.77)
                                                                                                      
test_careduce_benchmark[c_contiguous=False-axis=0]             14.7053 (1.00)       46.6626 (5.79)                3.2x
test_careduce_benchmark[c_contiguous=False-axis=1]          1,060.9745 (72.49)     109.3699 (13.58)   9.7x
test_careduce_benchmark[c_contiguous=False-axis=2]            107.0985 (7.32)       93.6451 (11.62)   
test_careduce_benchmark[c_contiguous=False-axis=(0, 1)]     1,332.4760 (91.04)      46.5326 (5.78)    28.6x
test_careduce_benchmark[c_contiguous=False-axis=(0, 2)]        28.8102 (1.97)       49.1205 (6.10)                1.7x
test_careduce_benchmark[c_contiguous=False-axis=(1, 2)]       103.2976 (7.06)        8.0562 (1.0)     12.9x
test_careduce_benchmark[c_contiguous=False-axis=None]         107.4091 (7.34)       14.3483 (1.78)    7.5x
-------------------------------------------------------------------------------------------------------------------------

Related Issue

Checklist

Type of change

  • New feature / enhancement
  • Bug fix
  • Documentation
  • Maintenance
  • Other (please specify):

@ricardoV94 ricardoV94 force-pushed the improve_careduce_speed branch 2 times, most recently from bf9f6e6 to 054a5e4 Compare August 13, 2024 08:35
@ricardoV94 ricardoV94 force-pushed the improve_careduce_speed branch 2 times, most recently from ab1fbdd to 756170b Compare August 13, 2024 11:47
@ricardoV94 ricardoV94 changed the title Improve CAReduce C-implementation with loop reordering Speedup CAReduce C-implementation with loop reordering Aug 13, 2024
@ricardoV94 ricardoV94 marked this pull request as ready for review August 13, 2024 16:34
Copy link

codecov bot commented Aug 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.70%. Comparing base (29183c7) to head (061ed3a).
Report is 9 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #971      +/-   ##
==========================================
+ Coverage   81.69%   81.70%   +0.01%     
==========================================
  Files         182      182              
  Lines       47585    47590       +5     
  Branches    11584    11587       +3     
==========================================
+ Hits        38875    38884       +9     
  Misses       6518     6518              
+ Partials     2192     2188       -4     
Files Coverage Δ
pytensor/tensor/elemwise.py 88.49% <100.00%> (+0.11%) ⬆️
pytensor/tensor/elemwise_cgen.py 96.80% <100.00%> (+1.40%) ⬆️
pytensor/tensor/math.py 91.28% <100.00%> (+<0.01%) ⬆️

... and 2 files with indirect coverage changes

Comment on lines +594 to +598
for(int i = 0; i < n; i++)
{
npy_float64 &acc_i = acc_iter[i];
acc_i = 0;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works because we allow allocate a new contiguous array for the output

@ricardoV94 ricardoV94 merged commit e752fc3 into pymc-devs:main Aug 21, 2024
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants