Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TFactor: bulkerify GEMV (both MC and GPU) #1214

Draft
wants to merge 44 commits into
base: master
Choose a base branch
from

Conversation

albestro
Copy link
Collaborator

@albestro albestro commented Nov 15, 2024

In this PR we try to increase parallelisation of GEMV step of TFactor, which is by construction serial (all results goes into the same block tile), by using workspaces for storing partial results and then reducing them before the final TRMV step.

Algorithmically, the main change is that stepGEMV loop has been replaced with a single stepGEMVAll, and the concept has been applied in a similar way for both backends, MC and GPU:

  • MC: pika::bulk splits input tiles over multiple task, each one stores the partial results in their workspace and just one at the end does the reduction.
  • GPU: similarly to the CPU, the work is forked over different pika tasks, each one computing a partial result on a different GPU stream. These tasks then join into a single task which performs the reduction.

In order to implement this solution, workspaces for intermediate results have been added (there is another option which does not require any additional workspace nor reduction, and that we might explore for MC in another PR).

TODO:

Close #798.

EDIT: since this is conceptually similar to #798 and it is going to be closed as soon as this gets merged, I migrated here the doc fixes happened there.

@albestro albestro added this to the Optimizations milestone Nov 15, 2024
@albestro albestro self-assigned this Nov 15, 2024
@albestro albestro force-pushed the alby/tfactor-optim/bulk branch from 6558780 to bf17fcc Compare November 18, 2024 13:16
Base automatically changed from alby/tfactor-optim/no-gemv-divergence to master November 22, 2024 11:28
@albestro albestro force-pushed the alby/tfactor-optim/bulk branch 4 times, most recently from 37ea75b to 2845339 Compare December 2, 2024 16:48
@albestro
Copy link
Collaborator Author

albestro commented Dec 2, 2024

cscs-ci run

just to check if there is any major problems

@albestro albestro marked this pull request as ready for review December 3, 2024 16:08
Copy link
Collaborator

@rasolca rasolca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Please name new functions in snake case, and rename existing internal functions if possible.

Comment on lines 268 to 269
matrix::Matrix<T, Device::GPU> ws_T({nworkspaces * nrefls_step, nrefls_step},
{nrefls_step, nrefls_step});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All Ts are allocated at scheduling. Better reuse it.

@albestro albestro requested a review from rasolca December 5, 2024 11:22
Copy link
Collaborator

@msimberg msimberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a few questions, nothing blocking.

include/dlaf/eigensolver/bt_reduction_to_band/impl.h Outdated Show resolved Hide resolved
include/dlaf/factorization/qr/api.h Outdated Show resolved Hide resolved
include/dlaf/factorization/qr/t_factor_impl.h Outdated Show resolved Hide resolved
@albestro albestro force-pushed the alby/tfactor-optim/bulk branch from 05dc48d to f83f317 Compare December 5, 2024 15:00

namespace dlaf::factorization::internal {

inline size_t getTFactorNWorkers() noexcept {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this to be snake case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see 36f179d

include/dlaf/tune.h Outdated Show resolved Hide resolved
@albestro albestro force-pushed the alby/tfactor-optim/bulk branch from 0397c9e to d370894 Compare December 9, 2024 09:47
@albestro albestro requested a review from rasolca December 9, 2024 09:47
@albestro
Copy link
Collaborator Author

cscs-ci run

@albestro albestro marked this pull request as draft December 11, 2024 16:15
@albestro
Copy link
Collaborator Author

Converted to draft in order to prevent merging, since I'm still doing some checks.

@albestro albestro force-pushed the alby/tfactor-optim/bulk branch from f0638e1 to 5997e52 Compare December 17, 2024 09:10
in case of odd division of tiles, it might end up being selected but not used.
@albestro
Copy link
Collaborator Author

cscs-ci run

@codecov-commenter
Copy link

codecov-commenter commented Dec 17, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 91.76471% with 14 lines in your changes missing coverage. Please review.

Project coverage is 94.29%. Comparing base (b82a100) to head (d199b88).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
include/dlaf/factorization/qr/t_factor_impl.h 88.59% 10 Missing and 3 partials ⚠️
src/tune.cpp 0.00% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1214      +/-   ##
==========================================
- Coverage   94.38%   94.29%   -0.10%     
==========================================
  Files         154      155       +1     
  Lines        9374     9469      +95     
  Branches     1160     1169       +9     
==========================================
+ Hits         8848     8929      +81     
- Misses        320      331      +11     
- Partials      206      209       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@albestro albestro force-pushed the alby/tfactor-optim/bulk branch from 2f90f23 to d199b88 Compare December 20, 2024 15:14
@albestro
Copy link
Collaborator Author

albestro commented Jan 7, 2025

cscs-ci run

@albestro
Copy link
Collaborator Author

albestro commented Jan 7, 2025

Apart the fixes committed between the two last CI runs, which fix for sure some problems, one major difference between the CI runs is that the oldest one ran on Eiger while the latest ran on Daint (because of the rebase on master).

The oldest one on Eiger had some failures that are concerning, specifically all the CPU UNIT_RANK_6, which failed due to a timeout on Eigensolver/GenEigensolver MC Distributed test (example)

I will try to investigate why that's the case and why it haven't happened on Daint.

@albestro albestro mentioned this pull request Jan 8, 2025
2 tasks

inline size_t get_tfactor_nworkers() noexcept {
// Note: precautionarily we leave at least 1 thread "free" to do other stuff (if possible)
const std::size_t available_workers = pika::resource::get_thread_pool("default").get_os_thread_count();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not make sense, since the worker is backend dependent (i.e. on MC a worker is a thread, on a GPU is a stream).

@albestro albestro force-pushed the alby/tfactor-optim/bulk branch from bc153ce to 152f8e7 Compare January 21, 2025 16:40
@albestro albestro force-pushed the alby/tfactor-optim/bulk branch from 152f8e7 to 0d2bd4f Compare January 21, 2025 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

4 participants