-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TFactor: bulkerify GEMV (both MC and GPU) #1214
base: master
Are you sure you want to change the base?
Conversation
6558780
to
bf17fcc
Compare
37ea75b
to
2845339
Compare
cscs-ci run just to check if there is any major problems |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Please name new functions in snake case, and rename existing internal functions if possible.
matrix::Matrix<T, Device::GPU> ws_T({nworkspaces * nrefls_step, nrefls_step}, | ||
{nrefls_step, nrefls_step}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All Ts are allocated at scheduling. Better reuse it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a few questions, nothing blocking.
05dc48d
to
f83f317
Compare
|
||
namespace dlaf::factorization::internal { | ||
|
||
inline size_t getTFactorNWorkers() noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this to be snake case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 36f179d
0397c9e
to
d370894
Compare
cscs-ci run |
Converted to draft in order to prevent merging, since I'm still doing some checks. |
taus are set for gemv, but then full tile is reduced, hence also the diag with taus.
tile_t might get moved by first worker before any other work is able to use its size
f0638e1
to
5997e52
Compare
in case of odd division of tiles, it might end up being selected but not used.
cscs-ci run |
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #1214 +/- ##
==========================================
- Coverage 94.38% 94.29% -0.10%
==========================================
Files 154 155 +1
Lines 9374 9469 +95
Branches 1160 1169 +9
==========================================
+ Hits 8848 8929 +81
- Misses 320 331 +11
- Partials 206 209 +3 ☔ View full report in Codecov by Sentry. |
add assert also for gpu code that calls add manually
2f90f23
to
d199b88
Compare
cscs-ci run |
Apart the fixes committed between the two last CI runs, which fix for sure some problems, one major difference between the CI runs is that the oldest one ran on Eiger while the latest ran on Daint (because of the rebase on master). The oldest one on Eiger had some failures that are concerning, specifically all the I will try to investigate why that's the case and why it haven't happened on Daint. |
|
||
inline size_t get_tfactor_nworkers() noexcept { | ||
// Note: precautionarily we leave at least 1 thread "free" to do other stuff (if possible) | ||
const std::size_t available_workers = pika::resource::get_thread_pool("default").get_os_thread_count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not make sense, since the worker is backend dependent (i.e. on MC a worker is a thread, on a GPU is a stream).
bc153ce
to
152f8e7
Compare
152f8e7
to
0d2bd4f
Compare
In this PR we try to increase parallelisation of GEMV step of TFactor, which is by construction serial (all results goes into the same block tile), by using workspaces for storing partial results and then reducing them before the final TRMV step.
Algorithmically, the main change is that
stepGEMV
loop has been replaced with a singlestepGEMVAll
, and the concept has been applied in a similar way for both backends, MC and GPU:pika::bulk
splits input tiles over multiple task, each one stores the partial results in their workspace and just one at the end does the reduction.In order to implement this solution, workspaces for intermediate results have been added (there is another option which does not require any additional workspace nor reduction, and that we might explore for MC in another PR).
TODO:
Close #798.
EDIT: since this is conceptually similar to #798 and it is going to be closed as soon as this gets merged, I migrated here the doc fixes happened there.