Split compilation of dense kernels #1375
Conversation
Kudos, SonarCloud Quality Gate passed!
Codecov Report
Patch coverage:
Additional details and impacted files:

@@            Coverage Diff            @@
##           develop    #1375    +/-  ##
===========================================
+ Coverage    91.18%   91.19%   +0.01%
===========================================
  Files          600      608       +8
  Lines        50695    50693       -2
===========================================
+ Hits         46228    46232       +4
+ Misses        4467     4461       -6
===========================================

☔ View full report in Codecov by Sentry.
Can we somehow make this automatic instead of manually splitting this into files? I think this unnecessarily complicates our file structure. How about something like a .tpp file marked with macros, using CMake to automatically generate the separate compilation units at build time to enable parallelism?
@pratikvn I don't believe that's possible in the way we do it, e.g. for Jacobi. We would most likely need to do this on the individual sub-configuration instantiations like we do with jacobi_kernels.cu, which is not really possible with the current file structure of the reduction kernels. At least it would mean separating the kernel and its configurations into two different files.
Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
@pratikvn or maybe I am misunderstanding your suggestion; can you give an example?
I think some script could split all functions marked with the GKO_INSTANTIATION macro into different files: each instantiation creates a file, and those files are then compiled separately.
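For illustration, a rough sketch of what such generated translation units could look like (the file names, the compute_dot kernel, and its signature here are hypothetical, not Ginkgo's actual code):

```cpp
// dense_reduction.tpp -- shared template definition (hypothetical file),
// included by every generated translation unit
#include <cstddef>

template <typename ValueType>
void compute_dot(const ValueType* x, const ValueType* y, std::size_t n,
                 ValueType* result)
{
    ValueType sum{};
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    *result = sum;
}

// dense_reduction_double.cpp -- a script or CMake would generate one such
// file per instantiation, so the build can compile them in parallel
#include "dense_reduction.tpp"

template void compute_dot<double>(const double*, const double*, std::size_t,
                                  double*);
```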
Also, how much of an effect does this splitting have? If possible, I would like to avoid this fragmentation. I think it makes finding functions quite hard, and it is also not really scalable, as each reduction operation now has its own file.
Ah thanks, I like the idea. We only have a single file whose compile time is long enough for this to be necessary, though; maybe we can pick this idea up again if we have more such files at some point?
@pratikvn Compiling the full file takes 5m53.388s; compiling the individual files in parallel takes 1m1.452s. In my tests, the total build was stuck for at least 2-3 minutes on this single file.
In general I would like to avoid this type of fragmentation. But given that the benefit seems to be large, it looks acceptable to me as a temporary measure. Still, I think we should try to do this automatically if possible, so that it is scalable and the build system can handle the splitting into different instantiation units. See the suggestion from @yhmtsai above.
@pratikvn this is not meant as a temporary measure. We already do the same thing for the Jacobi and ParILUT kernels. If we want to split things into definition and instantiation, we don't really need automation (and I don't see any advantage in it), but it still leads to the same number of files, with the small disadvantage that, in my experience, IDEs have a harder time displaying information about function templates when no instantiations are available.
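For reference, a definition/instantiation split in plain C++ usually looks like the following minimal sketch (generic names, not Ginkgo's actual macros or kernels). The extern template declarations keep other translation units from re-instantiating the template, which is also why an IDE may have no definition in view:

```cpp
// reduce.hpp -- declaration only; the extern template lines tell
// including files not to instantiate the template themselves
#include <cstddef>

template <typename ValueType>
void reduce(const ValueType* in, std::size_t n, ValueType* out);

extern template void reduce<float>(const float*, std::size_t, float*);
extern template void reduce<double>(const double*, std::size_t, double*);

// reduce.tpp -- holds the template definition, kept out of the header.
// reduce_float.cpp / reduce_double.cpp -- one explicit instantiation per
// translation unit, each doing:
//
//     #include "reduce.tpp"
//     template void reduce<float>(const float*, std::size_t, float*);
```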
Small note: we may need to do something similar for csr_kernels (at least in HIP), because I currently have a debug build taking 56 minutes.
Splitting definition and instantiation generates more files than the original layout does.
After spending way too much time on this, I think I can conclude that making this work semi-automatically is not worth the effort. The only alternative is what Mike suggested, i.e. putting all template instantiations into separate files and keeping the code in one place. What do you think about that?
superseded by #1378 |
The compilation time, especially for DPC++, is dominated by dense_kernels.cpp, more precisely the reduction kernels with their large number of combinations. By splitting them up, we can improve the parallel efficiency of the builds significantly. I think these changes are also part of #972.
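The resulting layout follows a simple per-kernel pattern. A minimal sketch of one such file, with hypothetical names standing in for the shared reduction infrastructure the real files use:

```cpp
// compute_norm2_kernels.dp.cpp -- one reduction operation per file, so
// several such files can be compiled in parallel
#include <cmath>
#include <cstddef>

template <typename ValueType>
void compute_norm2(const ValueType* x, std::size_t n, ValueType* result)
{
    ValueType sum{};
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i] * x[i];
    }
    *result = std::sqrt(sum);
}

// Explicit instantiations covered by this translation unit.
template void compute_norm2<float>(const float*, std::size_t, float*);
template void compute_norm2<double>(const double*, std::size_t, double*);
```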