Benchmarks auto repetitions #791
Conversation
Do we need to estimate the timing, or can we just use warmup iterations + repetitions until we reach the 0.5 sec?
What's the reasoning for taking total time as the criterion? With statistics in mind, perhaps it is the number of repetitions that's more important. I like total time for other considerations, such as making sure enough samples are collected by a profiler, but that does not seem to be applicable here.
@yhmtsai The repetitions are estimated before the real benchmark run. This way, the timings within the repetition loop can be moved outside to reduce the overhead in that loop.
@Slaedr In my experience, the runtime for small problems (within L1/L2 cache) can vary quite significantly if only a low number of repetitions is used. With this PR I want to get a more stable average runtime for such problems. I should add that the number of repetitions is now also exported in the JSON output.
How about we start off with warmup + a fixed number of iterations, compute the standard deviation of the runtime in the fixed set, and if that exceeds a certain value relative to the average, we increase the number of iterations by some scheme? I'll check what quick-bench, GBench and nonius are doing there.
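That scheme could be sketched as follows; this is a hypothetical illustration (the helper names and the 5% threshold are my assumptions, not code from this PR):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Relative standard deviation (sample stddev divided by the mean) of a
// batch of per-repetition runtimes.
inline double relative_stddev(const std::vector<double>& samples)
{
    const double n = static_cast<double>(samples.size());
    const double mean =
        std::accumulate(samples.begin(), samples.end(), 0.0) / n;
    double sq_sum = 0.0;
    for (auto s : samples) {
        sq_sum += (s - mean) * (s - mean);
    }
    // sample standard deviation, relative to the mean
    return std::sqrt(sq_sum / (n - 1)) / mean;
}

// Decide whether another (larger) batch of repetitions is needed.
// The 5% threshold is an arbitrary placeholder.
inline bool needs_more_repetitions(const std::vector<double>& samples,
                                   double threshold = 0.05)
{
    return relative_stddev(samples) > threshold;
}
```

The driver would then keep growing the batch of repetitions while `needs_more_repetitions` returns true.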
For BLAS, we use the prepare operation for the estimate. Some prepare functions are empty. Is that acceptable?
Concerning the pre and post operations, I was also not sure if they are used correctly or are necessary. I guess I will look a bit more into that.
@MarcelKoch Ah, perhaps I see your logic. The total time is a good rule-of-thumb for estimating a good number of repetitions. If something takes very little time, we are likely to need many repeats, which your rule will do. If something takes a lot of time, perhaps it needs only a few repeats, which again is factored into your scheme. There might be some corner cases though, where some poorly implemented algorithm takes a lot of time, but its runtime is still quite variable. I think what @upsj proposed might be more statistically sound; I remember doing that manually for some of my studies in the past.
@Slaedr Yes exactly, the runtime is just used as a rule of thumb. I picked up 0.5s as a reasonable minimal runtime somewhere, although I'm not sure where exactly. Whether that runtime is too high or too low can be discussed. @upsj Running these kinds of statistical tests seems a bit overkill to me. At least in my experience, the variation was quite low for these larger runtimes (>= 0.5s), assuming that the machine does not use frequency scaling. If that is enabled, benchmarks can be quite unreliable, so I just ignored that case.
So nonius provides really powerful analysis, but also mainly analyzes tiny pieces of code, so the methods they use there (bootstrapping) don't make much sense on such small sample sizes. We have a certain variety of runtimes (going from fast to slow: overhead benchmarks (nanoseconds), BLAS, SpMV, preconditioners, solvers (seconds)), so I think it might make sense to have something robust that works on all of them. Requiring SpMVs to run for 0.5s looks like a lot of overhead to me, especially since GPUs often have much less variability than CPUs.
Remembering my statistics lectures back in the day: if we want to reduce the standard deviation by a factor of 2, we need to run 4x as many benchmarks. So I guess my suggestion would be
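This rule of thumb is just the standard error of the mean: averaging over $n$ i.i.d. samples with standard deviation $\sigma$ gives

```latex
\sigma_{\bar{x}}(n) = \frac{\sigma}{\sqrt{n}},
\qquad\text{so}\qquad
\sigma_{\bar{x}}(4n) = \frac{\sigma}{\sqrt{4n}} = \frac{1}{2}\,\sigma_{\bar{x}}(n),
```

i.e. halving the deviation of the reported average indeed requires 4x the repetitions.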
I definitely agree with most of what is said here. I remember, at the very beginning when we were considering using Google Benchmark, the obscure algorithm they used for exactly this. IMHO, either you use the same number of runs for matrices of comparable time scale (like what we do now), or you use an algorithm like the one Tobias outlined, which you apply equally all the time to reach the same timing accuracy all the time, at the risk of some benchmarks taking a ridiculous amount of time.
On the cache-effects issue for small problems, another, probably more accurate, approach is to have proper cache warmup and cache flushing strategies (depending on the context) to stabilize the timings; see this excellent paper on the topic: https://homes.sice.indiana.edu/rcwhaley/papers/timing_SPE08.pdf Of course, there are still important potential performance issues which can come into play (particularly for large data sets), like process placement, turbo boost and other speed-scaling effects.
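The cache-flushing idea from that paper can be sketched like this (the 64 MiB default and the 64-byte line size are assumptions for illustration; a real harness would query the actual last-level cache size):

```cpp
#include <cstddef>
#include <vector>

// Stream through a buffer larger than the last-level cache so previously
// used data is evicted before a "cold cache" timing. Returns a checksum so
// the traversal is not optimized away.
inline std::size_t flush_cache(std::size_t llc_bytes = 64 << 20)
{
    static std::vector<char> buffer;
    buffer.assign(llc_bytes, 1);
    std::size_t sum = 0;
    // touch every (assumed 64-byte) cache line of the buffer
    for (std::size_t i = 0; i < buffer.size(); i += 64) {
        sum += static_cast<std::size_t>(buffer[i]);
    }
    return sum;
}
```

Calling `flush_cache()` between repetitions would turn the benchmark into a worst-case (cold cache) measurement rather than the best-case one discussed below.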
I guess I don't fully understand the purpose of these benchmarks. I interpreted them as a quick check to see if some changes significantly impact the performance in either way. (With this I mean that the purpose of the benchmarks is not to detect performance changes in the range 5%-1%) For such a broad comparison, I think this approach is reasonable, both the runtime based and statistics based approach. |
Concerning @upsj's approach: I'm a bit unsure about the specifics. First, what would be a good threshold for the relative stddev? I would guess 1, but I have nothing to support that guess. This is also connected to the underlying distribution, where I'm also not sure which one to assume. The normal distribution would be the easiest choice, but again I have nothing to support that assumption.
I think we should be able to catch all our current cases (quick benchmarks, slow benchmarks) by providing a maximum repetition count used only for the repetition estimate. If the benchmark is slow, then we will choose the number of iterations based on the runtime. If it is fast, we will choose it so we have a sufficient number of runs (let's say 100), but still stay well below 0.5s.
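This clamping could look roughly like the following sketch (function name, flag defaults, and the 0.5 s target are illustrative assumptions, not the PR's actual values):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Slow benchmarks get a repetition count derived from the target total
// runtime; fast benchmarks are capped at max_reps, so they still get a
// sufficient but bounded number of runs and stay well below the target.
inline std::int64_t estimate_repetitions(double single_runtime_sec,
                                         double min_total_runtime_sec = 0.5,
                                         std::int64_t min_reps = 10,
                                         std::int64_t max_reps = 100)
{
    const auto by_runtime = static_cast<std::int64_t>(
        std::ceil(min_total_runtime_sec / single_runtime_sec));
    return std::clamp(by_runtime, min_reps, max_reps);
}
```

For example, a microsecond-scale kernel is capped at 100 repetitions instead of the 500,000 the runtime rule alone would demand.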
label! |
Force-pushed from d468a76 to 68843e5 (compare)
I've incorporated a couple of suggestions from this thread. Now there is more control over the adaptive benchmarking with the additional flags.
Also, larger (or rather slower) benchmarks are now handled correctly, i.e. only the minimal requested number of repetitions is executed. Therefore, I've also enabled the adaptive behavior for the solver benchmark. For the preconditioner, this approach is not possible, since the total runtime is not updated within the repetitions loop. Therefore, I've added a warning if
On the implementation side, the usage is quite similar to google benchmark, i.e. the following code is valid:

```cpp
IterationControl ic(timer);
for (auto status : ic.run()) {
    timer->tic();
    // run benchmark
    timer->toc();
}
```

Additionally,
To clarify: the adaptive benchmarking is only optional and not enabled by default.
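A minimal sketch of how such a range-for iteration controller could work; this is a simplified, hypothetical interface (iterate until both a minimal repetition count and a minimal total runtime are reached), not the actual `IterationControl` from the PR:

```cpp
#include <chrono>
#include <cstdint>

class iteration_control {
public:
    iteration_control(std::int64_t min_reps, double min_runtime_sec)
        : min_reps_{min_reps}, min_runtime_{min_runtime_sec}
    {}

    struct iterator {
        iteration_control* ic;  // nullptr marks the end sentinel
        // the loop continues as long as the stopping criteria are unmet
        bool operator!=(const iterator&) const { return !ic->finished(); }
        void operator++() { ++ic->cur_rep_; }
        std::int64_t operator*() const { return ic->cur_rep_; }
    };

    iterator begin()
    {
        cur_rep_ = 0;
        start_ = std::chrono::steady_clock::now();
        return iterator{this};
    }
    iterator end() { return iterator{nullptr}; }

    std::int64_t num_reps() const { return cur_rep_; }

private:
    bool finished() const
    {
        const auto elapsed = std::chrono::duration<double>(
                                 std::chrono::steady_clock::now() - start_)
                                 .count();
        return cur_rep_ >= min_reps_ && elapsed >= min_runtime_;
    }

    std::int64_t min_reps_;
    double min_runtime_;
    std::int64_t cur_rep_{};
    std::chrono::steady_clock::time_point start_{};
};
```

Usage mirrors the snippet above: `for (auto rep : iteration_control{10, 0.5}) { /* timed kernel */ }`.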
Codecov Report

```diff
@@            Coverage Diff            @@
##           develop     #791    +/-  ##
===========================================
- Coverage    94.37%   94.36%   -0.01%
===========================================
  Files          400      400
  Lines        32096    32097       +1
===========================================
  Hits         30289    30289
- Misses        1807     1808       +1
===========================================
```

Continue to review full report at Codecov.
LGTM, great job! I really like the Google Benchmark-like setup.
Force-pushed from 69f02b4 to 6be0a55 (compare)
```diff
@@ -260,6 +264,7 @@ class CudaTimer : public Timer {
 protected:
     void tic_impl() override
     {
+        exec_->synchronize();
```
this and HipTimer should not have the synchronize
For the `tic` it doesn't matter, right? This is not timed, because it's before the eventRecord, and it could even help in ensuring nothing is running on the GPU when we start running things, like if there was a copy previously (for example, `x_clone`).
The event should also be after the memcpy if it involves the GPU.
For me, ensuring nothing is running on the GPU is the job of the code before calling the timer, but it does not hurt the timing step.
```cpp
auto x_clone = clone(x);
for (auto status : ic.run(false)) {
    x_clone = clone(x);

    exec->synchronize();
    generate_timer->tic();
```
Maybe we also add tic/toc functions to IterationControl to forward the Timer tic/toc, such that we call tic/toc on the IC and get output from the same place as the others?
First thing: you can already use `ic.compute_average_timings` to get the output, since `ic` internally has a copy of the shared ptr for the timer (but only for the apply timer). However, that seems a bit awkward.
From my viewpoint, IC should not handle any timings in this instance, as `manage_timings` is set to `false`. Adding timing functions to IC would weaken the distinction to the managed case. If the non-managed IC run is requested, the user should take care of the timings.
It makes sense from this point of view.
My point is that the iteration control needs the results from the corresponding timer, so it can be used incorrectly.
But I also realize that this may make the tic/toc from TimerManager different?
Maybe we need to add a comment that the corresponding timer has to be used, such that the iteration control can really check it?
Perhaps I could add a `get_timer` to IC. That should help to use the correct timer. Otherwise, I can't think of a graceful way of adding tic/toc directly to IC.
Yeah, it is good.
```cpp
auto x_clone = clone(x);
for (auto _ : ic_tuning.run()) {
```
This changes the behavior: we refreshed the memory every time before.
Does that matter? SpMVs should only write to `x` after all.
No, for the result.
It depends on what memory status we want for the benchmark.
Should the memory always be a fresh location (from the software point of view only), or should the allocation just exist before the operations?
At least from the caching standpoint that would probably not make a difference, as the `clone` calls `memcpy`, which might, depending on the implementation, already move the data into cache.
I also think that this change should be fine, especially if you consider this as a best-case benchmark, i.e. the data is already in the appropriate caches. Considering the worst-case, i.e. at the beginning of each SpMV the data is not cached, is more difficult in general and would probably require more adjustments, especially wrt Tobias' comment.
```cpp
auto x_clone = clone(x);
for (auto _ : ic.run()) {
```
also here
benchmark/utils/general.hpp (Outdated)

```cpp
              "runtime is larger than 'min_runtime'");

DEFINE_double(min_runtime, 0.05,
              "If 'repetitions = auto' is used, the minimal runtime of"
```
"If 'repetitions = auto' is used, the minimal runtime of" | |
"If 'repetitions = auto' is used, the minimal runtime (seconds) of" |
benchmark/utils/general.hpp (Outdated)

````cpp
 * ```
 * auto timer = get_timer(...);
 * IterationControl ic(timer);
 * for(auto status: ic.[warmup_run|run](manage_timings [default is true])){
````
Is the manage_timings parameter also available for warmup_run?
I have not added the parameter there, as the warmup run always uses a fixed number of repetitions. I will clarify the documentation.
benchmark/utils/general.hpp (Outdated)

```cpp
 * Uses the commandline flags to setup the stopping criteria for the
 * warmup and timed run.
 *
 * @param timer the same timer that is to be used for the timings
```
```diff
- * @param timer the same timer that is to be used for the timings
+ * @param timer the same timer that is to be used for the timings
```
benchmark/utils/general.hpp (Outdated)

```cpp
run_control warmup_run()
{
    status_warmup_.cur_it = 0;
    status_warmup_.timer->clear();
```
```diff
     status_warmup_.timer->clear();
+    status_warmup_.timer->manage_timings = false;
```
I realize there's no change compared to the other function, so it is fine.
benchmark/utils/general.hpp (Outdated)

```cpp
// emulate shared_ptr behavior
const TimerManager *operator->() const { return this; }
TimerManager *operator->() { return this; }
```
what is it used for?
That was used for being lazy 😄 I will remove that, and adjust the rest accordingly.
benchmark/utils/general.hpp (Outdated)

```cpp
void tic()
{
    if (manage_timings) timer->tic();
```
```diff
-    if (manage_timings) timer->tic();
+    if (manage_timings) {
+        timer->tic();
+    }
```

Also apply to the next one, from the gko ref.
Is that part of the .clang-format specification? If not, perhaps it should be added there.
It is not in the current .clang-format; we only mention it in the contribution guidelines.
Does clang-format support this after version 6?
After a bit of digging, it seems like clang-format (up to 13) still does not support this, but there is a PR for it here: https://reviews.llvm.org/D95168
So it seems like some future version will support it.
There is also a workaround using clang-tidy, but that is overkill (https://stackoverflow.com/a/28437960).
Force-pushed from e398d62 to 74476ad (compare)
LGTM!
benchmark/utils/general.hpp (Outdated)

```cpp
cur_info->managed_timer.toc();
stopped = true;
next_timing =
    static_cast<IndexType>(std::ceil(next_timing * 1.5));
```
Maybe make the 1.5 controllable for extreme cases?
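The effect of that growth factor can be illustrated with a small sketch (hypothetical helper, not PR code): instead of checking the elapsed time after every repetition, the elapsed time is only checked at geometrically spaced repetition counts, so timer overhead shrinks as the run gets longer.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Repetition indices at which the elapsed time would be checked, grown
// geometrically by `growth` (must be > 1 so the sequence always advances).
inline std::vector<std::int64_t> timing_checkpoints(std::int64_t max_reps,
                                                    double growth = 1.5)
{
    std::vector<std::int64_t> checkpoints;
    double next = 1.0;
    while (static_cast<std::int64_t>(next) <= max_reps) {
        checkpoints.push_back(static_cast<std::int64_t>(next));
        // ceil guarantees strictly increasing integer checkpoints
        next = std::ceil(next * growth);
    }
    return checkpoints;
}
```

With the default 1.5 this checks at repetitions 1, 2, 3, 5, 8, 12, ...; making `growth` a flag (as suggested) would let extreme cases trade timing overhead against how far the run can overshoot the target.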
LGTM, only two comments about missing documentation.
```cpp
 * - 'warmup' warmup iterations, applies in fixed and adaptive case
 * - 'min_repetitions' minimal number of repetitions (adaptive case)
 * - 'max_repetitions' maximal number of repetitions (adaptive case)
 * - 'min_runtime' minimal total runtime (adaptive case)
```
This misses `repetition_growth_factor`.
benchmark/utils/general.hpp (Outdated)

```cpp
 * - `warmup_run()`: controls run defined by `warmup` flag
 * - `run(bool)`: controls run defined by all other flags
 * - `get_timer()`: access to underlying timer
 * Both methods return an object that is to be used in a range-based for loop:
```
This was not updated.
benchmark/utils/general.hpp (Outdated)

```cpp
 * - 'min_repetitions' minimal number of repetitions (adaptive case)
 * - 'max_repetitions' maximal number of repetitions (adaptive case)
 * - 'min_runtime' minimal total runtime (adaptive case)
 * - 'repetitions_growth_factor' controls the increase between two successive
```
nit:

```diff
- * - 'repetitions_growth_factor' controls the increase between two successive
+ * - 'repetition_growth_factor' controls the increase between two successive
```

or change the gflags name.
Needs a rebase before merge, and you should be able to see two pipelines running on GitLab after the next push.
This reworks the previous adaptive benchmarking. Now, the number of iterations is determined on-the-fly instead of beforehand. Also, new command-line flags have been added to allow for greater control over the adaptive benchmarks. The usage is similar to google benchmark. Currently, it is not possible to use this approach for the preconditioner benchmark, as it does not update the runtime in each iteration.
This changes the preconditioner benchmark to time each preconditioner apply/generate individually, unifying the timing approach across all benchmarks.
- adds more documentation
- minor formatting
- made `status` private s.t. it is not part of the public interface of `IterationControl`

Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Now the `run_control` object also controls taking the timings; the user does not need to issue the timings by hand anymore. This allows using increasingly larger intervals between two timings, until the benchmark run is finished. Drawback: everything within the `ic.run()` loop gets timed; parts that should be exempt need to be moved outside of the loop.
Internally, this uses a thin wrapper class for the `timer` object, which just skips the `tic`/`toc` calls if the `run_control` object does not manage the timings. In that case, the timings have to be issued outside, as before.
- clarify documentation
- add accessor to underlying timer
- formatting
- adds flag to choose repetitions growth factor

Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Co-authored-by: Terry Cojean <terry.cojean@kit.edu>
Force-pushed from 6b726f8 to 33ff686 (compare)
Kudos, SonarCloud Quality Gate passed!
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem which enables Intel-GPU and CPU execution. The only Ginkgo features which have not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly memory compression, among other features. The accessor can be used as header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default, this is done through a temporary copy which may have a performance impact but already allows mixed-precision research. Native mixed-precision ELL kernels are implemented which do not see this cost.

The accessor is also leveraged in a new CB-GMRES solver which allows for performance improvements by compressing the Krylov basis vectors. Many other features have been added to Ginkgo, such as reordering support, a new IDR solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.

Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels, also make use of a shared kernel implementation for all executors (except Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things. [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++. [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor). [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extends ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors. [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations. [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new "block column-major" accessor has been added. [#707](#707)
+ Add an heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks. [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management. [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)

Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO, an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation. [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)

Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup. [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup. [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC. [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)

Related PR: #857
Release 1.4.0 to master. Related PR: #866
This PR enables automatic deduction of the number of repetitions for the benchmarks.
For small working sets, the benchmark timings can be overly sensitive to outliers. With this PR, the number of repetitions is estimated such that the whole benchmark run takes at least 0.5 s, which should result in more stable timings for small problems.
If the repetitions are set to `auto`, the warm-up step is skipped.

WIP: This PR enables the new behavior only for the blas, conversion, and spmv benchmarks.
Todo: