Benchmarks auto repetitions #791
Conversation
Do we need to estimate the timing, or can we just use warmup iterations + repetitions until we reach the 0.5 sec?
What's the reasoning for taking total time as the criterion? With statistics in mind, perhaps it is the number of repetitions that's more important. I like total time for other considerations, such as making sure enough samples are collected by a profiler, but that does not seem to be applicable here.
@yhmtsai The repetitions are estimated before the real benchmark run. This way, the timings within the repetition loop can be moved outside to reduce the overhead in that loop.
@Slaedr In my experience, the runtime for small problems (within L1/L2 cache) can vary quite significantly if only a low number of repetitions is used. With this PR I want to get a more stable average runtime for such problems. I should add that the number of repetitions is now also exported in the JSON output.
How about we start off with warmup + a fixed number of iterations, compute the standard deviation of the runtime in the fixed set, and if that exceeds a certain value relative to the average, we increase the number of iterations by some scheme? I'll check what quick-bench, GBench and nonius are doing there.
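That scheme could be sketched as follows; this is a hypothetical illustration (the helper names and the 5% threshold are my assumptions, not code from this PR):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Relative standard deviation (sample stddev divided by the mean) of a
// batch of per-repetition runtimes.
inline double relative_stddev(const std::vector<double>& samples)
{
    const double n = static_cast<double>(samples.size());
    const double mean =
        std::accumulate(samples.begin(), samples.end(), 0.0) / n;
    double sq_sum = 0.0;
    for (auto s : samples) {
        sq_sum += (s - mean) * (s - mean);
    }
    // sample standard deviation, relative to the mean
    return std::sqrt(sq_sum / (n - 1)) / mean;
}

// Decide whether another (larger) batch of repetitions is needed.
// The 5% threshold is an arbitrary placeholder.
inline bool needs_more_repetitions(const std::vector<double>& samples,
                                   double threshold = 0.05)
{
    return relative_stddev(samples) > threshold;
}
```

The driver would then keep growing the batch of repetitions while `needs_more_repetitions` returns true.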
For BLAS, we use the prepare operation for the estimate. Some prepare functions are empty. Is that acceptable?
Concerning the pre and post operations, I was also not sure if they are used correctly or are necessary. I guess I will look a bit more into that.
@MarcelKoch Ah, perhaps I see your logic. The total time is a good rule-of-thumb for estimating a good number of repetitions. If something takes very little time, we are likely to need many repeats, which your rule will do. If something takes a lot of time, perhaps it needs only a few repeats, which again is factored into your scheme. There might be some corner cases though, where some poorly implemented algorithm takes a lot of time, but its runtime is still quite variable. I think what @upsj proposed might be more statistically sound; I remember doing that manually for some of my studies in the past.
@Slaedr Yes exactly, the runtime is just used as a rule of thumb. I picked up 0.5s as a reasonable minimal runtime somewhere, although I'm not sure where exactly. Whether that runtime is too high or too low can be discussed. @upsj Running these kinds of statistical tests seems a bit overkill to me. At least in my experience, the variation was quite low for these larger runtimes (>= 0.5s), assuming that the machine does not use frequency scaling. If that is enabled, benchmarks can be quite unreliable, so I just ignored that case.
So nonius provides really powerful analysis, but also mainly analyzes tiny pieces of code, so the methods they use there (bootstrapping) don't make much sense on such small sample sizes. We have a certain variety of runtimes (going from fast to slow: overhead benchmarks (nanoseconds), BLAS, SpMV, preconditioners, solvers (seconds)), so I think it might make sense to have something robust that works on all of them. Requiring SpMVs to run for 0.5s looks like a lot of overhead to me, especially since GPUs often have much less variability than CPUs.
Remembering my statistics lectures back in the day: if we want to reduce the standard deviation by a factor of 2, we need to run 4x as many benchmarks. So I guess my suggestion would be
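This rule of thumb is just the standard error of the mean: averaging over $n$ i.i.d. samples with standard deviation $\sigma$ gives

```latex
\sigma_{\bar{x}}(n) = \frac{\sigma}{\sqrt{n}},
\qquad\text{so}\qquad
\sigma_{\bar{x}}(4n) = \frac{\sigma}{\sqrt{4n}} = \frac{1}{2}\,\sigma_{\bar{x}}(n),
```

i.e. halving the deviation of the reported average indeed requires 4x the repetitions.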
I definitely agree with most of what is said here. I remember, at the very beginning when we were considering using Google Benchmark, the obscure algorithm they used for exactly this. IMHO, either you use the same number of runs for matrices of comparable time scale (like what we do now), or you use an algorithm like the one Tobias outlined, which you apply equally all the time to reach the same timing accuracy all the time, at the risk of some benchmarks taking a ridiculous amount of time.
On the cache-effects issue for small problems, another, probably more accurate, approach is to have proper cache warmup and cache flushing strategies (depending on the context) to stabilize the timings; see this excellent paper on the topic: https://homes.sice.indiana.edu/rcwhaley/papers/timing_SPE08.pdf Of course, there are still important potential performance issues which can come into play (particularly for large data sets), like process placement, turbo boost and other speed-scaling effects.
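The cache-flushing idea from that paper can be sketched like this (the 64 MiB default and the 64-byte line size are assumptions for illustration; a real harness would query the actual last-level cache size):

```cpp
#include <cstddef>
#include <vector>

// Stream through a buffer larger than the last-level cache so previously
// used data is evicted before a "cold cache" timing. Returns a checksum so
// the traversal is not optimized away.
inline std::size_t flush_cache(std::size_t llc_bytes = 64 << 20)
{
    static std::vector<char> buffer;
    buffer.assign(llc_bytes, 1);
    std::size_t sum = 0;
    // touch every (assumed 64-byte) cache line of the buffer
    for (std::size_t i = 0; i < buffer.size(); i += 64) {
        sum += static_cast<std::size_t>(buffer[i]);
    }
    return sum;
}
```

Calling `flush_cache()` between repetitions would turn the benchmark into a worst-case (cold cache) measurement rather than the best-case one discussed below.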
I guess I don't fully understand the purpose of these benchmarks. I interpreted them as a quick check to see if some changes significantly impact the performance in either way. (With this I mean that the purpose of the benchmarks is not to detect performance changes in the range 5%-1%) For such a broad comparison, I think this approach is reasonable, both the runtime based and statistics based approach. |
Concerning @upsj's approach: I'm a bit unsure about the specifics. First, what would be a good threshold for the relative stddev? I would guess 1, but I have nothing to support that guess. This is also connected to the underlying distribution, where I'm also not sure which one to assume. The normal distribution would be the easiest choice, but again I have nothing to support that assumption.
I think we should be able to catch all our current cases (quick benchmarks, slow benchmarks) by providing a maximum repetition count used only for the repetition estimate. If the benchmark is slow, then we will choose the number of iterations based on the runtime. If it is fast, we will choose it so we have a sufficient number of runs (let's say 100), but still stay well below 0.5s.
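This clamping could look roughly like the following sketch (function name, flag defaults, and the 0.5 s target are illustrative assumptions, not the PR's actual values):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Slow benchmarks get a repetition count derived from the target total
// runtime; fast benchmarks are capped at max_reps, so they still get a
// sufficient but bounded number of runs and stay well below the target.
inline std::int64_t estimate_repetitions(double single_runtime_sec,
                                         double min_total_runtime_sec = 0.5,
                                         std::int64_t min_reps = 10,
                                         std::int64_t max_reps = 100)
{
    const auto by_runtime = static_cast<std::int64_t>(
        std::ceil(min_total_runtime_sec / single_runtime_sec));
    return std::clamp(by_runtime, min_reps, max_reps);
}
```

For example, a microsecond-scale kernel is capped at 100 repetitions instead of the 500,000 the runtime rule alone would demand.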
label! |
Force-pushed from d468a76 to 68843e5 (compare)
I've incorporated a couple of suggestions from this thread. Now there is more control over the adaptive benchmarking with the additional flags.
Also, larger (or rather slower) benchmarks are now handled correctly, i.e. only the minimal requested number of repetitions is executed. Therefore, I've also enabled the adaptive behavior for the solver benchmark. For the preconditioner, this approach is not possible, since the total runtime is not updated within the repetitions loop. Therefore, I've added a warning if
On the implementation side, the usage is quite similar to google benchmark, i.e. the following code is valid:

```cpp
IterationControl ic(timer);
for (auto status : ic.run()) {
    timer->tic();
    // run benchmark
    timer->toc();
}
```

Additionally,
To clarify: the adaptive benchmarking is only optional and not enabled by default.
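A minimal sketch of how such a range-for iteration controller could work; this is a simplified, hypothetical interface (iterate until both a minimal repetition count and a minimal total runtime are reached), not the actual `IterationControl` from the PR:

```cpp
#include <chrono>
#include <cstdint>

class iteration_control {
public:
    iteration_control(std::int64_t min_reps, double min_runtime_sec)
        : min_reps_{min_reps}, min_runtime_{min_runtime_sec}
    {}

    struct iterator {
        iteration_control* ic;  // nullptr marks the end sentinel
        // the loop continues as long as the stopping criteria are unmet
        bool operator!=(const iterator&) const { return !ic->finished(); }
        void operator++() { ++ic->cur_rep_; }
        std::int64_t operator*() const { return ic->cur_rep_; }
    };

    iterator begin()
    {
        cur_rep_ = 0;
        start_ = std::chrono::steady_clock::now();
        return iterator{this};
    }
    iterator end() { return iterator{nullptr}; }

    std::int64_t num_reps() const { return cur_rep_; }

private:
    bool finished() const
    {
        const auto elapsed = std::chrono::duration<double>(
                                 std::chrono::steady_clock::now() - start_)
                                 .count();
        return cur_rep_ >= min_reps_ && elapsed >= min_runtime_;
    }

    std::int64_t min_reps_;
    double min_runtime_;
    std::int64_t cur_rep_{};
    std::chrono::steady_clock::time_point start_{};
};
```

Usage mirrors the snippet above: `for (auto rep : iteration_control{10, 0.5}) { /* timed kernel */ }`.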
Codecov Report

```diff
@@            Coverage Diff            @@
##           develop     #791    +/-  ##
===========================================
- Coverage    94.37%   94.36%   -0.01%
===========================================
  Files          400      400
  Lines        32096    32097       +1
===========================================
  Hits         30289    30289
- Misses        1807     1808       +1
===========================================
```

Continue to review full report at Codecov.
LGTM, great job! I really like the Google Benchmark-like setup.
Force-pushed from 69f02b4 to 6be0a55 (compare)
```diff
@@ -260,6 +264,7 @@ class CudaTimer : public Timer {
 protected:
     void tic_impl() override
     {
+        exec_->synchronize();
```
this and HipTimer should not have the synchronize
For the `tic` it doesn't matter, right? This is not timed, because it's before the eventRecord, and it could even help in ensuring nothing is running on the GPU when we start running things, like if there was a copy previously (for example, `x_clone`).
The event should also be after the memcpy if it involves the GPU.
For me, ensuring nothing is running on the GPU is the job of the code before calling the timer, but it does not hurt the timing step.
```cpp
auto x_clone = clone(x);
for (auto status : ic.run(false)) {
    x_clone = clone(x);

    exec->synchronize();
    generate_timer->tic();
```
Maybe we also add tic/toc functions to IterationControl to forward the Timer tic/toc, such that we call tic/toc on the IC and get output from the same place as the others?
First thing: you can already use `ic.compute_average_timings` to get the output, since `ic` internally has a copy of the shared ptr for the timer (but only for the apply timer). However, that seems a bit awkward.
From my viewpoint, IC should not handle any timings in this instance, as `manage_timings` is set to `false`. Adding timing functions to IC would weaken the distinction to the managed case. If the non-managed IC run is requested, the user should take care of the timings.
It makes sense from this point of view.
My point is that the iteration control needs the results from the corresponding timer, so it can be used incorrectly.
But I also realize that this may make the tic/toc from TimerManager different?
Maybe we need to add a comment that the corresponding timer has to be used, such that the iteration control can really check it?
Perhaps I could add a `get_timer` to IC. That should help to use the correct timer. Otherwise, I can't think of a graceful way of adding tic/toc directly to IC.
Yeah, it is good.
```cpp
auto x_clone = clone(x);
for (auto _ : ic_tuning.run()) {
```
This changes the behavior: we refreshed the memory every time before.
Does that matter? SpMVs should only write to `x` after all.
No, for the result.
It depends on what memory status we want for the benchmark.
Should the memory always be a fresh location (from the software point of view only), or should the allocation just exist before the operations?
At least from the caching standpoint that would probably not make a difference, as the `clone` calls `memcpy`, which might, depending on the implementation, already move the data into cache.
I also think that this change should be fine, especially if you consider this as a best-case benchmark, i.e. the data is already in the appropriate caches. Considering the worst-case, i.e. at the beginning of each SpMV the data is not cached, is more difficult in general and would probably require more adjustments, especially wrt Tobias' comment.
```cpp
auto x_clone = clone(x);
for (auto _ : ic.run()) {
```
also here
benchmark/utils/general.hpp (Outdated)

```cpp
              "runtime is larger than 'min_runtime'");

DEFINE_double(min_runtime, 0.05,
              "If 'repetitions = auto' is used, the minimal runtime of"
```
"If 'repetitions = auto' is used, the minimal runtime of" | |
"If 'repetitions = auto' is used, the minimal runtime (seconds) of" |
benchmark/utils/general.hpp (Outdated)

````cpp
 * ```
 * auto timer = get_timer(...);
 * IterationControl ic(timer);
 * for(auto status: ic.[warmup_run|run](manage_timings [default is true])){
````
Is the manage_timings parameter also available for warmup_run?
I have not added the parameter there, as the warmup run always uses a fixed number of repetitions. I will clarify the documentation.
benchmark/utils/general.hpp (Outdated)

```cpp
 * Uses the commandline flags to setup the stopping criteria for the
 * warmup and timed run.
 *
 * @param timer the same timer that is to be used for the timings
```
```diff
- * @param timer the same timer that is to be used for the timings
+ * @param timer the same timer that is to be used for the timings
```
benchmark/utils/general.hpp (Outdated)

```cpp
run_control warmup_run()
{
    status_warmup_.cur_it = 0;
    status_warmup_.timer->clear();
```
```diff
     status_warmup_.timer->clear();
+    status_warmup_.timer->manage_timings = false;
```
I realize there's no change compared to the other function, so it is fine.
benchmark/utils/general.hpp (Outdated)

```cpp
// emulate shared_ptr behavior
const TimerManager *operator->() const { return this; }
TimerManager *operator->() { return this; }
```
what is it used for?
That was used for being lazy 😄 I will remove that, and adjust the rest accordingly.
benchmark/utils/general.hpp (Outdated)

```cpp
void tic()
{
    if (manage_timings) timer->tic();
```
```diff
-    if (manage_timings) timer->tic();
+    if (manage_timings) {
+        timer->tic();
+    }
```

Also apply to the next one, from the gko ref.
Is that part of the .clang-format specification? If not, perhaps it should be added there.
It is not in the current .clang-format; we only mention it in the contribution guidelines.
Does clang-format support this after version 6?
After a bit of digging, it seems like clang-format (up to 13) still does not support this, but there is a PR for it here: https://reviews.llvm.org/D95168
So it seems like some future version will support it.
There is also a workaround using clang-tidy, but that is overkill (https://stackoverflow.com/a/28437960).
Force-pushed from e398d62 to 74476ad (compare)
LGTM!
benchmark/utils/general.hpp (Outdated)

```cpp
cur_info->managed_timer.toc();
stopped = true;
next_timing =
    static_cast<IndexType>(std::ceil(next_timing * 1.5));
```
Maybe make the 1.5 controllable for extreme cases?
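The effect of that growth factor can be illustrated with a small sketch (hypothetical helper, not PR code): instead of checking the elapsed time after every repetition, the elapsed time is only checked at geometrically spaced repetition counts, so timer overhead shrinks as the run gets longer.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Repetition indices at which the elapsed time would be checked, grown
// geometrically by `growth` (must be > 1 so the sequence always advances).
inline std::vector<std::int64_t> timing_checkpoints(std::int64_t max_reps,
                                                    double growth = 1.5)
{
    std::vector<std::int64_t> checkpoints;
    double next = 1.0;
    while (static_cast<std::int64_t>(next) <= max_reps) {
        checkpoints.push_back(static_cast<std::int64_t>(next));
        // ceil guarantees strictly increasing integer checkpoints
        next = std::ceil(next * growth);
    }
    return checkpoints;
}
```

With the default 1.5 this checks at repetitions 1, 2, 3, 5, 8, 12, ...; making `growth` a flag (as suggested) would let extreme cases trade timing overhead against how far the run can overshoot the target.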
LGTM, only two comments about missing documentation.
```cpp
 * - 'warmup' warmup iterations, applies in fixed and adaptive case
 * - 'min_repetitions' minimal number of repetitions (adaptive case)
 * - 'max_repetitions' maximal number of repetitions (adaptive case)
 * - 'min_runtime' minimal total runtime (adaptive case)
```
This misses `repetition_growth_factor`.
benchmark/utils/general.hpp (Outdated)

```cpp
 * - `warmup_run()`: controls run defined by `warmup` flag
 * - `run(bool)`: controls run defined by all other flags
 * - `get_timer()`: access to underlying timer
 * Both methods return an object that is to be used in a range-based for loop:
```
This was not updated.
benchmark/utils/general.hpp (Outdated)

```cpp
 * - 'min_repetitions' minimal number of repetitions (adaptive case)
 * - 'max_repetitions' maximal number of repetitions (adaptive case)
 * - 'min_runtime' minimal total runtime (adaptive case)
 * - 'repetitions_growth_factor' controls the increase between two successive
```
nit:

```diff
- * - 'repetitions_growth_factor' controls the increase between two successive
+ * - 'repetition_growth_factor' controls the increase between two successive
```

or change the gflags name.
Needs a rebase before merge, and you should be able to see two pipelines running on GitLab after the next push.
This reworks the previous adaptive benchmarking. Now, the number of iterations is determined on-the-fly instead of beforehand. Also, new command-line flags have been added to allow for greater control over the adaptive benchmarks. The usage is similar to google benchmark. Currently, it is not possible to use this approach for the preconditioner benchmark, as it does not update the runtime in each iteration.
This changes the preconditioner benchmark to time each preconditioner apply/generate individually, unifying the timing approach across all benchmarks.
- adds more documentation
- minor formatting
- made `status` private s.t. it is not part of the public interface of `IterationControl`

Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Now the `run_control` object also controls taking the timings; the user does not need to issue the timings by hand anymore. This allows using increasingly larger intervals between two timings, until the benchmark run is finished. Drawback: everything within the `ic.run()` loop gets timed; parts that should be exempt need to be moved outside of the loop.
Internally, this uses a thin wrapper class for the `timer` object, which just skips the `tic`/`toc` calls if the `run_control` object does not manage the timings. In that case, the timings have to be issued outside, as before.
- clarify documentation
- add accessor to underlying timer
- formatting
- adds flag to choose repetitions growth factor

Co-authored-by: Yuhsiang Tsai <yhmtsai@gmail.com>
Co-authored-by: Tobias Ribizel <ribizel@kit.edu>
Co-authored-by: Terry Cojean <terry.cojean@kit.edu>
Force-pushed from 6b726f8 to 33ff686 (compare)
Kudos, SonarCloud Quality Gate passed!
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem which enables Intel-GPU and CPU execution. The only Ginkgo features which have not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly memory compression, among other features. The accessor can be used as header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default, this is done through a temporary copy which may have a performance impact but already allows mixed-precision research. Native mixed-precision ELL kernels are implemented which do not see this cost.

The accessor is also leveraged in a new CB-GMRES solver which allows for performance improvements by compressing the Krylov basis vectors. Many other features have been added to Ginkgo, such as reordering support, a new IDR solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.

Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels, also make use of a shared kernel implementation for all executors (except Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things. [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++. [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor). [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extends ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors. [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations. [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new "block column-major" accessor has been added. [#707](#707)
+ Add an heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks. [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management. [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)

Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO, an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation. [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)

Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup. [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup. [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC. [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)

Related PR: #857
Release 1.4.0 to master. Related PR: #866
This PR enables automatic deduction of the number of repetitions for the benchmarks.
For small working sets, the benchmark timings can be overly sensitive to outliers. With this PR, the number of repetitions is estimated such that the whole benchmark run takes at least 0.5 s, which should result in more stable timings for small problems.
If the repetitions are set to `auto`, the warm-up step is skipped.

WIP: This PR enables the new behavior only for the blas, conversion, and spmv benchmarks.
Todo: