Skip to content

Latest commit

 

History

History
79 lines (59 loc) · 5.69 KB

NewSystems.md

File metadata and controls

79 lines (59 loc) · 5.69 KB

CanteraCompilerPerformance on two additional systems

1) Performance of reaction rate calculation

Mean runtime over 20 runs for the reaction rate calculations from https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/reactionRates.cpp]. For a detailed description of the compile flags, see [https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/README.md]

System 1: Intel Xeon E5-2660 v4 (Broadwell)

Table 1. Runtime in seconds for calculating reaction rates on Intel Xeon E5-2660 v4.

compiler noOpt O2 strict default fastmathsafe fastmath full fullsafe
g++8 85.91 20.79 n.a. 20.33 19.71 17.23 16.80 19.21
g++9 86.88 20.74 n.a. 20.62 19.67 17.46 17.07 19.44
g++10.2 93.46 20.61 n.a. 20.32 19.41 17.16 16.77 19.14
g++10.3 92.34 20.39 n.a. 20.25 19.59 17.14 16.92 19.35
g++11.1 92.92 20.24 n.a. 20.23 19.40 17.22 16.77 18.83
g++11.2 92.33 20.44 n.a. 19.96 19.28 17.21 16.76 18.87
clang++9.0 90.57 20.48 n.a. 20.43 20.62 17.87 17.85 20.48
clang++10.0 90.51 20.55 n.a. 20.74 20.49 20.28 19.97 20.31
clang++11.0 90.29 20.60 n.a. 20.60 20.64 20.53 20.08 20.53
clang++12.0 90.57 20.34 n.a. 20.54 20.38 18.04 18.10 20.12
clang++13.0 90.66 20.42 n.a. 20.68 20.34 18.00 17.85 20.07
icpx2021.4 81.84 13.43 13.91 13.43 13.06 13.01 12.87 12.83
icpc19.0 109.35 13.05 13.32 12.96 13.02 12.91 12.57 12.67
icpc19.1 110.40 12.97 13.15 12.94 12.95 12.96 12.99 12.59
icpc21.4 109.60 12.84 13.19 13.02 12.90 12.99 12.74 12.64

System 2: Intel Xeon Gold 6230 (Cascade Lake)

Table 2. Runtime in seconds for calculating reaction rates on Intel Xeon Gold 6230.

compiler noOpt O2 strict default fastmathsafe fastmath full fullsafe
g++8 67.29 15.73 n.a. 15.74 15.03 13.06 12.58 14.49
g++9 67.73 15.69 n.a. 15.82 14.93 13.04 12.73 14.50
g++10.2 70.50 15.93 n.a. 15.59 15.03 12.94 12.66 14.54
g++10.3 70.87 15.85 n.a. 15.64 15.08 12.93 12.72 14.56
g++11.1 71.17 15.55 n.a. 15.57 15.00 13.08 12.72 14.37
g++11.2 70.89 15.69 n.a. 15.51 14.93 13.01 12.71 14.33
clang++9.0 69.27 15.85 n.a. 15.76 15.68 13.47 13.51 15.80
clang++10.0 69.15 15.84 n.a. 15.75 15.72 15.49 15.06 15.49
clang++11.0 69.32 15.73 n.a. 15.84 15.61 15.43 15.04 15.34
clang++12.0 68.76 15.67 n.a. 15.57 15.41 13.58 13.39 15.04
clang++13.0 68.27 15.64 n.a. 15.70 15.44 13.59 13.19 15.14
icpx2021.4 64.16 10.21 10.21 9.84 9.56 9.40 9.21 9.19
icpc19.0 82.49 9.52 9.60 9.51 9.46 9.40 9.47 9.38
icpc19.1 82.53 9.51 9.53 9.48 9.57 9.63 9.26 9.38
icpc21.4 83.01 9.48 9.54 9.34 9.48 9.62 9.40 9.34

2) Profiling

All results shown in this section have been obtained on the Intel Xeon E5-2660 v4 system. Measurements are done with perf.

gcc with fastmath vs. fasthmath + nofinitemath

The image below shows profiling of the reaction rate calculation with gcc 10.3 and O3+fastmath. About 39% of the total runtime is spent on calling __ieee754_exp_fma. Additionally, special versions of other functions (log10_finite and exp_finite) appear as well.

Figure 1: gcc 10.3 and O3+fastmath pic

The next image shows the profiling with gcc 10.3 and O3+fastmath+nofinitemath. In addition to the calls to __ieee754_exp_fma, 12% of the runtime are spent on calling __GI___exp, which seems to be a version of exp with additional error handling. Interestingly, log10_finite appears here as well.

Figure 2: gcc 10.3 and O3+fastmath+nofinitemath pic

Both programs yield the exact same bitwise results, but a performance gain of >10% can be observed when using fastmath without nofinitemath.

For completeness, here are the profiling results for icxp:

Figure 3: icxp 21.4 and O3+fp-model fast+fastmath pic

Intel vs. gcc

The Intel compiler finds more opportunities to optimize. For example, the picture below is the generated assembly for updateTemp from https://github.com/Cantera/cantera/blob/ad213c45a39eb0ba39b2f4e418518371d822cc11/src/kinetics/Falloff.cpp#L184-L188.

Figure 4: Optizations done by the Intel compiler pic

The two calls to the exponential function at the beginning are merged into a single call of a vectorized version of that function. Interestingly, pretty much the same assembly is generated for O3, O3+fp-model fast+fastmath and O3+fp-model fast+fastmath+nofinitemath. Gcc and clang, on the other hand, always generate machine code with three exponential function calls. See also here for a direct comparison: https://godbolt.org/z/zMhaEPdYM