Make Cantera compatible with -ffast-math #125
Hi @g3bk47 ... thanks again for setting up these tests. Could you clarify your comments on the convergence issues that you observed? @speth commented with separate testing in Cantera/cantera#1155 (for ignition delay) that ...
Is this consistent with your conclusions?
Hi, in my tests there was only one case where more aggressive optimization settings affected performance negatively, i.e. the 1D flame with very tight tolerances (while the 1D flame with more relaxed tolerances showed some speedup). However, the negative effect in the case of tight tolerances is quite extreme. Just to pick a few data points from Table 13 (https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/README.md, https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/oneD.cpp): running the 1D flame requires the following wall clock time:
So in the three test cases I have looked at (reaction rates, 0D, 1D), the impact of more aggressive compiler optimizations on performance was quite binary: either using fastmath led to some speedup of <= 15 %, or to a massive slowdown of 3x to 40x. I did not look into the solver output to see what the actual reason for that is, or whether compiling all external libraries with O3 and only Cantera with O3 and fastmath might fix the problem. With the data so far, my conclusion would be that enabling fastmath only makes sense if Cantera is used for problems that do not involve iterative solutions, which arguably excludes most use cases of Cantera. Of course, my tests so far were limited to just a handful of sample programs. Feel free to suggest any other test programs I could throw at the test suite, or let me know if I should look more into the cases with massive slowdown.
Thanks, @g3bk47! ... for my own part, I am planning to revise some of the instances where I recently introduced ...
Thanks for the extensive set of tests, @g3bk47. I tried replicating some of these results using your
This is using your
I ran these tests on a system with Xeon E5-2650 v4 (2.20GHz) CPUs, which are a bit older (2016 vintage, rather than your 2021 processors). That may provide some explanation of what's happening with the Intel compiler -- it may be able to generate code that uses some processor features that GCC hasn't been updated to use yet.
Thanks for the interesting results, @speth. I agree that the difference between ... Just a few additional thoughts on what might cause the differences in our results (apart from the different CPUs):
I will run my test suite again on another cluster. There, 15 different compilers/versions and two different compute nodes are available:
Maybe I can reproduce your results there or at least provide additional data points.
Yes, you're correct that the Intel compiler (and maybe others) can generate multiple code paths and select between them at run time based on the specific processor. The most infamous use of this has been to use less-optimal code paths when running on AMD processors. In this case, that behavior may be why there's so little impact from telling it to emit code for your specific processor rather than for the more backwards-compatible default, if it's able to use the more optimized path opportunistically. I did not recompile any other libraries that Cantera links to. However, for the code in question, there isn't much happening outside calls to the C++ standard library. The rate evaluations don't even use Eigen. For my system, ...
The libc on all clusters I have access to is actually older than yours (version 2.28). Another wild guess why our results differ might be that your CPU clocks down when it gets too hot, so that there is some kind of lower bound on performance? I mentioned Eigen because it appears here: https://github.com/Cantera/cantera/blob/main/include/cantera/kinetics/StoichManager.h#L617-L618. But I am not entirely sure if this is used in the sample program. I ran my test suite again on the two other systems. One of the systems uses an Intel Xeon E5-2660 v4 CPU, which sounds close to your setup. However, I again got pretty much the opposite of your results: using ... This time, I measured the code performance with a profiler and looked into the generated assembly (again, see https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/NewSystems.md#2-profiling for more details). To briefly sum up my first findings:
With these preliminary findings, my performance measurements sound plausible to me.
Ah, those profiling results are very interesting. I'm confused as to what's happening on my system -- when compiling with GCC and ...
PR Cantera/cantera#1330
Reposting here as it has come up elsewhere: someones-been-messing-with-my-subnormals (discussion about pitfalls of `-ffast-math`)
Abstract
After the recent discussion about compiling Cantera with `-ffast-math` (which is basically the default for the Intel compilers), I set up a benchmark suite to test the accuracy and computational performance of Cantera when using different optimization flags and compilers. The relevant discussions can be found here:
Cantera/cantera#1155
Cantera/cantera#1150
Cantera/cantera@9daebd9
Motivation/Results
I ran different sample programs (evaluation of reaction rates, 0D reactor, and 1D flame) with 16 different compilers/versions and 8 different optimization settings. The findings can be summarized as follows:

- `g++`/`clang++` do not yield the same results (bitwise) in general.
- For `g++`, `O2` generates slightly slower code compared to `O3`, but without affecting the results. `fastmath` increases performance by 10 % to 15 % for `g++`. Using `fastmath` together with `no-finite-math-only` increases performance by only 5 %. However, both options can drastically deteriorate convergence behavior and should therefore not be the default.
- For the Intel compilers, `fp-model strict` is slightly slower than `fp-model precise`, but the accuracy is the same in all test cases. `fastmath` together with `no-finite-math-only` produces slightly faster code and can be used together with Cantera; however, convergence might again deteriorate drastically. In general, the different optimization settings have much less effect for the Intel compilers than for `g++`/`clang++`.

From my tests above, the current defaults of Cantera seem to be the optimal compromise between performance and safety:

- `O3` for `g++`/`clang++`
- `O3 -fp-model precise` for the Intel compilers

Since `fastmath` without `no-finite-math-only` can improve the performance of `g++` for simple cases like the evaluation of reaction rates by 15 %, it would be nice for Cantera to be compatible with this option, e.g. for users coupling Cantera to other CFD codes. However, this means that the internal use of NaNs and Infs would have to be removed.

Let me know if you have any other interesting code snippets that should be benchmarked to aid the discussion.
References
For all details of my benchmark suite, please see: https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/README.md