include -ftree-vectorize and -fno-math-errno in default compiler optimisation flags for GCC #2388
Conversation
lgtm. And a sensible thing to do. But I won't merge it yet, I'd like to check with Kenneth first, since this has the potential to break some builds (in theory at least).
Oh, it breaks here: https://github.com/easybuilders/easybuild-framework/pull/2388/files#diff-3177e89e1afa2a8305974f5995c31c29R284 Simply because it checks that
Since Intel auto-vectorizes at -O2 but not at -O1, and GCC auto-vectorizes at -O3 but not at -O2 this makes it consistent.
This avoids matching "-v" inside "-ftree-vectorize".
Since -ftree-vectorize is now present by default, it confuses the test case.
Force-pushed from 7a8f6a0 to 2d44ee4.
@bartoldeman Do you have any idea of the potential negative impact this may have, both in terms of performance and floating-point accuracy of the resulting binaries (or maybe even build problems)?
I've applied this at Compute Canada. So far we have seen no issues but I can monitor a bit.
The n-body Fortran example implemented here is a good case for this flag. It can help improve performance, at least for well-written Fortran codes. The code is written in such a way that the compiler can auto-vectorize the double loop.
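The benchmark source itself isn't shown in this thread, but the loop shape being discussed can be sketched. Below is a minimal Python rendering (not the actual Fortran benchmark; the function name and 2-D setup are illustrative only) of the kind of pairwise-interaction double loop that a compiler can auto-vectorize when written in C or Fortran with -ftree-vectorize:

```python
# Sketch of an n-body pairwise-interaction double loop (hypothetical, 2-D,
# unit gravitational constant). The inner loop over j has no cross-iteration
# dependencies, which is what makes the equivalent C/Fortran loop a good
# auto-vectorization candidate.

def accelerations(pos, mass):
    """Naive O(n^2) Newtonian accelerations; pos is a list of (x, y) tuples."""
    n = len(pos)
    acc = [(0.0, 0.0)] * n
    for i in range(n):
        ax = ay = 0.0
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy
            # the pow/sqrt here is also where -fno-math-errno matters in
            # C/Fortran: errno bookkeeping blocks vectorizing libm calls
            inv_r3 = 1.0 / (r2 ** 1.5)
            ax += mass[j] * dx * inv_r3
            ay += mass[j] * dy * inv_r3
        acc[i] = (ax, ay)
    return acc
```

Note also that the per-particle accumulation into ax/ay is a sum reduction, which is exactly the construct GCC will not vectorize without -fassociative-math.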
Results: Running this code on my laptop with an EasyBuilt version of GCC 7.2.0, I get: Using the current default EasyBuild flags (
@boegel GCC is conservative by default (vs. Intel, which is loose by default, but EB's -fp-model source makes it much more precise). As far as I understand, -ftree-vectorize without -ffast-math (or one of its sub-options: -funsafe-math-optimizations, -fassociative-math, etc.) will not cause any changes in FP results; that is, vectorized sum reductions are not enabled.
I made some comparisons with a code that vectorizes quite well (Newtonian interaction of some thousand particles). I tested on a Xeon Skylake 6140 dual-socket machine (36 cores, AVX2). Intel 2018a and GNU 6.4 were used.
It seems like some other flag is needed for the vectorization.
According to the Auto-vectorization in GCC guide,
You are right, vectorization is enabled with -O3. But in this case, it does not explain the difference between the times measured with GNU and Intel. GNU seems to be more precise, but slower, unless the flag -ffast-math is used.
@Helios07 What you've done is helpful and makes a lot of sense: if you're not allowed to reorder multiplication, then vectorisation is going to be difficult. From GCC (7.2):
Do people think it is worthwhile cherry-picking from this set to be more conservative? Is
@Helios07 that is a good case for the 'loose' and 'veryloose' precision settings. I think the gcc 'loose' and 'veryloose' settings should get some/all of the -ffast-math options, and not just -mrecip -mno-ieee-fp, but that needs some further testing. What do you get with? Note that -fassociative-math allows changing the FP evaluation of a+b+c from (a+b)+c to a+(b+c), and the more aggressive options will even disregard parentheses put there by programmers (i.e. a+(b+c) => (a+b)+c), at least in Fortran (not sure about C/C++).
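The reassociation point can be demonstrated directly. Python floats are IEEE-754 doubles, so the same non-associativity that -fassociative-math allows GCC to exploit is visible here (this is a generic illustration, not taken from the thread's benchmark):

```python
# IEEE-754 doubles are not associative: -fassociative-math lets the compiler
# rewrite (a + b) + c as a + (b + c), which can change the result.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # (1e16 - 1e16) + 1.0 = 1.0
right = a + (b + c)  # -1e16 + 1.0 rounds back to -1e16, so the 1.0 is lost

print(left, right)   # 1.0 0.0
```

This is why vectorized sum reductions (which reassociate the accumulation order across vector lanes) are off by default: they can silently change FP results.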
Here you are:
If you want to try this by yourself, I am using a slightly modified version of this code:
Thanks. However, I forgot a few things. To properly compare we need this: icpc -xHost -O2 -ftz -fp-speculation=safe -fp-model source
No problem.
@ocaisa I tried some cherry-picking:
It seems to me like the combination of the flags -fno-math-errno -funsafe-math-optimizations is the key here. Unfortunately, I have no idea how dangerous they are.
The gcc manual states that -funsafe-math-optimizations enables '-fno-signed-zeros', '-fno-trapping-math', '-fassociative-math' and '-freciprocal-math', so perhaps one of those 4 already does the trick?
-fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math is the minimal set that does the trick for me with GCC. (The GCC manual states that -fno-signed-zeros -fno-trapping-math is needed for -fassociative-math.)
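One reason -fassociative-math needs -fno-signed-zeros can be seen with plain doubles: algebraic rewrites like x + 0.0 -> x are only valid if signed zeros may be ignored. A small illustration (again a generic IEEE-754 fact, not from the benchmark):

```python
import math

# IEEE-754 says -0.0 + 0.0 == +0.0, so the rewrite "x + 0.0 -> x"
# changes the sign of zero when x is -0.0. That is why GCC refuses
# -fassociative-math unless -fno-signed-zeros is also in effect.
x = -0.0
print(math.copysign(1.0, x))        # -1.0: x is negative zero
print(math.copysign(1.0, x + 0.0))  #  1.0: adding +0.0 flipped the sign
```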
The full story is as follows:
Now if you do NOT use If you DO use the following loop with
then both GCC and Intel vectorize with the above options (no need for -fassociative-math etc.), however Intel does a better job (better than with the auto-vectorizer, in fact!). GCC gets to full speed with -funsafe-math-optimizations but, for some reason, not with the other options. Anyway:
@bartoldeman As discussed during the EasyBuild conf call today:
Thanks. Just wanted clarification to reproduce. Now the only difference with your setup (if you used unmodified easyconfigs for foss-2018a) is probably that I had to build everything up to GCCcore on a Broadwell node, and only then switched to a Skylake node (which has no system gcc installed). I would not expect that where you build GCCcore makes a difference, but you never know. By the way, the one and only WRONG result I get is always
@bartoldeman Same 13 tests that fail on Skylake (when only rebuilding CP2K with
For example (others are pretty similar):
I'll now try rebuilding the dependencies first with
@bartoldeman Same 13 tests still fail after rebuilding
I'll now rebuild the whole shebang, let's see if that helps...
I'll have a look at those tests... the -999 looks rather suspicious.
@boegel when I grep for -999 in cp2k I find this:
I wonder if the tests on the skylake node have enough processors available. Could you have a look at your core-pinning there?
@bartoldeman Turns out I was re-testing CP2K in an environment with just a single core available, while the tests that were failing require at least 2 cores to be available... I tested again on a full node with proper core pinning, and then I also saw just a single failing test (the one that fails consistently). Conclusion: including
@bartoldeman Some minor tweaks in ComputeCanada#4, with that included this is good to go imho.
use DEFAULT_OPT_LEVEL constant + fix minor style issues
Thanks, at least that mystery is solved then! We can still disable vectorization by setting 'lowopt' -- do you still also want the vectorize toolchain option?
@bartoldeman Ah, yes, thanks for reminding me... Having a toolchain option to selectively disable
Disabling
@boegel I am looking at the 'vectorize' toolchainopt but having trouble. Certainly won't be able to do it in an hour on a Friday afternoon. If I put it in
will not work. What is the best way here, adding a 'novectorize' too? Or just doing it in code in
One way I can think of is to use
in the mapping, but only expose 'vectorize' as a valid toolchain opt. Then some code can set 'novectorize' to True if 'vectorize' == False.
If set to True or False, this will explicitly enable or disable auto-vectorization. Otherwise (if left as None), noopt and lowopt will not vectorize and other optimization levels will vectorize.
Modifying it in place changes the original optflags, which messes up further checks in the toolchain tests.
@@ -88,6 +88,8 @@ class Compiler(Toolchain):
     'static': (False, "Build static library"),
     '32bit': (False, "Compile 32bit target"),  # LA, FFTW
     'openmp': (False, "Enable OpenMP"),
+    'vectorize': (None, "Enable compiler auto-vectorization, default except for noopt and lowopt"),
+    'novectorize': (None, "Disable compiler auto-vectorization, default for noopt and lowopt"),  # not set, only used to map
@bartoldeman Why do we need both? Can't we detect True or False (vs None) for vectorize instead?
the problem is in the mappings in gcc.py etc. I cannot map 'vectorize' to '-ftree-vectorize' for True and to '-fno-tree-vectorize' for False, unless those mappings get redesigned of course.
Hmm. I could use ['fno-tree-vectorize', 'ftree-vectorize'] with a bit of code to select first or second. It's just inconsistent with the current use. Maybe use a dict to distinguish: {False: 'fno-tree-vectorize', True: 'ftree-vectorize'}. This avoids the need for 'novectorize' and makes for more compact logic.
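The dict-based mapping can be sketched as a small standalone function. This is a hypothetical illustration of the logic being proposed, not the actual easybuild-framework code; the function name, signature, and opt-level names other than noopt/lowopt are assumptions:

```python
# Hypothetical sketch of resolving the tri-state 'vectorize' toolchain option
# via the proposed dict mapping; real easybuild-framework names/call sites differ.
VECTORIZE_MAP = {False: 'fno-tree-vectorize', True: 'ftree-vectorize'}

def vectorize_flag(vectorize, opt_level='defaultopt'):
    """Resolve the tri-state 'vectorize' option to a GCC flag.

    None means: follow the optimization level, i.e. noopt/lowopt
    disable vectorization, higher levels enable it.
    """
    if vectorize is None:
        vectorize = opt_level not in ('noopt', 'lowopt')
    return '-' + VECTORIZE_MAP[vectorize]

print(vectorize_flag(None))            # -ftree-vectorize
print(vectorize_flag(None, 'lowopt'))  # -fno-tree-vectorize
print(vectorize_flag(False))           # -fno-tree-vectorize
```

A dict keyed on True/False keeps a single exposed 'vectorize' option while still producing both the enabling and disabling flag, which is the compactness argument made above.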
@@ -246,7 +247,17 @@ def _set_compiler_flags(self):
     # 1st one is the one to use. add default at the end so len is at least 1
     optflags = [self.options.option(x) for x in self.COMPILER_OPT_FLAGS if self.options.get(x, False)] + \
-        [self.options.option(default_opt_level)]
+        [self.options.option(default_opt_level)][:1]
@bartoldeman To be correct, I think this should be:
optflags = ([...] + [...])[:1]
+        vectflags = self.options.option('vectorize')[self.options['vectorize']]
+        # avoid double use of such flags
+        if isinstance(optflags[0], list):
+            optflags[0] = [flag for flag in optflags[0] if flag not in self.options.option('vectorize').values()]
flag not in vectflags?
Addresses @boegel's review.
lgtm, and works as expected after a quick test, so good to go. I'll try and do a partial regtest (
First example of this biting us in the ass: easybuilders/easybuild-easyconfigs#6461 .
And the next one is GROMACS/2021 with foss/2020a getting numerical errors due to -fno-math-errno. -fno-math-errno should NOT be turned on by default, since it has numerical impact. https://gromacs.bioexcel.eu/t/test-failure-in-fourcenter-tests-in-gromacs-2021/1633