
include -ftree-vectorize and -fno-math-errno in default compiler optimisation flags for GCC #2388

Merged
13 commits merged on Mar 31, 2018

Conversation

bartoldeman
Contributor

Since Intel auto-vectorizes at -O2 but not at -O1, and GCC
auto-vectorizes at -O3 but not at -O2, this makes the two compilers consistent.

damianam
damianam previously approved these changes Jan 31, 2018
Member

@damianam left a comment

LGTM, and a sensible thing to do. But I won't merge it yet; I'd like to check with Kenneth first, since this has the potential to break some builds (in theory, at least).

@damianam
Member

Oh, it breaks here: https://github.com/easybuilders/easybuild-framework/pull/2388/files#diff-3177e89e1afa2a8305974f5995c31c29R284

Simply because it checks that -v is not present in -O2 -ftree-vectorize -march=native, which is of course now false, because -v is a substring of -ftree-vectorize. We need to change this test as well.
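The failing check is a plain substring test, which a one-liner reproduces (a minimal sketch; the actual test lives in the framework's toolchain test suite):

```python
flags = "-O2 -ftree-vectorize -march=native"
# "-v" is found as a substring of "-ftree-vectorize", so a naive
# 'assert "-v" not in flags' style check now fails
print("-v" in flags)  # True
```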

Since Intel auto-vectorizes at -O2 but not at -O1, and GCC
auto-vectorizes at -O3 but not at -O2, this makes it consistent.
This avoids "-v" being matched inside "-ftree-vectorize".
Since -ftree-vectorize is now present by default, it confuses the test case.
@easybuilders easybuilders deleted a comment from boegelbot Feb 9, 2018
@easybuilders easybuilders deleted a comment from boegelbot Feb 9, 2018
@boegel
Member

boegel commented Feb 13, 2018

@bartoldeman Do you have any idea of the potential negative impact this may have, both in terms of performance and floating-point accuracy of the resulting binaries (or maybe even build problems)?

@boegel boegel added this to the 3.x milestone Feb 13, 2018
@boegel boegel added the change label Feb 13, 2018
@bartoldeman
Contributor Author

I've applied this at Compute Canada. So far we have seen no issues, but I can monitor it a bit.
Alternatively, this could perhaps be a config setting?

@victorusu
Contributor

victorusu commented Feb 16, 2018

The n-body Fortran example implemented here is a good test case for this flag. It can help improve performance, at least for well-written Fortran code.

The code is written in such a way that the compiler can auto-vectorize the double loop.

    m = 1
    do i = 1, nb
       do j = i + 1, nb
          v(:,i) = v(:,i) - r(:,m) * mass(j) * mag(m)
          v(:,j) = v(:,j) + r(:,m) * mass(i) * mag(m)
          m = m + 1
       end do
    end do

Results

Running this code on my laptop with an EasyBuild-built GCC 7.2.0, I get:

Using the current default EasyBuild flags (-O2 -march=native)

gfortran -O2 -march=native main.f90 -o main-f90.x
time ./main-f90.x 50000000
-0.169075164
-0.169059907

real	0m5.624s
user	0m5.561s
sys	0m0.017s

Adding -ftree-vectorize to the current default EasyBuild flags (-O2 -march=native -ftree-vectorize)

gfortran -O2 -march=native -ftree-vectorize main.f90 -o main-f90.x
time ./main-f90.x 50000000
-0.169075164
-0.169059907

real	0m4.265s
user	0m4.225s
sys	0m0.013s

Compiling with -O3 level (-O3 -march=native)

gfortran -O3 -march=native main.f90 -o main-f90.x
time ./main-f90.x 50000000
-0.169075164
-0.169059907

real	0m4.294s
user	0m4.256s
sys	0m0.011s

gcc details

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/apps/macos/software/Core/GCCcore/7.2.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.2.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --enable-languages=c,c++,fortran --enable-checking=release --disable-multilib --enable-shared=yes --enable-static=yes --enable-threads=posix --enable-gold=default --enable-plugins --enable-ld --with-plugin-ld=ld.gold --enable-bootstrap --prefix=/opt/apps/macos/software/Core/GCCcore/7.2.0 --with-local-prefix=/opt/apps/macos/software/Core/GCCcore/7.2.0
Thread model: posix
gcc version 7.2.0 (GCC) 

@bartoldeman
Contributor Author

@boegel GCC is conservative by default (vs. Intel, which is loose by default, although EB's -fp-model source makes it much more precise).

As far as I understand, -ftree-vectorize without -ffast-math (or one of its sub-options: -funsafe-math-optimizations, -fassociative-math, etc.) will not cause any changes in FP results; that is, vectorized sum reductions are not enabled.

@Helios07

Helios07 commented Mar 1, 2018

I made some comparisons with a code that vectorizes quite well (Newtonian interaction of a few thousand particles). I tested on a dual-socket Xeon Skylake 6140 machine (36 cores, AVX2). Intel 2018a and GNU 6.4 were used.

icpc, -O3 -xHost: 10.6761s
icpc, -O3: 24.8094s
g++: -O2 -march=native: 89.756s
g++: -O2 -march=native -ftree-vectorize: 88.8382s
g++: -O3 -march=native -ftree-vectorize: 88.2167s
g++: -Ofast -march=native -ftree-vectorize: 13.4951s

It seems like some other flag is needed for the vectorization.

@victorusu
Contributor

According to the Auto-vectorization in GCC guide,

Vectorization is enabled by the flag -ftree-vectorize and by default at -O3.

@Helios07

Helios07 commented Mar 2, 2018

You are right, vectorization is enabled with -O3. But in this case, it does not explain the difference between the times measured with GNU and Intel. GNU seems to be more precise, but slower, unless the flag -ffast-math is used.
I repeated my measurement with slightly bigger parameters (they are not comparable to those above, but hopefully a bit more reliable) and added -ffast-math explicitly in one measurement:

icpc, -O3 -xHost: 21.4103s
icpc, -O3: 36.4158s
g++: -O2 -march=native: 119.651s
g++: -O2 -march=native -ftree-vectorize: 117.14s
g++: -O3 -march=native -ftree-vectorize: 119.059s
g++: -Ofast -march=native -ftree-vectorize: 23.0479s
g++: -O3 -march=native -ftree-vectorize -ffast-math: 25.0692s
g++: -O2 -march=native -ffast-math: 24.8151s

@ocaisa
Member

ocaisa commented Mar 2, 2018

@Helios07 What you've done is helpful and makes a lot of sense: if you're not allowed to reorder floating-point operations, then vectorisation is going to be difficult.

From GCC (7.2):

-ffast-math
  Sets the options -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans, -fcx-limited-range and -fexcess-precision=fast.

Do people think it is worthwhile cherry-picking from this set to be more conservative? Is -fassociative-math enough?

@bartoldeman
Contributor Author

bartoldeman commented Mar 2, 2018

@Helios07 that is a good case for the 'loose' and 'veryloose' precision settings.
Please try this too (the EasyBuild default for icpc):
icpc -O2 -ftz -fp-speculation=safe -fp-model source
(with 'loose' you get -fp-model fast=1, which is actually the default for icpc if you don't give it any options, and 'veryloose' gives -fp-model fast=2)

I think the GCC 'loose' and 'veryloose' settings should get some or all of the -ffast-math options, and not just -mrecip -mno-ieee-fp, but that needs some further testing.

What do you get with:
g++ -O2 -ftree-vectorize -mrecip -mno-ieee-fp
(present 'loose') and
g++ -O2 -ftree-vectorize -mrecip=all -mno-ieee-fp
(present 'veryloose')

Note that -fassociative-math allows the compiler to change the FP evaluation of a+b+c from (a+b)+c to a+(b+c), and the more aggressive options will even disregard parentheses put there by programmers (i.e. a+(b+c) => (a+b)+c), at least in Fortran (not sure about C/C++).
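This reassociation is not value-preserving in IEEE 754 arithmetic, which is why -fassociative-math can change results. A quick double-precision check (Python floats are IEEE 754 doubles, the same width the compilers operate on here) illustrates the point:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

# Reassociating the sum changes the rounded result, which is why a compiler
# honouring IEEE semantics will not reorder reductions without being told to.
print(left == right)  # False
```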

@Helios07

Helios07 commented Mar 2, 2018

Here you are:

icpc -O2 -ftz -fp-speculation=safe -fp-model source: 30.9678s
icpc -O2 -ftz -fp-speculation=safe -fp-model fast=1: 36.5549s
icpc -O2 -ftz -fp-speculation=safe -fp-model fast=2: 37.3925s
g++ -O2 -ftree-vectorize -mrecip -mno-ieee-fp: 118.231s
g++ -O2 -ftree-vectorize -mrecip=all -mno-ieee-fp: 118.291s

If you want to try this by yourself, I am using a slightly modified version of this code:
https://github.com/ryancoleman/lotsofcoresbook1code/tree/master/Pearls1_Chapter10/Pearl

@bartoldeman
Contributor Author

Thanks. However, I forgot a few things; to compare properly we need this:

icpc -xHost -O2 -ftz -fp-speculation=safe -fp-model source
icpc -xHost -O2 -fp-model fast=1
icpc -xHost -O2 -fp-model fast=2
g++ -march=native -O2 -ftree-vectorize -mrecip -mno-ieee-fp
g++ -march=native -O2 -ftree-vectorize -mrecip=all -mno-ieee-fp

@Helios07

Helios07 commented Mar 4, 2018

No problem.
icpc -xHost -O2 -ftz -fp-speculation=safe -fp-model source: 21.5741s
icpc -xHost -O2 -fp-model fast=1: 22.2876s
icpc -xHost -O2 -fp-model fast=2: 21.3196s
g++ -march=native -O2 -ftree-vectorize -mrecip -mno-ieee-fp: 117.894s
g++ -march=native -O2 -ftree-vectorize -mrecip=all -mno-ieee-fp: 118.806s

@Helios07

Helios07 commented Mar 6, 2018

@ocaisa I tried some cherry picking:

g++ -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno -funsafe-math-optimizations -ffinite-math-only -fno-rounding-math -fno-signaling-nans -fcx-limited-range (=-ffast-math according to gcc 6.4 which I used): 24.3423s
g++ -fopenmp -O2 -ftree-vectorize -march=native: 118.356s
g++ -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno: 92.9929s
g++ -fopenmp -O2 -ftree-vectorize -march=native -funsafe-math-optimizations: 118.294s
g++ -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno -funsafe-math-optimizations: 25.3898s
g++ -fopenmp -O2 -ftree-vectorize -march=native -funsafe-math-optimizations -ffinite-math-only -fno-rounding-math -fno-signaling-nans -fcx-limited-range: 119.785s
g++ -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno -ffinite-math-only -fno-rounding-math -fno-signaling-nans -fcx-limited-range: 93.9554s

It seems to me that the combination of the flags -fno-math-errno -funsafe-math-optimizations is the key here. Unfortunately, I have no idea how dangerous they are.

@bartoldeman
Contributor Author

The GCC manual states that -funsafe-math-optimizations enables '-fno-signed-zeros', '-fno-trapping-math', '-fassociative-math' and '-freciprocal-math', so perhaps one of those four already does the trick?
I'll have a look at the code myself too.

@bartoldeman
Contributor Author

-fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math is the minimal set that does the trick for me with GCC. (The GCC manual states that -fno-signed-zeros and -fno-trapping-math are needed for -fassociative-math.)

@bartoldeman
Contributor Author

bartoldeman commented Mar 6, 2018

The full story is as follows:

  • the code is somewhat special since it uses #pragma simd, which Intel understands as forcing vectorization. Nowadays we should use #pragma omp simd.

Now if you do NOT use #pragma omp simd or #pragma simd (Intel only)
icc -xHost -O2 -ftz -fp-speculation=safe -fp-model source
and
gcc -fopenmp -O2 -ftree-vectorize -march=native -fno-math-errno
will both NOT vectorize: you will need to use -fp-model fast (icc) or at least
-fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math for GCC.

If you DO use the following loop with #pragma omp simd

#pragma omp simd reduction(+:pi,axi,ayi,azi)
      for (int j=0; j<N; j++) {
        float dx = x[j] - xi;
        float dy = y[j] - yi;
        float dz = z[j] - zi;
        float R2 = dx * dx + dy * dy + dz * dz + EPS2;
        float invR = 1.0f / sqrtf(R2);
        float invR3 = m[j] * invR * invR * invR;
        pi += m[j] * invR;
        axi += dx * invR3;
        ayi += dy * invR3;
        azi += dz * invR3;
      }

then both GCC and Intel vectorize with the above options (no need for -fassociative-math etc.), however Intel does a better job (better than with the auto-vectorizer, in fact!). GCC gets to full speed with -funsafe-math-optimizations, but for some reason not with the other sub-options.

Anyway:

  • it would be good to consider -fno-math-errno in default GCC options, as icc does not enable it by default.
  • loose precision for gcc should at least include -fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math then; veryloose could use -ffast-math
  • the rest is all tuning of "omp simd"

@boegel boegel modified the milestones: 3.x, 3.6.0 Mar 14, 2018
@boegel
Member

boegel commented Mar 14, 2018

@bartoldeman As discussed during the EasyBuild conf call today:

  • Can you enhance this to also use -fno-math-errno by default, since that doesn't impact floating-point accuracy? Once you have, I'll look into rebuilding CP2K 5.1 with foss/2018a on top of this on Haswell & Skylake.

  • Do we need to take similar measures for Clang & PGI, or do they already vectorise by default at -O2? (cc @victorusu, @geimer)

  • For changing the loose and/or veryloose toolchain options, please follow up in a separate PR

@bartoldeman
Contributor Author

Thanks. Just wanted clarification so I can reproduce.

Now the only difference with your setup (if you used unmodified easyconfigs for foss-2018a) is probably that I had to build everything up to GCCcore on a Broadwell node, and only then switched to a Skylake node (which has no system gcc installed). I would not expect that where you build GCCcore makes a difference but you never know.

By the way, the one and only WRONG result I get is always Fist/regtest-3/2d_pot.inp.out.

@boegel
Member

boegel commented Mar 22, 2018

@bartoldeman Same WRONG test for me when only a single test fails.

13 tests that fail on Skylake (when only rebuilding CP2K with -ftree-vectorize):

  • Fist/regtest-3/2d_pot.inp.out
  • TMC/regtest/TMC_walltime.inp.out
  • TMC/regtest/TMC_1pot_H2ONH4.inp.out
  • TMC/regtest/TMC_prot_reorder.inp.out
  • TMC/regtest/TMC_atom_swap_test.inp.out
  • TMC/regtest/TMC_PT.inp.out
  • TMC/regtest/TMC_NPT.inp.out
  • TMC/regtest/TMC_NPT_2pot.inp.out
  • TMC/regtest/TMC_NPT_2pot_2.inp.out
  • TMC/regtest/TMC_NPT_2pot_PT.inp.out
  • TMC/regtest/TMC_test_restart_0.inp.out
  • TMC/regtest/TMC_test_restart_1.inp.out
  • TMC/regtest_ana_post_proc/TMC_ana_density.inp.out

For example (other are pretty similar):

/tmp/CP2K/5.1/foss-2018a/TEST-Linux-x86-64-foss-popt-2018-03-22_10-54-18/TMC/regtest/TMC_test_restart_1.inp.out :
 7605  Total energy: : ref = 0.55218050338752289 new = -999
 7606  relative error :   1.00055273e+00 >  numerical tolerance = 1.0E-14

I'll now try rebuilding the dependencies first with -ftree-vectorize, and then rebuilding CP2K on top of that...

@boegel
Member

boegel commented Mar 22, 2018

@bartoldeman Same 13 tests still fail after rebuilding Libint, libxc, libxsmm, FFTW, PLUMED and then CP2K on top of it...

I'll now rebuild the whole shebang, let's see if that helps...

@bartoldeman
Contributor Author

I'll have a look at those tests... the -999 looks rather suspicious.

@bartoldeman
Contributor Author

@boegel when I grep for -999 in cp2k I find this:

./cp2k-5.1/src/tmc/tmc_setup.F:            WRITE (output_unit, *) "TMC|NOTenoughProcessorsX= -999"
./cp2k-5.1/src/tmc/tmc_setup.F:            WRITE (output_unit, *) "TMC|NOTcalculatedTotal energy: -999"
./cp2k-5.1/src/tmc/tmc_setup.F:            WRITE (output_unit, *) "TMC|ANAtestOutputInitX= -999"

I wonder if the tests on the Skylake node have enough processors available. Could you have a look at your core pinning there?

@boegel
Member

boegel commented Mar 23, 2018

@bartoldeman Turns out I was re-testing CP2K in an environment with just a single core available, while the tests that were failing require at least 2 cores to be available...

I tested again on a full node with proper core pinning, and then I also saw just a single failing test (the one that fails consistently).

Conclusion: including -ftree-vectorize indeed doesn't do any harm to the CP2K test suite, which serves as a good canary imho.

@boegel
Member

boegel commented Mar 23, 2018

@bartoldeman Some minor tweaks in ComputeCanada#4, with that included this is good to go imho.

use DEFAULT_OPT_LEVEL constant + fix minor style issues
@bartoldeman
Contributor Author

Thanks, at least that mystery is solved then!

We can still disable vectorization by setting 'lowopt' -- do you still also want the vectorize toolchain option?

@boegel
Member

boegel commented Mar 23, 2018

@bartoldeman Ah, yes, thanks for reminding me... Having a toolchain option to selectively disable -ftree-vectorize would indeed make a lot of sense.

Disabling -ftree-vectorize at lowopt makes sense too.

@bartoldeman
Contributor Author

bartoldeman commented Mar 23, 2018

@boegel I am looking at the 'vectorize' toolchainopt but having trouble. I certainly won't be able to do it in an hour on a Friday afternoon.

If I put it in COMPILER_UNIQUE_OPTION_MAP, I can't map both a "True" and a "False" option without digging much more deeply, right? I.e.,

'vectorize': ['ftree-vectorize', 'fno-tree-vectorize'],

will not work. What is the best way here: adding a 'novectorize' option too? Or just doing it in code in _set_compiler_toolchainoptions for every compiler?

@bartoldeman
Contributor Author

One way I can think of is to use

'vectorize': ['ftree-vectorize'],
'novectorize': ['fno-tree-vectorize'],

in the mapping, but only expose 'vectorize' as a valid toolchain option. Then some code can set 'novectorize' to True if 'vectorize' == False.

If set to True or False, this will explicitly enable or disable
auto-vectorization. Otherwise (if left as None), noopt and lowopt
will not vectorize and other optimization levels will vectorize.
Modifying in place changes the original optflags, which messes up
further tests in the toolchain tests.
@@ -88,6 +88,8 @@ class Compiler(Toolchain):
'static': (False, "Build static library"),
'32bit': (False, "Compile 32bit target"), # LA, FFTW
'openmp': (False, "Enable OpenMP"),
'vectorize': (None, "Enable compiler auto-vectorization, default except for noopt and lowopt"),
'novectorize': (None, "Disable compiler auto-vectorization, default for noopt and lowopt"), # not set, only used to map
Member

@bartoldeman Why do we need both? Can't we detect True or False (vs None) for vectorize instead?

Contributor Author

The problem is in the mappings in gcc.py etc. I cannot map 'vectorize' to '-ftree-vectorize' for True and to '-fno-tree-vectorize' for False, unless those mappings get redesigned, of course.

Contributor Author

Hmm. I could use ['fno-tree-vectorize', 'ftree-vectorize'] with a bit of code to select the first or second. It's just inconsistent with the current use. Maybe use a dict to distinguish:
{False: 'fno-tree-vectorize', True: 'ftree-vectorize'}
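The dict idea can be sketched like this (a minimal model; the names and the surrounding option machinery are simplified stand-ins for the real framework classes, not the actual easybuild-framework code):

```python
# Hypothetical stand-in for an entry in COMPILER_UNIQUE_OPTION_MAP:
# a dict distinguishes the flag used for each boolean setting.
VECTORIZE_MAP = {False: 'fno-tree-vectorize', True: 'ftree-vectorize'}

def vectorize_flags(setting):
    """Tri-state: True/False pick a flag explicitly, None defers to the -O level."""
    if setting is None:
        return []
    return ['-' + VECTORIZE_MAP[setting]]

print(vectorize_flags(True))   # ['-ftree-vectorize']
print(vectorize_flags(False))  # ['-fno-tree-vectorize']
print(vectorize_flags(None))   # []
```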

@easybuilders easybuilders deleted a comment from boegelbot Mar 30, 2018
@@ -246,7 +247,17 @@ def _set_compiler_flags(self):

# 1st one is the one to use. add default at the end so len is at least 1
optflags = [self.options.option(x) for x in self.COMPILER_OPT_FLAGS if self.options.get(x, False)] + \
[self.options.option(default_opt_level)]
[self.options.option(default_opt_level)][:1]
Member

@bartoldeman To be correct, I think this should be:

optflags = ([...] + [...])[:1]

vectflags = self.options.option('vectorize')[self.options['vectorize']]
# avoid double use of such flags
if isinstance(optflags[0], list):
optflags[0] = [flag for flag in optflags[0] if flag not in self.options.option('vectorize').values()]
Member

flag not in vectflags?
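The intent of the snippet under review can be modelled in isolation (flag values are illustrative; the real code pulls them from the toolchain options object):

```python
# Simplified model of the optflags/vectflags interplay discussed above.
VECTORIZE_MAP = {False: 'fno-tree-vectorize', True: 'ftree-vectorize'}

def apply_vectorize(opt_level_flags, setting):
    # Drop any vectorization flag already baked into the optimization-level
    # flags. Filtering against all map values (rather than only the selected
    # flag, as "flag not in vectflags" would) also removes the opposite
    # setting if it happens to be present.
    cleaned = [f for f in opt_level_flags if f not in VECTORIZE_MAP.values()]
    if setting is not None:
        cleaned.append(VECTORIZE_MAP[setting])
    return cleaned

print(apply_vectorize(['O2', 'ftree-vectorize', 'march=native'], False))
# ['O2', 'march=native', 'fno-tree-vectorize']
```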

@boegel
Member

boegel commented Mar 31, 2018

lgtm, and works as expected after a quick test, so good to go

I'll try and do a partial regtest (foss/2017b + foss/2018a easyconfigs) on top of this to see if this causes any trouble...

@boegel
Member

boegel commented Jun 15, 2018

First example of this biting us in the ass: easybuilders/easybuild-easyconfigs#6461.

@akesandgren
Contributor

And the next one is GROMACS/2021 with foss/2020a getting numerical errors due to -fno-math-errno.

-fno-math-errno should NOT be turned on by default since it has numerical impact.

https://gromacs.bioexcel.eu/t/test-failure-in-fourcenter-tests-in-gromacs-2021/1633
