Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GNU Assembler bug workaround #196

Merged
merged 9 commits into from
Oct 7, 2019
Merged

GNU Assembler bug workaround #196

merged 9 commits into from
Oct 7, 2019

Conversation

lgarrison
Copy link
Collaborator

This fixes #193 by querying the compiler to see if it is using a known-bad version of the GNU Assembler (GAS) under the hood. If so, AVX-512 is disabled. This works for gcc and icc (both of which use GAS under the hood), and clang if it is using GAS instead of its integrated assembler.

If AVX-512 is disabled, a warning will be printed during compilation. I think the pip installation will still be silent though, which may not be what we want.

I also had to change icc -xHost to icc -march=native. It doesn't seem possible to disable AVX-512F (with -mno-avx512f, for example) with the former, only the latter. I don't think it will make a big difference.

There was also a bug in the icc compiler options: we were using -xHost -xCORE-AVX512 indiscriminately, which will always generate AVX-512 instructions, even on non-AVX-512 machines!

I also changed the compilation output coloring to use tput instead of echo and escape sequences. On Ubuntu, I was getting tons of -es in my output because sh is actually dash, and echo in dash doesn't have a -e flag (it's enabled by default). tput seemed like a more robust solution than trying to detect dash, but let me know what you think.

@lgarrison lgarrison added this to the v2.3.2 milestone Oct 3, 2019
@manodeep
Copy link
Owner

manodeep commented Oct 6, 2019

I tested on ozstar, with binutils/2.30. While the fix worked with gcc (disabled AVX512F), it did not work for icc.

I had added an #error AVX512F should not be enabled line here to make sure that the compilation fails, if AVX512F was enabled. Here is the error generated:

icc -DDOUBLE_PREC  -DVERSION=\"2.3.1\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -march=native -qopenmp -mno-avx512f -I../../io -I../../utils  -c countpairs_impl_double.c -o countpairs_impl_double.o
icc: command line warning #10159: invalid argument for option '-m'
In file included from countpairs_impl_double.c(21):
countpairs_kernels_double.c(30): error: #error directive: AVX512F should not be enabled
  #error AVX512F should not be enabled
   ^

The option to disable AVX512F with icc is different, but I have not yet been able to figure out what that is. My attempts with -mno-COMMON-AVX512 etc were fruitless. Here is how I was trying it:

[~/temp/test_corrfunc_gas_bug/Corrfunc @farnarkle2] icc -O3 -std=c99  -dM -E -xc -c /dev/null | grep -i avx
[~/temp/test_corrfunc_gas_bug/Corrfunc @farnarkle2] icc -march=native -O3 -std=c99  -dM -E -xc -c /dev/null | grep -i avx
#define __AVX__ 1
#define __AVX_I__ 1
#define __AVX2__ 1
#define __AVX512F__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512BW__ 1
#define __AVX512VL__ 1

So the -march=native is definitely affecting the compiler flags. Without the -march=native, the AVX* flags are not enabled.

I do have another suggestion - will you please replace the magenta colour for the text for the message "Disabling AVX512" with red. This is an important enough message that users should pay attention.

@lgarrison
Copy link
Collaborator Author

Hmm, that's too bad about icc. It works for me on icc 19, what version are you running? But we'll have to think of another solution that works for earlier icc too.

@manodeep
Copy link
Owner

manodeep commented Oct 6, 2019

@lgarrison I was using icc 18.0.1. Intel/2016 (the other version available on ozstar), does not support AVX512 - so the fix is not triggered.

@manodeep
Copy link
Owner

manodeep commented Oct 6, 2019

One solution could be to capture the output of $(CC) -march=native -dM -E -xc -c /dev/null and checking what instruction sets are supported. We can then filter-out the AVX512 flags when GAS 2.30/2.31 is detected

@lgarrison
Copy link
Collaborator Author

Can you try this version? -xCORE-AVX2 seems to override -march=native, even on older icc.

@manodeep
Copy link
Owner

manodeep commented Oct 7, 2019

Yup - this works for me with intel 2018. Okay to go ahead and "squash + merge"?

@lgarrison
Copy link
Collaborator Author

Yes!

@manodeep manodeep merged commit 1b0038f into master Oct 7, 2019
@manodeep manodeep deleted the gas_bug_workaround branch October 11, 2019 17:56
@lgarrison
Copy link
Collaborator Author

lgarrison commented May 10, 2021

Just adding a note to document that @SandyYuan and I were able to get AVX-512 working on a Linux system that exhibited this bug by using conda-forge compilers. The steps were:

$ conda create -n forge -c conda-forge python=3
$ conda activate forge
$ conda install -c conda-forge gcc_linux-64 binutils_linux-64 gsl
$ git clone https://github.com/manodeep/Corrfunc.git
$ cd Corrfunc/
$ # edit common.mk with CC := x86_64-conda_cos6-linux-gnu-gcc
$ pip install -e ./

In theory, mainline conda also has binutils>=2.32, but that wasn't working for us. And there's probably a syntax to pass CC to pip without cloning the repo to edit common.mk, but we didn't test that.

@manodeep
Copy link
Owner

manodeep commented May 10, 2021

The (in)ability to pass the compiler at the command-line when installing with pip has always bothered me. I have now figured it out -- the solution is to use python -m pip install --verbose -e . --install-option="CC=/path/to/custom/compiler"

Relevant SO link

@lgarrison
Copy link
Collaborator Author

Forgot to mention maybe the most pernicious bug: one should install GSL from conda when using conda's compilers, because if it picks up the system GSL, it will add -I/usr/include to the CFLAGS, which will intercept non-GSL #includes, loading a set of headers incompatible with conda's compilers.

@manodeep
Copy link
Owner

Right - of course! You have to be entirely within the conda ecosystem

@valerio-marra
Copy link

conda install -c conda-forge gcc_linux-64 binutils_linux-64 gsl

hi @lgarrison, this solution worked for me. In case it is interesting, i report the improvement in performance:

Node details:

  • 4x12 Core Skylake Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
  • 512 GB DDR3 1333 MHz

AVX

  • Running corrfunc with 20000000 particles on 48 threads
    DDtheta_mocks took 519.1 seconds
  • Running corrfunc with 20000000 particles on 96 threads
    DDtheta_mocks took 409.5 seconds

AVX512

  • Running corrfunc with 20000000 particles on 48 threads
    DDtheta_mocks took 521.4 seconds
  • Running corrfunc with 20000000 particles on 96 threads
    DDtheta_mocks took 332.8 seconds

The odd thing is how much performance with AVX512 decreases when 48 threads are used.

@lgarrison
Copy link
Collaborator Author

Thanks for reporting back! I'm glad you did get a speedup after all. Definitely curious that you don't see a speedup until you use hyperthreading. Maybe the floating-point throughput is not the bottleneck initially, until you use enough threads to mask whatever memory/cache/backend latency is occurring.

You may also be interested in seeing if PR #258 give you extra performance, if your use-case allows for evenly-spaced bins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Different results with the same input for each run if the input arrays are float
3 participants