Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect logic in opal_config_asm.m4 for 128-wide atomic ops #5529

Closed
flang-cavium opened this issue Aug 9, 2018 · 32 comments
Closed

incorrect logic in opal_config_asm.m4 for 128-wide atomic ops #5529

flang-cavium opened this issue Aug 9, 2018 · 32 comments

Comments

@flang-cavium
Copy link

flang-cavium commented Aug 9, 2018

Thank you for taking the time to submit an issue!

Patch is based on OpenMPI 3.1.1 - released tarball.

Compiled and installed from source.

Operating system: Ubuntu 16.04 AArch64 (ARM64).
Compilers:

  • GCC 7.3 and 8.1
  • clang/LLVM 6.0.1

Problem description:

The logic for discovering and enabling 128-bit wide atomic ops at ./configure time is incorrect. Essentially, 128-wide atomic ops are enabled regardless of whether or not the Target actually supports them.

The file opal_config_asm.m4 contains the following:

if test ! "$enable_cross_cmpset128" = "yes" ; then
### enable 128-wide atomic cmpset
fi

The logical test

if test ! "$enable_cross_cmpset128" = "yes" ; then

is equivalent to

if test "$enable_cross_cmpset128" = "no" ; then

which means, cmpset128 will be enabled even though ./configure was told

--enable-cross-cmpset128=no

Further down, there is another test for $enable_cross_cmpset128:

if test ! "$enable_cross_cmpset128" = "yes" ; then

### enable 128-wide atomic CAS

else

### enable 128-wide atomic CAS

fi

if test ! "$enable_cross_cmpset128" = "yes" ; then

is the equivalent of saying

if test "$enable_cross_cmpset128" = "no" ; then

Again, same logic problem, but this time around making certain 128-wide CAS really gets enable no matter what ./configure says.

Also: trying to attach a patch named opal_config_asm.m4.patch results in the following error:

We don't support that file type. Try again with a GIF, JPEG, JPG, PNG, DOCX, GZ, LOG, PDF, PPTX, TXT, XLSX or ZIP.

I seriously doubt anyone would ever submit a GIF, JPEG, JPG or PNG patch. I had to rename the patch file from opal_config_asm.m4.patch to opal_config_asm.m4.patch.txt.
opal_config_asm.m4.patch.txt

@bosilca
Copy link
Member

bosilca commented Aug 11, 2018

The current logic is correct, the code you pinpoint to handles both the support for and the cross compiling support for 128 bits atomics. If support for 128 bits atomics exists, it cannot be disabled as it is the safest (and most efficient way) to cope with some ABA issues. The conditions are written in a reversed way, but the logic itself holds.

@jsquyres
Copy link
Member

@flang-cavium George is correct. I'll close this issue; please feel free to reply if you have further questions.

And here's why attaching an image is sometimes useful 😄:

thank-you

@flang-cavium
Copy link
Author

Ummm, no.

George is not correct. Actually, George is 100% wrong.

This has nothing to do with cross-compiling.

Two-value logic: "yes" or "no". ! "yes" means "no". Oui? Non?

I have already explained in the bug in detail. Care to actually take a look?

@flang-cavium
Copy link
Author

Question: "ABA" -- American Banking Association?

@bosilca
Copy link
Member

bosilca commented Aug 13, 2018

Actually I did look at your bug description and at your patch and the current code before writing my answer. Your conclusion doesn't stand.

ABA stands for the ABA problem, a multi-threaded synchronization issue that arise when 2 variables must be both atomically updated. 128 bits wide atomic operations provides a simple and efficient solution.

@flang-cavium
Copy link
Author

flang-cavium commented Aug 13, 2018

So, in your esteemed opinion, cmpxchg16b should always be enabled, regardless of whether or not the ISA supports it, or whether or not ./configure said --enable-cross-cmpset128=no and --enable-cross-cmpxchg16b=no?

Is this what you are saying here?

What does "128 bits wide atomic operations provides a simple and efficient solution" really mean?

Not all ISA's suport 128-bit wide atomics. Not even all flavors of X86_64 support it.

You still haven't explained what this has to do with cross-compiling.

@flang-cavium
Copy link
Author

Not to mention that the ./configure test for 128-wide atomics when compiling with GCC is inherently broken:

int main()
{
  __int128 x = 0, y = 0;
  __atomic_compare_exchange_n (&x, &y, 1, 0, __ATOMIC_RELAXED,
                               __ATOMIC_RELAXED);
  return 0;
}

/usr/local/gcc72/bin/gcc -D_GNU_SOURCE -O2 -std=c99 -mtune=corei7 -mfma -mavx -Wl,-rpath -Wl,/usr/local/gcc72/lib64 int128.c -o int128-gcc
/tmp/ccjdCiyN.o: In function main': int128.c:(.text.startup+0x29): undefined reference to __atomic_compare_exchange_16'
collect2: error: ld returned 1 exit status

So, when compiling with GCC, the test fails at link-time. GCC needs -latomic on link-line, which is absent in the ./configure test.

This ends up making cmpxchg16b and the 128-wide atomics always disabled when compiling with GCC. Problem is, cmpxchg16b with GCC ends up being disabled for all the wrong reasons. 128-wide atomics with GCC will end up disabled even for those X86_64 micro-arch'es that actually support 128-wide atomics and cmpxchg16b.

@bosilca
Copy link
Member

bosilca commented Aug 14, 2018

enable_cross_cmpset128 is not the typical enable/disable argument, it only needs one single value to hint configure to do a special check for the cross-compile case. The reason is that when not cross-compiling one can run a test during configure to validate the needed support, including looking for the existence of 128-bits atomic operations, and the flags or libraries such support requires.

And yes, I am saying that 128 bits atomics should always be used when available. The only way to turn it off is to completely disable all builtin atomics via --disable-builtin-atomics.

Anyway, you pinpointed a real problem: OMPI fails to detect support for 16 bits atomics on newer compilers that provide atomic support via the atomic library. Patch underway.

@flang-cavium
Copy link
Author

We're getting there.

I am saying that 128 bits atomics should always be used when available.

Exactly. 128-wide atomics are not available on every single ISA.

What you are advocating here is an all-or-nothing proposition.

Either we have 128-wide atomics, in which case we can use atomics, or we don't, in which case we can't use atomics.

This is wrong.

32-wide and 64-bit atomics are available on ISA's that do not have 128-wide atomics. For example, ARM64. Which means we should be able to use 32-bit and 64-bit atomics even when the 128-wide atomics aren't available.

Which is precisely what ./configure pretends it does, but lies about it:

--enable-cross-cmpset128
--enable-cross-cmpxchg16b

If I pass "no" to either or both these ./configure arguments, the 128-wide atomics should be disabled. But they are not disabled. They are always enabled, although I am not even cross-compiling.

Which means ./configure should - at a minimum - test whether or not I am cross-compiling or not. And if I am not cross-compiling, they should be disabled unless I am explicitly enabling them with

--enable-cross-cmpset128=yes
--enable-cross-cmpxchg16b=yes

And if you would have taken the time to test the patch, instead of arbitrarily discarding it wtihout testing, you would have noticed that the patch does exactly what you want it to do, and cures the problem of having 128-wide atomics always enabled, even for the ISA's that do not support them.

@bosilca
Copy link
Member

bosilca commented Aug 15, 2018

You have deeply misunderstood or misinterpreted every single comment I made. As I said before I read and test your patch, and it does not work.

@flang-cavium
Copy link
Author

No, I don't think I misunderstood anything.

I have tested my patch as well. And it works. I seriously doubt you tested it.

@bosilca
Copy link
Member

bosilca commented Aug 15, 2018

Let me break it down for you:

  1. Your initial analysis is misleading. You claim that due to the enable_cross_cmpset128 check we enable 128 bits atomics on all branches. Wrong. We enable checking for 128 bits atomics support on all branches, but this checking could result in turning them off.
  2. You raised a valid point about not using -latomic, but your patch doesn't even addresses this. Moreover, adding -latomic is not required in all cases, in some cases using the -mcx16 flag is enough.
  3. You have certainly successfully tested your patch on ARM64 but on a x86_64 using your patch and one of the compilers you mentioned (gcc 7.3) the 128 bits atomics are completely disabled (OPAL_HAVE_SYNC_BUILTIN_CSWAP_INT128 and OPAL_HAVE_GCC_BUILTIN_CSWAP_INT128 unset). Clearly incorrect because the __sync natively supports 128 atomics in this environment.
  4. It turns out there is a bug in opal_config_asm.m4 when handling __atomic_always_lock_free. If your patch would have correctly enabled 128 bits atomics you would have noticed it.

Please take a look at #5546 to see if it addresses your issue.

@flang-cavium
Copy link
Author

Your initial analysis is misleading. You claim that due to the enable_cross_cmpset128 check we enable 128 bits atomics on all branches. Wrong. We enable checking for 128 bits atomics support on all branches, but this checking could result in turning them off.

No. That is not what I am claiming.

What I am claiming is this:

When configuring OpenMPI from the released tarball, with a compiler OTHER than GCC -- namely clang -- the following line is ALWAYS generated in ${top_srcdir}/opal/include/opal_config.h:

/* Whether the __atomic builtin atomic compare and swap is lock-free on
128-bit values */
#define OPAL_HAVE_GCC_BUILTIN_CSWAP_INT128 1

This will happen with clang - for example - regardless of whether ./configure says

--enable-cross-cmpset128=no --enable-cross-cmpxchg16b=no

or

--enable-cross-cmpset128=yes --enable-cross-cmpxchg16b=yes

This is on ARM64, and it is wrong. ARM64 does not have 128-wide atomics.

This happens with clang because clang returns 0 for the test for 128-wide atomics on ANY ISA, in spite of the fact that clang doesn't even have support for a int128_t or uint128_t type.

You could argue that this is a clang bug, but that's besides the point. They aren't going to change this behavior anytime soon.

The reason this does not happen with GCC is because of the lack of -latomic on link line for the GCC atomics test - which I have already described above.

CAVEAT: If you add -latomic on link-line to the GCC test, the test will link. That does not mean that 128-wide atomics are valid on ARM64. They are not.

There are FOUR bugs here:

  1. OpenMPI's ./configure and opal_config_asm.m4 always ignores the values passed to --enable-cross-cmpset128 and --enable-cross-cmpxchg16b arguments. You can pass "yes" to either of these arguments, or you can pass "no". OpenMPI's ./configure will always enable them. This is wrong, for reasons already explained.

  2. The reason the 128-wide atomic test fails on GCC has nothing to do with the values passed to --enable-cross-cmpset128 and --enable-cross-cmpxchg16b. It has everything to do with -latomic passed on link-line for the atomics test. As explained above.

  3. Given that the test for 128-wide atomics with GCC always fails, these atomics have, in fact, never been tested on any Linux distro. Why do I say that? Because the default compiler for any Linux distro that I can think of is GCC.

  4. For the ISA's that support 128-wide atomics, the arguments to the 128-wide CAS must be 16-byte aligned. Generally speaking. X86_64 lets you get away without that requirement because X86_64 ignores misalignments. Other ISA's do not ignore misalignments. There is no test for the correct alignment.

If you apply my opal_config_asm.m4 patch and you pass:

--enable-cross-cmpset128=yes --enable-cross-cmpxchg16b=yes AND you add -latomic on link-line to the atomic test for GCC in ./configure, then ./configure will enable the 128-wide atomics.

If you pass --enable-cross-cmpset128=no --enable-cross-cmpxchg16b=no then ./configure won't enable them.

So, if you really want to know what happens with 128-wide atomics on ISA's that don't support them, please do the following, as a test:

  • ./configure OpenMPI with a recent enough GCC, but without my patch.
  • Add -latomic to the GCC atomics test in ./configure. This will make the GCC atomic test pass.
  • Build OpenMPI.
  • Run gmake check in ${top_builddir}.
  • See what happens with the following tests in ${top_builddir}/test/class/:

opal_pointer_array
opal_lifo
opal_fifo

You can easily test this on an X86_64 CPU that doesn't support cmpxchg16b. For example a laptop with a Corei5 or Corei7 CPU.

I can tell you what happened on my Corei7 laptop when I tested this with GCC 7.2: these three tests fail with a SEGV at run-time.

@jsquyres
Copy link
Member

It looks like #5546 is... complicated (per the discussion between @bosilca and @hjelmn). Portable software is hard! Ugh.

@bosilca and I just got off the phone discussing this issue. I think this boils down to two issues:

The configure test for 128 cmpset

You made a remark in your last reply that finally made it click for us: our configure test is getting the wrong answer for clang+ARM64. This is the root cause of all the problems.

We got lost in the discussion and our own biases because of the --enable-cross-* CLI stuff (see below for more on that).

George is going to amend #5546 to strengthen the test to make sure that __sync_bool_compare_and_swap() actually works (vs. does not abort). This should fix the configure test, and will finally enable you to get a good build (i.e., with no 128 bit atomics for your platform).

The --enable-cross-* CLI options

Yes, we're totally doing something weird/ugly with these options (i.e., that --disable-cross-* doesn't actually disable the tests). Specifically: that's not what those CLI options were intended for; they were intended to be "I know better and/or I'm cross-compiling; just do the link test and not the run test, and trust me that it works." We're limited by Autoconf in the types of CLI options that we can add; we can only add --enable-BLAH or --with-BLAH (and the corresponding --disable-BLAH or --without-BLAH).

I don't fully agree with what we did with these CLI options, but it was the result of compromises due to Autoconf's limitations, and they have now been there for a long time. We really shouldn't change the behavior of these CLI options in the middle of release series, either (our users tend to get annoyed when we break build or run scripts in the middle of a release series).

Here's what I propose:

  1. Fully document these --enable-cross-* CLI options in README, to include obvious verbiage about "--disable-cross-* does not do what you think it will do! ...yadda yadda yadda" (they're not listed in README right now).
  2. Add much more description to the AS_HELP_STRINGs for these CLI options. I'm guessing you found these options via ./configure --help and assumed you could use them because of their names. If there was a help string right there more fully describing what these CLI options do / do not do, this whole misunderstanding would not have occurred.
  3. Put a check in to make sure that the user does not specify --disable-cross-*. And if they do, abort configure and say "Sorry, we did something totally weird with --disable-cross-BLAH; please see the README yadda yadda yadda." So that at least it's obvious that we don't actually disable these checks.

Howzat?

@bosilca
Copy link
Member

bosilca commented Aug 15, 2018

Thanks, the information about clang is very interesting. If that is the root of the issue we can certainly amend how we test for the existence of the 128-bits atomics to make it work despite the poor support on the compiler side.

I don't have access to a ARM64 machine, but on my Core i7 laptop, neither with gcc7.3 not clang-902 (llvm 9.1.0) the default OMPI always disables the 128-bits atomics completely and with my patch the __sync support is used. All tests run to completion.

I would like to better understand the statement:

This happens with clang because clang returns 0 for the test for 128-wide atomics on ANY ISA, in spite of the fact that clang doesn't even have support for a int128_t or uint128_t type.

According to the clang documentation they have support for some __sync and some __atomic operations for some POSIX types, but there is no mention about the 128 bits atomics. So what exactly returns 0 ? The linking of the small test for 128-bits atomics succeed but the __sync call returns 0 ?

@flang-cavium
Copy link
Author

flang-cavium commented Aug 15, 2018

%> uname -a
Linux VAL1-14 4.8.0-32-t99-perf #34 SMP Wed Jan 18 18:34:20 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux
%> /opt/cavium/bin/clang --version
clang version 6.0.1 (https://github.com/flang-compiler/flang-driver.git ccd507ca383f2122b2c5510aac0f7393f440398a) (http://llvm.org/git/llvm.git 057310b8fc41e5089dbee0d2c90e4233756a83c9)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/cavium/bin

%> cat ./int128.c
int main()
{
  __int128 x = 0, y = 0;
  __atomic_compare_exchange_n (&x, &y, 1, 0, __ATOMIC_RELAXED,
                               __ATOMIC_RELAXED);
  return 0;
}

%> /opt/cavium/bin/clang -D_GNU_SOURCE -O3 -std=c99 -mcpu=thunderx2t99 -march=armv8.1-a+lse -ffast-math -ffp-contract=fast -funroll-loops -finline-functions -fslp-vectorize -fvectorize -Wl,-rpath -Wl,/opt/cavium/lib int128.c -o int128-clang
%> echo $status
0
%> ./int128-clang
%> echo $status
0
%> 

clang doesn't use GCC's libatomic. It has its own builtins for all the __atomic* ops.

Also: clang support for __int128 (and int128_t/uint128_t) was added only very recently to clang:

https://bugs.llvm.org/show_bug.cgi?id=36862 (March 2018).

and it's definitely not available in clang/LLVM < 7.0, which isn't released yet.

So, for clang versions prior to 7.0, we can safely assume that support for __int128 is non-existent.

The clang bug is that it doesn't issue a diagnostic. Which, technically, it isn't required to, given that __int128 is a GNU extension.

@jsquyres
Copy link
Member

Github pro tip: mark long verbatim sections with three single tick marks: https://help.github.com/articles/basic-writing-and-formatting-syntax/#quoting-code

@flang-cavium
Copy link
Author

Also - and I almost forgot:

Even when __int128 becomes available in ARM64 clang, it may not be lock-free.

@jsquyres
Copy link
Member

@flang-cavium Can you also output:

  • The return value from __atomic_compare_exchange_n() (if there is one)
  • After the call returns, the value of x
  • After the call returns, the value of y

@flang-cavium
Copy link
Author

%> cat ./int128.c
int main()
{
  __int128 x = 0, y = 0;
  __int128 r = __atomic_compare_exchange_n (&x, &y, 1, 0, __ATOMIC_RELAXED,
                                            __ATOMIC_RELAXED);
  (void) fprintf(stderr, "x=%llu y=%llu r=%llu\n", x, y, r);
  return 0;
}

%> /opt/cavium/bin/clang -D_GNU_SOURCE -O3 -std=c99 -mcpu=thunderx2t99 -march=armv8.1-a+lse -ffast-math -ffp-contract=fast -funroll-loops -finline-functions -fslp-vectorize -fvectorize -Wl,-rpath -Wl,/opt/cavium/lib int128.c -o int128-clang
int128.c:8:52: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "x=%llu y=%llu r=%llu\n", x, y, r);
                            ~~~~                   ^
int128.c:8:55: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "x=%llu y=%llu r=%llu\n", x, y, r);
                                   ~~~~               ^
int128.c:8:58: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "x=%llu y=%llu r=%llu\n", x, y, r);
                                          ~~~~           ^
3 warnings generated.
%> echo $status
0
%> ./int128-clang
x=1 y=0 r=0
%> echo $status
0
%>

There is no format specifier available for __int128.

@flang-cavium
Copy link
Author

For those interested:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801h/bzf1476202789613.html

This is the ARM documentation for the ARM V8.1a LSE extensions - namely lock-free atomics.

There's a whole bunch of instructions there: CASA/CASL/CASAL, LDADDA, LDADDL, LDADDAL, etc.

As you can see, there are no 128-wide lock-free atomics defined in the ARM64 ISA Specification. Only 32-bit and 64-bit. Only the W[x] (32-wide) and the X[x] (64-wide) registers are defined.

So, 128-wide lock-free atomics on ARM64 will not happen until ARM defines the ISA for them.

jsquyres added a commit to bosilca/ompi that referenced this issue Aug 16, 2018
Per lengthy discussion on open-mpi#5529
and open-mpi#5546, it seems that -- at
least as of August 2018 -- on ARM64+clang, the
__atomic_compare_exchange_n() intrinsic function does exist, but it
does not behave the way that Open MPI expects it to.  Strengthen our
configure test to make sure that the function not only exists, but
also behaves the way Open MPI expects it to.

Many thanks to Stefan Teleman for the bug report.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@bosilca
Copy link
Member

bosilca commented Aug 16, 2018

This statement might not be 100% accurate #5546 (comment).

@ggouaillardet
Copy link
Contributor

@flang-cavium back to the initial problem description

typically, $enable_cross_cmpset128 can have 3 values

  • yes if configure --enable-cross-cmpset128
  • no if configure --disable-cross-cmpset128
  • empty if no option was passed to the configure command line

For the sake of completeness, it is possible to configure --enable-cross-cmpset=FOO and Open MPI does not handle this properly (and I guess the reason is this is not a wise thing to do in the first place).

Bottom line, and regardless the bozo case, test ! "$enable_cross_cmpset128" = "yes" is not equivalent to test "$enable_cross_cmpset128" = "no" (since $enable_cross_cmpset is not set in most use cases)

@flang-cavium
Copy link
Author

flang-cavium commented Aug 17, 2018 via email

@flang-cavium
Copy link
Author

flang-cavium commented Aug 17, 2018 via email

@bosilca
Copy link
Member

bosilca commented Aug 17, 2018

While we appreciate your bug report, your trolling is unwelcome. If whoever you represent cares about this issue, they better find someone else to pursue this discussion. Good speed.

@flang-cavium
Copy link
Author

flang-cavium commented Aug 17, 2018 via email

@jsquyres
Copy link
Member

With @flang-cavium's help, I filed an upstream bug with llvm: https://bugs.llvm.org/show_bug.cgi?id=38621

@jjones-cavium
Copy link

Thanks for everyone's assistance. My team appreciates it.

From now on, I will be handling any interactions with the OpenMPI project.

@jsquyres
Copy link
Member

@jjones-cavium Ok, great -- thanks!

We got the PR merged on master; it's awaiting merge on the v3.0.x, v3.1.x, and v4.0.x branches (#5617, #5618, #5619, respectively). Once those 3 are merged, this issue will be closed.

@jsquyres
Copy link
Member

This has now been merged into master, v3.0.x, v3.1.x, and v4.0.x branches.

Done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants