Fix the detection of 128 bits atomics. #5546

bosilca · 2018-08-15T03:01:11Z

Thanks to @flang-cavium for identifying this issue and providing a proof-of-concept patch and a very accurate description of the issue in #5529.

It turns out we were missing the check for -latomic to get support for 128-bits atomics. Second we need to unset the local variables (_save) to avoid a conflict in OPAL_CHECK_SYNC_BUILTIN_CSWAP_INT128.

This patch might conflict with #5445, they are complementary and should be merged.

Signed-off-by: George Bosilca bosilca@icl.utk.edu

bosilca · 2018-08-15T03:16:06Z

@jsquyres, @bwbarrett and @hppritcha once we converge on how we want to address this issue, we will need to push it to all stable releases. We can certainly address the cross-128 argument, but the purpose of this patch is to be minimally invasive and only address the detection of 128 atomic support.

hjelmn · 2018-08-15T04:39:26Z

@bosilca The problem is often when the compare-exchange support for 128-bit atomics requires libatomic then the support is not lock-free (example gcc on x86-64). This is why I didn't add -latomic. Is there some case where libatomic is required and the resulting support is lock-free?

hjelmn · 2018-08-15T04:41:07Z

To be fair I only tested that with icc and gcc. Both failed atomic_is_lock_free () for 128-bit atomics iff libatomic was required for atomic_compare_exchange_strong on 128-bit values.

bosilca · 2018-08-15T14:06:11Z

@hjelmn I confirm, on the machines where I tested (x86_64 with different flavors of gcc and llvm) the atomics from -latomic are not lock-free. However, if we don't test for it we discard the check for 128-bits __sync and we would not mix the __atomic and __sync for x86_64. As a result all 128 bits atomics are disabled, which is not what we want.

We could modify the logic to always test for 128-bits sync if we fail to use the 128-bits __atomic. But, in the long term, compilers will fix their support for __atomic, and then with the current patch the detection will work seamlessly.

jsquyres · 2018-08-15T18:41:58Z

@bosilca and I discussed this on the phone: see #5529 (comment).

The root cause of the problem, as identified by @flang-cavium in #5529, is that our AC_TRY_RUN() test is not strong enough on clang+ARM64. Specifically: we're running this:

__int128 x = 0;
__sync_bool_compare_and_swap (&x, 0, 1);

but we're not actually testing if that function actually worked -- we're just testing that that program doesn't abort. Apparently, on clang+ARM64, this function doesn't cause the program to abort, and therefore AC_TRY_RUN() doesn't fail, so therefore we (incorrectly) conclude that this platform supports 128 bit atomics.

@bosilca is going to amend the tests to make sure that the function actually worked, and then hopefully we'll get a correct answer from this configure test on clang+ARM64.

Additionally, I'm going to add a commit or two on this PR about the --enable-cross-* CLI options -- see #5529 (comment) for more details.

flang-cavium · 2018-08-15T19:17:14Z

Thank you for this.

OpenMPI is very important to us - Cavium. Our T99 (ARM64) chip is big in HPC/Super-Computing and OpenMPI is a very important part of that.

hjelmn · 2018-08-15T23:12:36Z

@bosilca I though we already always checked for __sync. If that isn't working there is a problem.

https://github.com/open-mpi/ompi/blob/master/opal/include/opal/sys/gcc_builtin/atomic.h#L191

hjelmn · 2018-08-15T23:14:46Z

And you are correct. A compiler may eventually have a lock-free CAS-128 in libatomic so there is no problem with this PR.

hjelmn · 2018-08-15T23:18:20Z

@jsquyres Crazy that it doesn't abort. That is an unexpected result. What does it do? Give a wrong answer or something?

jsquyres · 2018-08-16T10:20:17Z

@hjelmn See #5529 (comment)

jsquyres · 2018-08-16T15:53:40Z

@hjelmn George hasn't added commits yet to make sure that the 128 bit test actually works. So let's not merge yet!

hjelmn · 2018-08-16T16:19:26Z

well. from the user comment it looked like they somewhat worked. that seems to be a compiler bug

hjelmn · 2018-08-16T16:20:55Z

we need to add the check if they are lock free to the __atomic tests too. we do it for c11 in the other PR.... which i should probably merge

jsquyres · 2018-08-16T16:39:40Z

I think it might be worthwhile to use large values in the test that we run -- e.g., ~0 or something. Perhaps the function actually only operates on the lower 32 or 64 bits, leaving the upper bits alone...? I think @flang-cavium would run whatever tests you ask him to run.

Worse would be if it actually works (e.g., via emulation), but not in an atomic fashion. 😲

bosilca · 2018-08-16T16:53:19Z

I got access to an arm64, so I might be able to dig into this a little bit. But the more I read the discussion the more confused I get, about what is supported and what not and about what is available through system libraries.

flang-cavium · 2018-08-16T16:54:09Z

I think it might be worthwhile to use large values in the test that we run

Here's what happens when testing with values other than 0 and 1:

GCC 7.3 ARM64:

%> cat int128.c
#include <stdio.h>

int main()
{
  __int128 x = 123456789012345;
  __int128 y = 987654321098765;

  (void) fprintf(stderr, "before: x=%llu y=%llu\n", x, y);

  __int128 r = __atomic_compare_exchange_n (&x, &y,
                                            543210987654321,
                                            987650123498765,
                                            __ATOMIC_RELAXED,
                                            __ATOMIC_RELAXED);
  (void) fprintf(stderr, "after: x=%llu y=%llu r=%llu\n", x, y, r);

  return 0;
}
%> /usr/local/gcc73/bin/gcc -D_GNU_SOURCE -O3 -std=c99 -mcpu=thunderx2t99 -march=armv8.1-a+lse -ffast-math -ffp-contract=fast -funroll-loops -finline-functions -latomic -Wl,-rpath -Wl,/usr/local/gcc73/lib64 int128.c -o int128-gcc
%> echo $status
0
%> ./int128-gcc
before: x=123456789012345 y=0
after: x=123456789012345 y=0 r=123456789012345
%> echo $status
0
%>

clang 6.0.1 ARM64:

%> /opt/cavium/bin/clang -D_GNU_SOURCE -O3 -std=c99 -mcpu=thunderx2t99 -march=armv8.1-a+lse -ffast-math -ffp-contract=fast -funroll-loops -finline-functions -fslp-vectorize -fvectorize -Wl,-rpath -Wl,/opt/cavium/lib int128.c -o int128-clang
int128.c:8:53: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "before: x=%llu y=%llu\n", x, y);
                                    ~~~~            ^
int128.c:8:56: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "before: x=%llu y=%llu\n", x, y);
                                           ~~~~        ^
int128.c:15:59: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "after: x=%llu y=%llu r=%llu\n", x, y, r);
                                   ~~~~                   ^
int128.c:15:62: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "after: x=%llu y=%llu r=%llu\n", x, y, r);
                                          ~~~~               ^
int128.c:15:65: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "after: x=%llu y=%llu r=%llu\n", x, y, r);
                                                 ~~~~           ^
5 warnings generated.
%> echo $status
0
%> ./int128-clang
before: x=123456789012345 y=0
after: x=543210987654321 y=0 r=123456789012345
%> echo $status
0
%> uname -a
Linux VAL1-14 4.8.0-32-t99-perf #34 SMP Wed Jan 18 18:34:20 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux
%>

Testing with 0 and 1 only is fraught with peril. Constant-folding to 0 and 1. Plus the 0 and 1 arguments are passed as integer constants to the __atomic* intrinsic call.

So the compiler will not actually use 128 bits to represent 0 or 1. It will use as little as possible.

flang-cavium · 2018-08-16T17:44:21Z

Worse would be if it actually works (e.g., via emulation), but not in an atomic fashion.

GCC's libatomic implements atomics as LL/SC with additional POSIX threads locking. Here's a snippet from the ARM64 disassembly:

0000000000001f00 <pthread_mutex_lock@plt>:
    1f00:       f0000090        adrp    x16, 14000 <memcpy@GLIBC_2.17>
    1f04:       f9401211        ldr     x17, [x16, #32]
    1f08:       91008210        add     x16, x16, #0x20
    1f0c:       d61f0220        br      x17

0000000000001f10 <pthread_mutex_unlock@plt>:
    1f10:       f0000090        adrp    x16, 14000 <memcpy@GLIBC_2.17>
    1f14:       f9401611        ldr     x17, [x16, #40]
    1f18:       9100a210        add     x16, x16, #0x28
    1f1c:       d61f0220        br      x17

[ ... ]

0000000000002b08 <libat_lock_1>:
    2b08:       927a1400        and     x0, x0, #0xfc0
    2b0c:       d0000081        adrp    x1, 14000 <memcpy@GLIBC_2.17>
    2b10:       91010021        add     x1, x1, #0x40
    2b14:       8b000020        add     x0, x1, x0
    2b18:       17fffcfa        b       1f00 <pthread_mutex_lock@plt>
    2b1c:       d503201f        nop

Also: LL/SC atomics can silently fail:

https://en.wikipedia.org/wiki/Load-link/store-conditional

hjelmn · 2018-08-16T17:50:24Z

Yeah. They can fail which is why there is usually a loop around ll/sc.

bosilca · 2018-08-16T18:01:01Z

But that is outside the scope of this discussion, as if __atomic_compare_exchange_n exists it's the implementors responsibility to make it follow the specifications. Otherwise, if they provide a function that has a broken implementation there is little we can do except preventing OMPI from being built with that specific compiler.

Btw in your example above where did you find the__atomic_compare_exchange_n prototype ? The 4th argument is supposed to be a bool and not a __int128 isn't it ?

jsquyres · 2018-08-24T23:37:57Z

I'd like to compare a configure output from current master to a configure output to this PR. I.e., why hasn't this been a problem with the XL Fortran compiler before?

bosilca · 2018-08-24T23:41:12Z

The check for - latomic was added in this PR.

…

On Fri, Aug 24, 2018, 7:37 PM Jeff Squyres ***@***.***> wrote: I'd like to compare a configure output from current master to a configure output to this PR. I.e., why hasn't this been a problem with the XL Fortran compiler before? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#5546 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAnOjRo71jyolDKlIS0Vuq2rTavyByMJks5uUI5WgaJpZM4V9fcF> .

jsquyres · 2018-08-24T23:55:03Z

Actually, come to think of it, if we need -latomic for C, then we'll also need it to link C and Fortran MPI apps (i.e., in mpicc and mpifort). So it doesn't make sense to separate out the C and Fortran $LIBS just for this issue.

Meaning: the Fortran linker needs to be able to find -latomic.

IBM: please advise.

ibm-ompi · 2018-08-25T01:01:18Z

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/7a36c9443b32c20d4c2a135db2e915c8

jsquyres · 2018-08-25T01:03:18Z

@jjhursey @gpaulsen Can you grab a configure output / config.log from the latest failed CI build? I made a change in the configury, and I'd like to see the output. Thanks.

Thanks to Stefan Teleman for identifying this issue and providing a proof-of-concept patch. We ended up revamping the detection of 128-bit atomics to reduce duplicated code and be a slightly simpler -- albiet perhaps a bit more verbose -- approach: - Remove the --enable-cross-* options; they were confusing and unnecessary. - Always try to compile / link the compiler-intrinsic 128-bit atomic functions. - Strengthen the C tests we use to be more robust. - Use m4 to avoid duplicating the C tests multiple times in the .m4 source. - If not cross-compiling, try to run a short test and ensure that they actually work (as of Aug 2018, there's at least one platform where they don't: clang 6 on ARM64). If cross-compiling, just assume that they work. - Add more comments about what is going on with all the tests; it's tricky stuff. Our Future Selves will thank us. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Whitespace change only; no code or logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Thanks to George for finding/fixing these. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

jsquyres · 2018-08-25T13:21:32Z

@gpapaure @jjhursey Ignore the last request: the config.log that Josh sent was sufficient to figure out the issue.

@bosilca I think we're finally good to go. Can you give it a final review?

hjelmn · 2018-08-25T13:52:28Z

@jsquyres What was the problem?

jsquyres · 2018-08-25T13:56:56Z

TL;DR: we weren't resetting CFLAGS/LIBS properly in one failure case.

More detail: it's still weird that IBM's XL CI environment has -latomic for the C compiler/linker but not for the Fortran compiler/linker. But with closer examination of the config.log that Josh sent, we shouldn't have been using -latomic anyway -- because it wasn't always lock-free. Once I fixed the "hey, we have -latomic but we don't want to use it" path to properly reset CFLAGS/LIBS, then IBM's XL CI environment isn't trying to use -latomic, and all is good.

FWIW, IBM may still want to report this to their XL dev team. Even though this has turned out to not be an issue for Open MPI, it can still be an issue when you need XLF to link a combined C/Fortran application.

hjelmn · 2018-08-25T14:01:25Z

I still don't think it is needed for fortran. If the C code was linked with -latomic that is sufficient. Fortran can link to it without issue.

bosilca · 2018-08-25T14:48:38Z

@hjelmn we only have one $LIBS. If the 128 bits atomic implementation in -latomic become lockless, our current test will add -latomic to $LIBS and will therefore impact the linking of all Fortran code.

bosilca · 2018-08-25T14:51:26Z

@jsquyres 👍 . I can't approve the PR, github has a legit policy that prevents authors from approving their own patches.

jsquyres · 2018-08-25T15:01:52Z

jsquyres · 2018-08-25T15:04:46Z

Before we merge, since there has been sooo many changes on this PR, I'd like to get a final round of confirmations from ARM people that this PR actually works for them.

@shamisp @nSircombe Can you confirm that this PR works properly?

shamisp · 2018-08-25T16:28:45Z

Hi Jeff, I would like to run it by few more developers before we sign off. Is it okay to wait until Monday COB ?

…

On Sat, Aug 25, 2018 at 10:04 Jeff Squyres ***@***.***> wrote: Before we merge, since there has been sooo many changes on this PR, I'd like to get a final round of confirmations from ARM people that this PR actually works for them. @shamisp <https://github.com/shamisp> @nSircombe <https://github.com/nSircombe> Can you confirm that this PR works properly? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#5546 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACIe2Ahr-43A9ZgDIJa1NX939xQ7_D4yks5uUWeRgaJpZM4V9fcF> .

jsquyres · 2018-08-26T12:10:37Z

@shamisp No problem.

nSircombe · 2018-08-28T14:01:33Z

Some more digging with the HPC Compiler team clears things up a bit...

The equivalent of @bosilca suggestion (#5546 (comment)) is implemented in clang on Armv8, on a pair of 64bit ints, in a loop.

So the configure log all makes sense, and I think the configure logic is fine now.

There are some issues for us that we may need to look at (Armv8.1, .2... implement this slightly differently, which appears to impact the __atomics. And the lse extension is clearly causing problems).

But for this PR things look ok - if @shamisp is happy, then I'm happy too.

jsquyres · 2018-08-28T14:35:33Z

@nSircombe Thanks!
@shamisp Are you happy? 😉

shamisp · 2018-08-28T14:39:55Z

@jsquyres @nSircombe - I'm okay with existing check.

For the record @hjelmn - considering OpenMPI use case I want to implement ompi internal version of 128bit AMOs using LDXRP/STXRP.

bosilca added bug Severity: blocker labels Aug 15, 2018

bosilca added this to the v4.0.0 milestone Aug 15, 2018

bosilca self-assigned this Aug 15, 2018

bosilca requested a review from hjelmn August 15, 2018 03:01

bosilca mentioned this pull request Aug 15, 2018

incorrect logic in opal_config_asm.m4 for 128-wide atomic ops #5529

Closed

bosilca force-pushed the fix/128_atomics branch from 0e913ab to 8c5de45 Compare August 15, 2018 03:09

bwbarrett added the Target: main label Aug 15, 2018

hppritcha removed this from the v4.0.0 milestone Aug 15, 2018

hjelmn approved these changes Aug 15, 2018

View reviewed changes

jsquyres force-pushed the fix/128_atomics branch from 75e5632 to 3f69c40 Compare August 16, 2018 15:55

jsquyres force-pushed the fix/128_atomics branch from 358eca8 to bb65560 Compare August 25, 2018 01:24

jsquyres added 3 commits August 25, 2018 05:21

opal_config_asm.m4: replace tabs with spaces

63560fe

Whitespace change only; no code or logic changes. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

opal_functions.m4: minor typo fixes

9194dbb

Thanks to George for finding/fixing these. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

jsquyres force-pushed the fix/128_atomics branch from bb65560 to 9194dbb Compare August 25, 2018 12:25

hjelmn approved these changes Aug 25, 2018

View reviewed changes

jsquyres approved these changes Aug 25, 2018

View reviewed changes

jsquyres merged commit 151efa6 into open-mpi:master Aug 28, 2018

This was referenced Aug 28, 2018

v3.0.x: atomic 128 fixes #5617

Merged

v3.1.x: atomic 128 fixes #5618

Merged

v4.0.x: atomic 128 fixes #5619

Merged

bosilca deleted the fix/128_atomics branch September 6, 2018 09:47

Fix the detection of 128 bits atomics. #5546

Fix the detection of 128 bits atomics. #5546

Conversation

bosilca commented Aug 15, 2018

bosilca commented Aug 15, 2018

hjelmn commented Aug 15, 2018

hjelmn commented Aug 15, 2018

bosilca commented Aug 15, 2018

jsquyres commented Aug 15, 2018

flang-cavium commented Aug 15, 2018

hjelmn commented Aug 15, 2018

hjelmn commented Aug 15, 2018 • edited Loading

hjelmn commented Aug 15, 2018

jsquyres commented Aug 16, 2018

jsquyres commented Aug 16, 2018

hjelmn commented Aug 16, 2018

hjelmn commented Aug 16, 2018 • edited Loading

jsquyres commented Aug 16, 2018 • edited Loading

bosilca commented Aug 16, 2018 • edited Loading

flang-cavium commented Aug 16, 2018

flang-cavium commented Aug 16, 2018

hjelmn commented Aug 16, 2018

bosilca commented Aug 16, 2018

jsquyres commented Aug 24, 2018

bosilca commented Aug 24, 2018 via email

jsquyres commented Aug 24, 2018 • edited Loading

ibm-ompi commented Aug 25, 2018

jsquyres commented Aug 25, 2018

jsquyres commented Aug 25, 2018

hjelmn commented Aug 25, 2018

jsquyres commented Aug 25, 2018

hjelmn commented Aug 25, 2018

bosilca commented Aug 25, 2018

bosilca commented Aug 25, 2018

jsquyres commented Aug 25, 2018

jsquyres commented Aug 25, 2018

shamisp commented Aug 25, 2018 via email

jsquyres commented Aug 26, 2018

nSircombe commented Aug 28, 2018

jsquyres commented Aug 28, 2018

shamisp commented Aug 28, 2018

hjelmn commented Aug 15, 2018 •

edited

Loading

hjelmn commented Aug 16, 2018 •

edited

Loading

jsquyres commented Aug 16, 2018 •

edited

Loading

bosilca commented Aug 16, 2018 •

edited

Loading

jsquyres commented Aug 24, 2018 •

edited

Loading