Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the detection of 128 bits atomics. #5546

Merged
merged 3 commits into from
Aug 28, 2018

Conversation

bosilca
Copy link
Member

@bosilca bosilca commented Aug 15, 2018

Thanks to @flang-cavium for identifying this issue and providing a proof-of-concept patch and a very accurate description of the issue in #5529.

It turns out we were missing the check for -latomic to get support for 128-bits atomics. Second we need to unset the local variables (_save) to avoid a conflict in OPAL_CHECK_SYNC_BUILTIN_CSWAP_INT128.

This patch might conflict with #5445, they are complementary and should be merged.

Signed-off-by: George Bosilca bosilca@icl.utk.edu

@bosilca
Copy link
Member Author

bosilca commented Aug 15, 2018

@jsquyres, @bwbarrett and @hppritcha once we converge on how we want to address this issue, we will need to push it to all stable releases. We can certainly address the cross-128 argument, but the purpose of this patch is to be minimally invasive and only address the detection of 128 atomic support.

@hjelmn
Copy link
Member

hjelmn commented Aug 15, 2018

@bosilca The problem is often when the compare-exchange support for 128-bit atomics requires libatomic then the support is not lock-free (example gcc on x86-64). This is why I didn't add -latomic. Is there some case where libatomic is required and the resulting support is lock-free?

@hjelmn
Copy link
Member

hjelmn commented Aug 15, 2018

To be fair I only tested that with icc and gcc. Both failed atomic_is_lock_free () for 128-bit atomics iff libatomic was required for atomic_compare_exchange_strong on 128-bit values.

@bosilca
Copy link
Member Author

bosilca commented Aug 15, 2018

@hjelmn I confirm, on the machines where I tested (x86_64 with different flavors of gcc and llvm) the atomics from -latomic are not lock-free. However, if we don't test for it we discard the check for 128-bits __sync and we would not mix the __atomic and __sync for x86_64. As a result all 128 bits atomics are disabled, which is not what we want.

We could modify the logic to always test for 128-bits sync if we fail to use the 128-bits __atomic. But, in the long term, compilers will fix their support for __atomic, and then with the current patch the detection will work seamlessly.

@jsquyres
Copy link
Member

@bosilca and I discussed this on the phone: see #5529 (comment).

The root cause of the problem, as identified by @flang-cavium in #5529, is that our AC_TRY_RUN() test is not strong enough on clang+ARM64. Specifically: we're running this:

__int128 x = 0;
__sync_bool_compare_and_swap (&x, 0, 1);

but we're not actually testing if that function actually worked -- we're just testing that that program doesn't abort. Apparently, on clang+ARM64, this function doesn't cause the program to abort, and therefore AC_TRY_RUN() doesn't fail, so therefore we (incorrectly) conclude that this platform supports 128 bit atomics.

@bosilca is going to amend the tests to make sure that the function actually worked, and then hopefully we'll get a correct answer from this configure test on clang+ARM64.

Additionally, I'm going to add a commit or two on this PR about the --enable-cross-* CLI options -- see #5529 (comment) for more details.

@flang-cavium
Copy link

Thank you for this.

OpenMPI is very important to us - Cavium. Our T99 (ARM64) chip is big in HPC/Super-Computing and OpenMPI is a very important part of that.

@hppritcha hppritcha removed this from the v4.0.0 milestone Aug 15, 2018
@hjelmn
Copy link
Member

hjelmn commented Aug 15, 2018

@bosilca I though we already always checked for __sync. If that isn't working there is a problem.

https://github.com/open-mpi/ompi/blob/master/opal/include/opal/sys/gcc_builtin/atomic.h#L191

@hjelmn
Copy link
Member

hjelmn commented Aug 15, 2018

And you are correct. A compiler may eventually have a lock-free CAS-128 in libatomic so there is no problem with this PR.

@hjelmn
Copy link
Member

hjelmn commented Aug 15, 2018

@jsquyres Crazy that it doesn't abort. That is an unexpected result. What does it do? Give a wrong answer or something?

@jsquyres
Copy link
Member

@hjelmn See #5529 (comment)

@jsquyres
Copy link
Member

@hjelmn George hasn't added commits yet to make sure that the 128 bit test actually works. So let's not merge yet!

@hjelmn
Copy link
Member

hjelmn commented Aug 16, 2018

well. from the user comment it looked like they somewhat worked. that seems to be a compiler bug

@hjelmn
Copy link
Member

hjelmn commented Aug 16, 2018

we need to add the check if they are lock free to the __atomic tests too. we do it for c11 in the other PR.... which i should probably merge

@jsquyres
Copy link
Member

jsquyres commented Aug 16, 2018

I think it might be worthwhile to use large values in the test that we run -- e.g., ~0 or something. Perhaps the function actually only operates on the lower 32 or 64 bits, leaving the upper bits alone...? I think @flang-cavium would run whatever tests you ask him to run.

Worse would be if it actually works (e.g., via emulation), but not in an atomic fashion. 😲

@bosilca
Copy link
Member Author

bosilca commented Aug 16, 2018

I got access to an arm64, so I might be able to dig into this a little bit. But the more I read the discussion the more confused I get, about what is supported and what not and about what is available through system libraries.

@flang-cavium
Copy link

I think it might be worthwhile to use large values in the test that we run

Here's what happens when testing with values other than 0 and 1:

  1. GCC 7.3 ARM64:
%> cat int128.c
#include <stdio.h>

int main()
{
  __int128 x = 123456789012345;
  __int128 y = 987654321098765;

  (void) fprintf(stderr, "before: x=%llu y=%llu\n", x, y);

  __int128 r = __atomic_compare_exchange_n (&x, &y,
                                            543210987654321,
                                            987650123498765,
                                            __ATOMIC_RELAXED,
                                            __ATOMIC_RELAXED);
  (void) fprintf(stderr, "after: x=%llu y=%llu r=%llu\n", x, y, r);

  return 0;
}
%> /usr/local/gcc73/bin/gcc -D_GNU_SOURCE -O3 -std=c99 -mcpu=thunderx2t99 -march=armv8.1-a+lse -ffast-math -ffp-contract=fast -funroll-loops -finline-functions -latomic -Wl,-rpath -Wl,/usr/local/gcc73/lib64 int128.c -o int128-gcc
%> echo $status
0
%> ./int128-gcc
before: x=123456789012345 y=0
after: x=123456789012345 y=0 r=123456789012345
%> echo $status
0
%>
  1. clang 6.0.1 ARM64:
%> /opt/cavium/bin/clang -D_GNU_SOURCE -O3 -std=c99 -mcpu=thunderx2t99 -march=armv8.1-a+lse -ffast-math -ffp-contract=fast -funroll-loops -finline-functions -fslp-vectorize -fvectorize -Wl,-rpath -Wl,/opt/cavium/lib int128.c -o int128-clang
int128.c:8:53: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "before: x=%llu y=%llu\n", x, y);
                                    ~~~~            ^
int128.c:8:56: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "before: x=%llu y=%llu\n", x, y);
                                           ~~~~        ^
int128.c:15:59: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "after: x=%llu y=%llu r=%llu\n", x, y, r);
                                   ~~~~                   ^
int128.c:15:62: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "after: x=%llu y=%llu r=%llu\n", x, y, r);
                                          ~~~~               ^
int128.c:15:65: warning: format specifies type 'unsigned long long' but the
      argument has type '__int128' [-Wformat]
  (void) fprintf(stderr, "after: x=%llu y=%llu r=%llu\n", x, y, r);
                                                 ~~~~           ^
5 warnings generated.
%> echo $status
0
%> ./int128-clang
before: x=123456789012345 y=0
after: x=543210987654321 y=0 r=123456789012345
%> echo $status
0
%> uname -a
Linux VAL1-14 4.8.0-32-t99-perf #34 SMP Wed Jan 18 18:34:20 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux
%>

Testing with 0 and 1 only is fraught with peril. Constant-folding to 0 and 1. Plus the 0 and 1 arguments are passed as integer constants to the __atomic* intrinsic call.

So the compiler will not actually use 128 bits to represent 0 or 1. It will use as little as possible.

@flang-cavium
Copy link

Worse would be if it actually works (e.g., via emulation), but not in an atomic fashion.

GCC's libatomic implements atomics as LL/SC with additional POSIX threads locking. Here's a snippet from the ARM64 disassembly:

0000000000001f00 <pthread_mutex_lock@plt>:
    1f00:       f0000090        adrp    x16, 14000 <memcpy@GLIBC_2.17>
    1f04:       f9401211        ldr     x17, [x16, #32]
    1f08:       91008210        add     x16, x16, #0x20
    1f0c:       d61f0220        br      x17

0000000000001f10 <pthread_mutex_unlock@plt>:
    1f10:       f0000090        adrp    x16, 14000 <memcpy@GLIBC_2.17>
    1f14:       f9401611        ldr     x17, [x16, #40]
    1f18:       9100a210        add     x16, x16, #0x28
    1f1c:       d61f0220        br      x17

[ ... ]

0000000000002b08 <libat_lock_1>:
    2b08:       927a1400        and     x0, x0, #0xfc0
    2b0c:       d0000081        adrp    x1, 14000 <memcpy@GLIBC_2.17>
    2b10:       91010021        add     x1, x1, #0x40
    2b14:       8b000020        add     x0, x1, x0
    2b18:       17fffcfa        b       1f00 <pthread_mutex_lock@plt>
    2b1c:       d503201f        nop

Also: LL/SC atomics can silently fail:

https://en.wikipedia.org/wiki/Load-link/store-conditional

@hjelmn
Copy link
Member

hjelmn commented Aug 16, 2018

Yeah. They can fail which is why there is usually a loop around ll/sc.

@bosilca
Copy link
Member Author

bosilca commented Aug 16, 2018

But that is outside the scope of this discussion, as if __atomic_compare_exchange_n exists it's the implementors responsibility to make it follow the specifications. Otherwise, if they provide a function that has a broken implementation there is little we can do except preventing OMPI from being built with that specific compiler.

Btw in your example above where did you find the__atomic_compare_exchange_n prototype ? The 4th argument is supposed to be a bool and not a __int128 isn't it ?

@jsquyres
Copy link
Member

I'd like to compare a configure output from current master to a configure output to this PR. I.e., why hasn't this been a problem with the XL Fortran compiler before?

@bosilca
Copy link
Member Author

bosilca commented Aug 24, 2018 via email

@jsquyres
Copy link
Member

jsquyres commented Aug 24, 2018

Actually, come to think of it, if we need -latomic for C, then we'll also need it to link C and Fortran MPI apps (i.e., in mpicc and mpifort). So it doesn't make sense to separate out the C and Fortran $LIBS just for this issue.

Meaning: the Fortran linker needs to be able to find -latomic.

IBM: please advise.

@ibm-ompi
Copy link

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/7a36c9443b32c20d4c2a135db2e915c8

@jsquyres
Copy link
Member

@jjhursey @gpaulsen Can you grab a configure output / config.log from the latest failed CI build? I made a change in the configury, and I'd like to see the output. Thanks.

Thanks to Stefan Teleman for identifying this issue and providing a
proof-of-concept patch.  We ended up revamping the detection of
128-bit atomics to reduce duplicated code and be a slightly simpler --
albiet perhaps a bit more verbose -- approach:

- Remove the --enable-cross-* options; they were confusing and
  unnecessary.
- Always try to compile / link the compiler-intrinsic 128-bit atomic
  functions.
  - Strengthen the C tests we use to be more robust.
  - Use m4 to avoid duplicating the C tests multiple times in the .m4
    source.
- If not cross-compiling, try to run a short test and ensure that they
  actually work (as of Aug 2018, there's at least one platform where
  they don't: clang 6 on ARM64).  If cross-compiling, just assume that
  they work.
- Add more comments about what is going on with all the tests; it's
  tricky stuff.  Our Future Selves will thank us.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Whitespace change only; no code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Thanks to George for finding/fixing these.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres
Copy link
Member

@gpapaure @jjhursey Ignore the last request: the config.log that Josh sent was sufficient to figure out the issue.

@bosilca I think we're finally good to go. Can you give it a final review?

@hjelmn
Copy link
Member

hjelmn commented Aug 25, 2018

@jsquyres What was the problem?

@jsquyres
Copy link
Member

TL;DR: we weren't resetting CFLAGS/LIBS properly in one failure case.

More detail: it's still weird that IBM's XL CI environment has -latomic for the C compiler/linker but not for the Fortran compiler/linker. But with closer examination of the config.log that Josh sent, we shouldn't have been using -latomic anyway -- because it wasn't always lock-free. Once I fixed the "hey, we have -latomic but we don't want to use it" path to properly reset CFLAGS/LIBS, then IBM's XL CI environment isn't trying to use -latomic, and all is good.

FWIW, IBM may still want to report this to their XL dev team. Even though this has turned out to not be an issue for Open MPI, it can still be an issue when you need XLF to link a combined C/Fortran application.

@hjelmn
Copy link
Member

hjelmn commented Aug 25, 2018

I still don't think it is needed for fortran. If the C code was linked with -latomic that is sufficient. Fortran can link to it without issue.

@bosilca
Copy link
Member Author

bosilca commented Aug 25, 2018

@hjelmn we only have one $LIBS. If the 128 bits atomic implementation in -latomic become lockless, our current test will add -latomic to $LIBS and will therefore impact the linking of all Fortran code.

@bosilca
Copy link
Member Author

bosilca commented Aug 25, 2018

@jsquyres 👍 . I can't approve the PR, github has a legit policy that prevents authors from approving their own patches.

@jsquyres
Copy link
Member

george

@jsquyres
Copy link
Member

Before we merge, since there has been sooo many changes on this PR, I'd like to get a final round of confirmations from ARM people that this PR actually works for them.

@shamisp @nSircombe Can you confirm that this PR works properly?

@shamisp
Copy link
Contributor

shamisp commented Aug 25, 2018 via email

@jsquyres
Copy link
Member

@shamisp No problem.

@nSircombe
Copy link

Some more digging with the HPC Compiler team clears things up a bit...

The equivalent of @bosilca suggestion (#5546 (comment)) is implemented in clang on Armv8, on a pair of 64bit ints, in a loop.

So the configure log all makes sense, and I think the configure logic is fine now.

There are some issues for us that we may need to look at (Armv8.1, .2... implement this slightly differently, which appears to impact the __atomics. And the lse extension is clearly causing problems).

But for this PR things look ok - if @shamisp is happy, then I'm happy too.

@jsquyres
Copy link
Member

@nSircombe Thanks!
@shamisp Are you happy? 😉

@shamisp
Copy link
Contributor

shamisp commented Aug 28, 2018

@jsquyres @nSircombe - I'm okay with existing check.

For the record @hjelmn - considering OpenMPI use case I want to implement ompi internal version of 128bit AMOs using LDXRP/STXRP.

@jsquyres jsquyres merged commit 151efa6 into open-mpi:master Aug 28, 2018
@bosilca bosilca deleted the fix/128_atomics branch September 6, 2018 09:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.