Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v3.0.0rc1: PMIx issue: UNPACK-INADEQUATE-SPACE #3809

Closed
jsquyres opened this issue Jul 5, 2017 · 47 comments
Closed

v3.0.0rc1: PMIx issue: UNPACK-INADEQUATE-SPACE #3809

jsquyres opened this issue Jul 5, 2017 · 47 comments
Assignees
Labels
Milestone

Comments

@jsquyres
Copy link
Member

jsquyres commented Jul 5, 2017

From thread on devel started by @PHHargrove (https://www.mail-archive.com/devel@lists.open-mpi.org/msg20280.html): On (at least) two different hosts (both Linux, one x86-64 and one ppc64el) I am seeing a failure to launch ring_c with errors like those shown below.

Configure options: --prefix=[...] --enable-debug CC=... CXX=... FC=...
One of the two also had --with-libfabric=/usr/local/pkg/libfabric-1.4.0

If it might matter, one of these systems has 14 interfaces listed by /usr/sbin/ifconfig and the other has 8.

$ mpirun -mca btl sm,self -np 2 examples/ring_c
[pcp-d-1:02255] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-x86_64-pathcc-6/openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 2477
[pcp-d-1:02255] PMIX ERROR: ERROR in file /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-x86_64-pathcc-6/openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 2024
[pcp-d-1:02255] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-x86_64-pathcc-6/openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1150
[pcp-d-1:02255] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-x86_64-pathcc-6/openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c at line 112
[pcp-d-1:02255] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-x86_64-pathcc-6/openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c at line 392
[pcp-d-1:02255] PMIX ERROR: UNPACK-INADEQUATE-SPACE in file /home/phargrov/OMPI/openmpi-3.0.0rc1-linux-x86_64-pathcc-6/openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/server/pmix_server.c at line 518
@jsquyres jsquyres added the bug label Jul 5, 2017
@jsquyres jsquyres added this to the v3.0.0 milestone Jul 5, 2017
@hppritcha
Copy link
Member

Does the problem persist if vader BTL is used instead of sm?

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

I can't replicate it using the head of the 3.0 branch

@PHHargrove
Copy link
Member

@hppritcha No, -mca btl vader,self did not change or eliminate the error.

@PHHargrove
Copy link
Member

In the flood of data that arrives each time I test an RC, I had failed to notice that despite the fact that one is x86-64 and the other ppc64el, both failing systems are using PathScale compilers!

I don't have other runs on the x86-64 host, but on the ppc64el host a build with clang runs just fine!

So, it looks possible that this could be a PathScale compiler bug.

@PHHargrove
Copy link
Member

x86-64: ekopath-6.0.963 (snapshot / free download)

$ pathcc --version 
PathScale EKOPath(tm) Compiler Suite: Version 6.0.963
Built on:
Thread model: posix
GNU gcc compatible version 4.2.1

Copyright PathScale Inc and others.  All Rights Reserved.
You can find complete copyright, patent and legal notices in the corresponding documentation.

ppc64el: enzo-6.0.9

$ pathcc --version
PathScale ENZO(tm) Compiler Suite: Version 6.0.9
Built on:
Thread model: posix
GNU gcc compatible version 4.2.1

Copyright PathScale Inc and others.  All Rights Reserved.
You can find complete copyright, patent and legal notices in the corresponding documentation.

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

It could be a compiler issue, but more likely an uninitialized variable that Pathscale sets to a different default. We can poke in the code a bit and see what's going on.

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

Here's the culprit:

    while (PMIX_SUCCESS == (rc = pmix_bfrop.unpack(buf, kp, &(int){1}, PMIX_KVAL))) {

My guess is that Pathscale doesn't appreciate that cute &(int){1} in there.

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

@PHHargrove Can you please try with the following diff:

diff --git a/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c b/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c
index 573a83d..3d29fc2 100644
--- a/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c
+++ b/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c
@@ -2428,6 +2428,7 @@ static int _store_data_for_rank(ns_track_elem_t *ns_info, pmix_rank_t rank, pmix
     rank_meta_info *rinfo = NULL;
     size_t num_elems, free_offset, new_free_offset;
     int data_exist;
+    int32_t cnt;
 
     PMIX_OUTPUT_VERBOSE((10, pmix_globals.debug_output,
                          "%s:%d:%s: for rank %u", __FILE__, __LINE__, __func__, rank));
@@ -2458,7 +2459,8 @@ static int _store_data_for_rank(ns_track_elem_t *ns_info, pmix_rank_t rank, pmix
      */
     free_offset = get_free_offset(datadesc);
     kp = PMIX_NEW(pmix_kval_t);
-    while (PMIX_SUCCESS == (rc = pmix_bfrop.unpack(buf, kp, &(int){1}, PMIX_KVAL))) {
+    cnt = 1;
+    while (PMIX_SUCCESS == (rc = pmix_bfrop.unpack(buf, kp, &cnt, PMIX_KVAL))) {
         pmix_output_verbose(2, pmix_globals.debug_output,
                             "pmix: unpacked key %s", kp->key);
         if (PMIX_SUCCESS != (rc = pmix_sm_store(ns_info, rank, kp, &rinfo, data_exist))) {

@artpol84
Copy link
Contributor

artpol84 commented Jul 5, 2017

I recall that @hjelmn did similar (and maybe this) changes when we were debugging a dstore issues. If this is the root cause - may need to check for other places.

@artpol84 artpol84 closed this as completed Jul 5, 2017
@artpol84 artpol84 reopened this Jul 5, 2017
@artpol84
Copy link
Contributor

artpol84 commented Jul 5, 2017

Sorry, wrong button.

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

Just FWIW: it is legitimate according to the standard - it just may be a part of the standard that Pathscale doesn't support, and since it isn't necessary, it would be an easy fix once we confirm that this is indeed the issue.

@artpol84
Copy link
Contributor

artpol84 commented Jul 5, 2017

I actually liked that change, looks very nice and avoids unnecessary variable. I didn't knew about this feature. However indeed it is quite possible that some compilers can stumble upon it as it is not a very common practice.

@PHHargrove
Copy link
Member

@rhc54
I can confirm that your patch corrects the problem on both platforms.

I tried, just now, configuring with CC="pathcc -std=c99" and got an unrelated compile error in libevent that does not occur with CC="pathcc -std=gnu99" (will report separately). However, I did confirm that the error cannot be eliminated via CC="pathcc -std=gnu99".

Re: &(int){1}
Didn't the Open MPI community at some point set down a specific subset of C99 that is required of the compiler, and thus implicitly limit the developer to those features? If so, is this feature on that list? Does PMIx maybe have more aggressive coding standards?

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

I honestly don't recall if OMPI did so. PMIx tries to stay at the lowest common denominator, and so we certainly wouldn't push beyond what OMPI allows. This just slipped thru as something cute and within the standard, but definitely not something we need.

I'll setup the PR to fix it. Thanks!

@hjelmn
Copy link
Member

hjelmn commented Jul 5, 2017

@PHHargrove I think we only agreed on C99. If pathscale doesn't support &(int){1} then it isn't compliant.

@PHHargrove
Copy link
Member

@hjelmn
Perhaps the configure check for C99 support can be strengthened to reject Pathscale?
However, this particular issue appears as a runtime behavior, and not a compile-time one.

FWIW: as I just noted in #3812, if you ask gcc on Linux to be strictly C99 compliant then libevent won't build!

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

We shouldn't reject Pathscale over something this trivial and unnecessary. If we get that strict, we'll be rejecting many compilers as they all have some quirk 😄

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

FWIW: that code was already changed in the PMIx master, and so won't be there once we update to PMIx 2.1.0 (or 3.0, whichever comes first).

@hjelmn
Copy link
Member

hjelmn commented Jul 5, 2017

I really see no point in allowing non-compliant C compilers. I have long since abandoned using anything other than gcc and clang for the C compiler. Intel compiler is too slow, pgcc is buggy, etc. Just use whatever for the fortran bindings. But that is just me.

@hjelmn
Copy link
Member

hjelmn commented Jul 5, 2017

Its really odd that this ends up as a runtime error. Probably means pathscale supports the syntax but there is a bug in the C implementation.

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

yep -but it simply isn't a position we can globally take as a community. I agree that it's a bug in their implementation and should be reported to them.

@PHHargrove
Copy link
Member

@rhc54 I am not sure what "them" one can report to. www.pathscale.com is down and Wikipedia's entry on Pathscale says it has been since May, 2017.

@jsquyres
Copy link
Member Author

jsquyres commented Jul 5, 2017

Have to agree with @rhc here -- it's a pretty obscure feature. We apparently used it in a single place. We should report it to the Pathscale people, and they'll likely fix it.

It might be different if this was a feature that we used widely throughout the code base.

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

dunno - however, the Pathscale folks have responded on our devel list recently, so they do still appear to be around.

@jsquyres
Copy link
Member Author

jsquyres commented Jul 5, 2017

@cbergstrom You guys still around? Can we report a potential Pathscale compiler bug to you?

@PHHargrove
Copy link
Member

I know that Nathan has used this feature in the attempts to fix the patcher/overwrite problems on OpenBSD. So, I wouldn't say this is used in only one place.
FWIW, a quick grep turns up this instance and 2 others in 3.0.0rc1:

$ grep -r '& *([a-z_]*) *{' openmpi-3.0.0rc1
openmpi-3.0.0rc1/ompi/mca/pml/base/pml_base_sendreq.h:                                    &(size_t){0});                      \
openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c:    while (PMIX_SUCCESS == (rc = pmix_bfrop.unpack(buf, kp, &(int){1}, PMIX_KVAL))) {
openmpi-3.0.0rc1/opal/mca/pmix/cray/pmix_cray.c:        while (OPAL_SUCCESS == (rc = opal_dss.unpack(buf, &kp, &(int){1}, OPAL_VALUE))) {

@jsquyres
Copy link
Member Author

jsquyres commented Jul 5, 2017

@PHHargrove I stand corrected. 😄 Perhaps we should update all of these.

@PHHargrove
Copy link
Member

With a better regex to allow digits and spaces in type names I found three more:

$ grep -r '& *( *[a-z0-9_ ]*) *{' openmpi-3.0.0rc1
openmpi-3.0.0rc1/ompi/mca/pml/base/pml_base_sendreq.h:                                    &(size_t){0});                      \
openmpi-3.0.0rc1/opal/mca/btl/ugni/btl_ugni_module.c:    opal_event_evtimer_add (&ugni_module->connection_event, (&(struct timeval) {.tv_sec = 0, .tv_usec = MCA_BTL_UGNI_CONNECT_USEC}));
openmpi-3.0.0rc1/opal/mca/btl/ugni/btl_ugni_add_procs.c:        opal_event_evtimer_add (&ugni_module->connection_event, (&(struct timeval) {.tv_sec = 0, .tv_usec = MCA_BTL_UGNI_CONNECT_USEC}));
openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c:    while (PMIX_SUCCESS == (rc = pmix_bfrop.unpack(buf, kp, &(int){1}, PMIX_KVAL))) {
openmpi-3.0.0rc1/opal/mca/pmix/pmix2x/pmix/src/buffer_ops/internal_functions.c:    return pmix_bfrop_unpack_datatype(buffer, type, &(int32_t){1}, PMIX_DATA_TYPE);
openmpi-3.0.0rc1/opal/mca/pmix/cray/pmix_cray.c:        while (OPAL_SUCCESS == (rc = opal_dss.unpack(buf, &kp, &(int){1}, OPAL_VALUE))) {

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

sigh - yes, let's keep the level of cuteness down a bit

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

I'll take care of the ones in the pmix layer

@hjelmn
Copy link
Member

hjelmn commented Jul 5, 2017

It isn't cuteness. It is a C99 feature to make a temporary pointer. To be compliant a compiler MUST support it. It is not optional.

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

i say "cute" because it is easily coded in an optional manner that yields the exact same binary, and is thus unnecessary.

@hjelmn
Copy link
Member

hjelmn commented Jul 5, 2017

It is actually part of a OMPI required C99 feature (sub-object naming) FWIW. See C99 6.5.2.5 Compound literals.

@jsquyres
Copy link
Member Author

jsquyres commented Jul 5, 2017

@hjelmn I believe you. But this is possibly a minor compiler bug. Let's ask the compiler guys about it and see where to go from there.

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

Understood - but it doesn't change the situation. This isn't a critical feature we can't live without, it doesn't impact performance, etc. So let's just change the code and move on.

@PHHargrove
Copy link
Member

<humor>
@hjelmn C99 also defines a behavior for the inline keyword that GCC violates.
So, Nathan, you can keep using Clang but must (by your own rule) stop using GCC.
</humor>

I am not in a position to make policy for Open MPI or PMIx.
However, as champion for portability, I agree entirely with Ralph that this is a feature that is not necessary to the code and that making a concession to the world of imperfect compilers is the right think to do here. If we were, instead, talking about named struct initializers I would be in favor dropping the offending compiler.

@rhc54
Copy link
Contributor

rhc54 commented Jul 5, 2017

I fixed them all here: #3813

@PHHargrove
Copy link
Member

@jjhursey
Copy link
Member

jjhursey commented Jul 6, 2017

Is this something we can easily add a configure check for? That way configure will fail out if the compiler does not correctly support this aspect of C99. We certainly do this kind of testing of Fortran compilers, so I think it would be good to do so of C compilers. I don't know how difficult it would be to generate a test for it though.

@rhc54
Copy link
Contributor

rhc54 commented Jul 6, 2017

That's a death-spiral - as Paul pointed out, every compiler has a bug in it somewhere. If we begin rejecting compilers that aren't fully compliant, nothing will be left.

Where we want to use a feature that is critical to the implementation or offers a clear performance advantage, then we should take a hard look at the situation if a compiler has issues with it to see if coding around the problem can be done without impacting our objectives. If not, then it makes sense to test/reject.

This case doesn't meet that criteria.

markalle pushed a commit to markalle/ompi that referenced this issue Jul 6, 2017
Fixes open-mpi#3809

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
@hjelmn
Copy link
Member

hjelmn commented Nov 8, 2017

Just as an FYI Pathscale is dead. I would like to revisit the rejection of the &(type){value} syntax as the new atomics take a pointer to type as the compare argument. Will be more than a pain to not be able to use the C99 syntax for creating a temporary pointer.

@PHHargrove
Copy link
Member

I don't pretend that my opinions carry much weight here, but here's my 0.02USD anyway.

Just because Pathscale as a company folded in March 2017 does not mean that nobody is is using their compilers in production. In my experience, the more conservative HPC centers will keep a s/w stack fixed (except for bug fixes) for as long as 5 years.

See you in Denver,
-Paul [sent from my phone]

@hjelmn
Copy link
Member

hjelmn commented Nov 8, 2017

I would like to be more aggressive at dropping older compiler support. The way I see it sites that need an older compiler can either use a modern gcc or clang to as the C compiler to build Open MPI or use an older version of Open MPI. I know I am probably in the minority but making extra work for ourselves for the sake of older systems seems silly.

@rhc54
Copy link
Contributor

rhc54 commented Nov 8, 2017

This is a cross we have born since the very beginning of the project. I'm afraid it is a price of playing in this market segment - as Paul said, those "older systems" are going to be around for quite some time.

Just my $0.002

@hjelmn
Copy link
Member

hjelmn commented Nov 8, 2017

@rhc54 This is an example that shows why I am frustrated with this limitation:

#define BTL_OPENIB_CREDITS_SEND_TRYLOCK(E, Q) \
    OPAL_ATOMIC_BOOL_CMPSET_32(&(E)->qps[(Q)].rd_credit_send_lock, 0, 1)
#define BTL_OPENIB_CREDITS_SEND_UNLOCK(E, Q) \
    OPAL_ATOMIC_BOOL_CMPSET_32(&(E)->qps[(Q)].rd_credit_send_lock, 1, 0)
#define BTL_OPENIB_GET_CREDITS(FROM, TO)                                        \
    do {                                                     \
        TO = FROM;                                           \
    } while(0 == OPAL_ATOMIC_BOOL_CMPSET_32(&FROM, TO, 0))

The easy way:

#define BTL_OPENIB_CREDITS_SEND_TRYLOCK(E, Q) \
    OPAL_ATOMIC_COMPARE_EXCHANGE_STRONG_32(&(E)->qps[(Q)].rd_credit_send_lock, &(int32_t) {0}, 1)
#define BTL_OPENIB_CREDITS_SEND_UNLOCK(E, Q) \
    OPAL_ATOMIC_COMPARE_EXCHANGE_STRONG_32(&(E)->qps[(Q)].rd_credit_send_lock, &(int32_t){1}, 0)
#define BTL_OPENIB_GET_CREDITS(FROM, TO)                                        \
    OPAL_ATOMIC_SWAP_32(&FROM, TO)

The lets support a buggy compiler way:

#define BTL_OPENIB_CREDITS_SEND_TRYLOCK(E, Q) \
    do { \
        int32_t _tmp_value = 0; \
        OPAL_ATOMIC_BOOL_CMPSET_32(&(E)->qps[(Q)].rd_credit_send_lock, &_tmp_value, 1); \
    } while (0)
#define BTL_OPENIB_CREDITS_SEND_UNLOCK(E, Q) \
    do { \
        int32_t _tmp_value = 1; \
        OPAL_ATOMIC_BOOL_CMPSET_32(&(E)->qps[(Q)].rd_credit_send_lock, &_tmp_value, 0); \
    } while (0)
#define BTL_OPENIB_GET_CREDITS(FROM, TO)                                        \
    OPAL_ATOMIC_SWAP_32(&FROM, TO)

I mean it is possible to write the code but it is messy and can lead to bugs. There are a lot of runtime-detected compiler bugs out there that eventually cause problems. Typically the compiler vendor fixes it and we add a note about a particular compiler version. In this case it will never be fixed and is a fundamental flaw in the sub-object naming support in Pathscale.

@rhc54
Copy link
Contributor

rhc54 commented Nov 8, 2017

I'm sorry, I know this means a lot to you - but I honestly fail to see that much difference in the code. This seems like a tempest in a very small mug of coffee to me.

I'm happy to discuss this with the PMIx community, but I personally am skeptical of blacklisting compilers for things that seem rather trivial (and yes, I know you are passionate about it).

@hjelmn
Copy link
Member

hjelmn commented Nov 8, 2017

PMIx doesn't have to go the same way as Open MPI though it makes sense for PMIx to be compiled with the system C compiler (usually gcc, icc, or clang) as there is no fortran or C++ interface (that I know of). Open MPI is definitely more complicated but as far as I can tell we do not officially support Pathscale. It isn't in the supported, tested, or lightly tested lists. We do have some notes about versions that just do not work though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants