-
Notifications
You must be signed in to change notification settings - Fork 876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v3.0.0rc1: PMIx issue: UNPACK-INADEQUATE-SPACE #3809
Comments
Does the problem persist if vader BTL is used instead of sm? |
I can't replicate it using the head of the 3.0 branch |
@hppritcha No, |
In the flood of data that arrives each time I test an RC, I had failed to notice that despite the fact that one is x86-64 and the other ppc64el, both failing systems are using PathScale compilers! I don't have other runs on the x86-64 host, but on the ppc64el host a build with clang runs just fine! So, it looks possible that this could be a PathScale compiler bug. |
x86-64: ekopath-6.0.963 (snapshot / free download)
ppc64el: enzo-6.0.9
|
It could be a compiler issue, but more likely an uninitialized variable that Pathscale sets to a different default. We can poke in the code a bit and see what's going on. |
Here's the culprit:
My guess is that Pathscale doesn't appreciate that cute |
@PHHargrove Can you please try with the following diff: diff --git a/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c b/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c
index 573a83d..3d29fc2 100644
--- a/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c
+++ b/opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c
@@ -2428,6 +2428,7 @@ static int _store_data_for_rank(ns_track_elem_t *ns_info, pmix_rank_t rank, pmix
rank_meta_info *rinfo = NULL;
size_t num_elems, free_offset, new_free_offset;
int data_exist;
+ int32_t cnt;
PMIX_OUTPUT_VERBOSE((10, pmix_globals.debug_output,
"%s:%d:%s: for rank %u", __FILE__, __LINE__, __func__, rank));
@@ -2458,7 +2459,8 @@ static int _store_data_for_rank(ns_track_elem_t *ns_info, pmix_rank_t rank, pmix
*/
free_offset = get_free_offset(datadesc);
kp = PMIX_NEW(pmix_kval_t);
- while (PMIX_SUCCESS == (rc = pmix_bfrop.unpack(buf, kp, &(int){1}, PMIX_KVAL))) {
+ cnt = 1;
+ while (PMIX_SUCCESS == (rc = pmix_bfrop.unpack(buf, kp, &cnt, PMIX_KVAL))) {
pmix_output_verbose(2, pmix_globals.debug_output,
"pmix: unpacked key %s", kp->key);
if (PMIX_SUCCESS != (rc = pmix_sm_store(ns_info, rank, kp, &rinfo, data_exist))) { |
I recall that @hjelmn did similar (and maybe this) changes when we were debugging a dstore issues. If this is the root cause - may need to check for other places. |
Sorry, wrong button. |
Just FWIW: it is legitimate according to the standard - it just may be a part of the standard that Pathscale doesn't support, and since it isn't necessary, it would be an easy fix once we confirm that this is indeed the issue. |
I actually liked that change, looks very nice and avoids unnecessary variable. I didn't knew about this feature. However indeed it is quite possible that some compilers can stumble upon it as it is not a very common practice. |
@rhc54 I tried, just now, configuring with Re: |
I honestly don't recall if OMPI did so. PMIx tries to stay at the lowest common denominator, and so we certainly wouldn't push beyond what OMPI allows. This just slipped thru as something cute and within the standard, but definitely not something we need. I'll setup the PR to fix it. Thanks! |
@PHHargrove I think we only agreed on C99. If pathscale doesn't support |
We shouldn't reject Pathscale over something this trivial and unnecessary. If we get that strict, we'll be rejecting many compilers as they all have some quirk 😄 |
FWIW: that code was already changed in the PMIx master, and so won't be there once we update to PMIx 2.1.0 (or 3.0, whichever comes first). |
I really see no point in allowing non-compliant C compilers. I have long since abandoned using anything other than gcc and clang for the C compiler. Intel compiler is too slow, pgcc is buggy, etc. Just use whatever for the fortran bindings. But that is just me. |
Its really odd that this ends up as a runtime error. Probably means pathscale supports the syntax but there is a bug in the C implementation. |
yep -but it simply isn't a position we can globally take as a community. I agree that it's a bug in their implementation and should be reported to them. |
@rhc54 I am not sure what "them" one can report to. www.pathscale.com is down and Wikipedia's entry on Pathscale says it has been since May, 2017. |
Have to agree with @rhc here -- it's a pretty obscure feature. We apparently used it in a single place. We should report it to the Pathscale people, and they'll likely fix it. It might be different if this was a feature that we used widely throughout the code base. |
dunno - however, the Pathscale folks have responded on our devel list recently, so they do still appear to be around. |
@cbergstrom You guys still around? Can we report a potential Pathscale compiler bug to you? |
I know that Nathan has used this feature in the attempts to fix the patcher/overwrite problems on OpenBSD. So, I wouldn't say this is used in only one place.
|
@PHHargrove I stand corrected. 😄 Perhaps we should update all of these. |
With a better regex to allow digits and spaces in type names I found three more:
|
sigh - yes, let's keep the level of cuteness down a bit |
I'll take care of the ones in the pmix layer |
It isn't cuteness. It is a C99 feature to make a temporary pointer. To be compliant a compiler MUST support it. It is not optional. |
i say "cute" because it is easily coded in an optional manner that yields the exact same binary, and is thus unnecessary. |
It is actually part of a OMPI required C99 feature (sub-object naming) FWIW. See C99 6.5.2.5 Compound literals. |
@hjelmn I believe you. But this is possibly a minor compiler bug. Let's ask the compiler guys about it and see where to go from there. |
Understood - but it doesn't change the situation. This isn't a critical feature we can't live without, it doesn't impact performance, etc. So let's just change the code and move on. |
I am not in a position to make policy for Open MPI or PMIx. |
I fixed them all here: #3813 |
Re: www.pathscale.com being down |
Is this something we can easily add a configure check for? That way configure will fail out if the compiler does not correctly support this aspect of C99. We certainly do this kind of testing of Fortran compilers, so I think it would be good to do so of C compilers. I don't know how difficult it would be to generate a test for it though. |
That's a death-spiral - as Paul pointed out, every compiler has a bug in it somewhere. If we begin rejecting compilers that aren't fully compliant, nothing will be left. Where we want to use a feature that is critical to the implementation or offers a clear performance advantage, then we should take a hard look at the situation if a compiler has issues with it to see if coding around the problem can be done without impacting our objectives. If not, then it makes sense to test/reject. This case doesn't meet that criteria. |
Fixes open-mpi#3809 Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Just as an FYI Pathscale is dead. I would like to revisit the rejection of the &(type){value} syntax as the new atomics take a pointer to type as the compare argument. Will be more than a pain to not be able to use the C99 syntax for creating a temporary pointer. |
I don't pretend that my opinions carry much weight here, but here's my 0.02USD anyway. Just because Pathscale as a company folded in March 2017 does not mean that nobody is is using their compilers in production. In my experience, the more conservative HPC centers will keep a s/w stack fixed (except for bug fixes) for as long as 5 years. See you in Denver, |
I would like to be more aggressive at dropping older compiler support. The way I see it sites that need an older compiler can either use a modern gcc or clang to as the C compiler to build Open MPI or use an older version of Open MPI. I know I am probably in the minority but making extra work for ourselves for the sake of older systems seems silly. |
This is a cross we have born since the very beginning of the project. I'm afraid it is a price of playing in this market segment - as Paul said, those "older systems" are going to be around for quite some time. Just my $0.002 |
@rhc54 This is an example that shows why I am frustrated with this limitation:
The easy way:
The lets support a buggy compiler way:
I mean it is possible to write the code but it is messy and can lead to bugs. There are a lot of runtime-detected compiler bugs out there that eventually cause problems. Typically the compiler vendor fixes it and we add a note about a particular compiler version. In this case it will never be fixed and is a fundamental flaw in the sub-object naming support in Pathscale. |
I'm sorry, I know this means a lot to you - but I honestly fail to see that much difference in the code. This seems like a tempest in a very small mug of coffee to me. I'm happy to discuss this with the PMIx community, but I personally am skeptical of blacklisting compilers for things that seem rather trivial (and yes, I know you are passionate about it). |
PMIx doesn't have to go the same way as Open MPI though it makes sense for PMIx to be compiled with the system C compiler (usually gcc, icc, or clang) as there is no fortran or C++ interface (that I know of). Open MPI is definitely more complicated but as far as I can tell we do not officially support Pathscale. It isn't in the supported, tested, or lightly tested lists. We do have some notes about versions that just do not work though. |
From thread on devel started by @PHHargrove (https://www.mail-archive.com/devel@lists.open-mpi.org/msg20280.html): On (at least) two different hosts (both Linux, one x86-64 and one ppc64el) I am seeing a failure to launch ring_c with errors like those shown below.
Configure options: --prefix=[...] --enable-debug CC=... CXX=... FC=...
One of the two also had --with-libfabric=/usr/local/pkg/libfabric-1.4.0
If it might matter, one of these systems has 14 interfaces listed by /usr/sbin/ifconfig and the other has 8.
The text was updated successfully, but these errors were encountered: