-
Notifications
You must be signed in to change notification settings - Fork 876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement a MCA framework for threads #6578
Conversation
Just curious - I see a change in libevent itself. Does that mean we will no longer be able to support an external libevent? The change seems minor - are you planning to remove it before committing? |
@npe9 did you come up with a way to avoid the libevent patch? |
@hppritcha @rhc54 External libevents should still be fine. The msec change is for argobots progress I believe (@shintaro-iwasaki?). We probably need to look at better integrating libevent with tasking runtimes but that's orthogonal to this pull request. The other change is the openmpi specific header ( With @shintaro-iwasaki's permission, you can probably back out the msec change. |
What happens if OMPI is using argobots (or whatever) and PMIx is using libevent? I don't offhand see why there would be an issue, but we are interweaving our way in/out of threads between the two libraries so I figure it might be worth asking. |
@rhc54 The idea is to have them both using the same scheduling substrate, integrating PMIx more tightly is on my todo list. Once this code is in, I intend to be more active on the PMIx slack. I'm also interested in caching mechanisms for PMIx that would potentially obviate the need for TLS entirely. |
There were some timeouts in the CI testing -- might be worth checking out to see if they are due to the code changes in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started this review by looking at the individual commits, not realizing that there's a sequence of WIP commits (vs. a final set of polished / complete commits). E.g., some things I commented on in early commits are fixed in later commits. I trust that these types of things will be squashed down appropriately before this is merged. For such things, you can disregard my comments.
I then started reading the entire PR as a whole (vs. the individual commits). But I stopped because I'm somewhat confused. It seems like this PR is only halfway converting to a thread framework -- there seem to be parts still left in the base and parts in a framework. There's parts that do not obey the framework / component symbol prefix rule. The component configure.m4
scripts do not seem to actually do functional tests. There's no explanation for what the component API is, or what the intent is. It also seems that this is a compile-time framework (vs. a run-time decision framework). An explanation for that would be appreciated, because this will tend to lead to multiple Open MPI installations (vs. a single installation where the threading package can be chosen at run time).
Overall, there's no top-level description that explains what precisely this new "threads" framework is encompassing (and why). It seems like there was movement of threads, mutexes, condition variables, and at least some of the peripherary surrounding them into a framework. A cohesive explanation of the new view of the world would be greatly appreciated, either in a README[.md] in a .h
file somewhere (i.e., it should be part of the code base, not something that exists only on the gtihub UI). E.g., it would be good to explain exactly what a component needs to implement in order to be able to be used in this framework.
return 0; | ||
} | ||
int opal_uses_threads = 0; | ||
int opal_tsd_keys_destruct() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a (void)
declaration here.
|
||
int opal_condition_t_class; | ||
int opal_mutex_t_class; | ||
int opal_recursive_mutex_t_class; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't these need to be initialized to avoid common symbols?
|
||
static int mca_threads_base_register(mca_base_register_flag_t flags) | ||
{ | ||
// Do I need to register anything here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you? Please answer this question / update the code rather than leave a "to do"-like comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsquyres I don't believe I do, but I'd like to make sure with someone that knows more about proper mca instantiation before I get rid of the comment. Who's a good person to review this with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably @hppritcha. If you have nothing to register, you should delete this function.
} | ||
|
||
void opal_thread_set_main() { | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Void functions can't return a value. Also need a (void
) in the declaration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
threads_base_null was an artifact of the initial template (the timer mca), but looking at your message I think these went away after the squash.
@@ -281,11 +281,11 @@ if test "$opal_pthread_c_success" = "0"; then | |||
opal_pthread_c_success=0) | |||
AC_LANG_POP(C) | |||
if test "$opal_pthread_c_success" = "1"; then | |||
PTHREAD_CFLAGS="$pf" | |||
TPKG_CFLAGS="$pf" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is TPKG
? Why is it better than a PTHREAD
prefix? The tests here are specific to pthreads, after all...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hppritcha can explain the motivation better than I can, but I believe that the TPKG_CFLAGS get used for every possible thread package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@npe9 is correct. TPK_CFLAGS is used on opal_config_threads.m4. Right now the argo and qthreads configure functions don't use it, but the name was to indicate its more general and may be set by other thread package configure functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well we moved all of the threads related configury into the framework, so probably this comment doesn't apply anymore?
config/opal_config_threads.m4
Outdated
|
||
OPAL_SUMMARY_ADD([[Miscellaneous]],[[Threading Package]],[opal_threads], [$thread_type_found]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that threads is a framework, shouldn't all of this be in opal/mca/threads/configure.m4
and/or the various component configure.m4
files? Seems like an abstraction violation to have all this stuff at the top level now that there is a framework and components for each of these individual things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this has been done -- I still see lots of logic and hard-coding for argobots in config/opal_config_threads.m4
. Shouldn't config/opal_config_threads.m4
go away and be fully replaced by framework-level thread m4 stuff?
@@ -162,6 +162,9 @@ poll_dispatch(struct event_base *base, struct timeval *tv) | |||
|
|||
EVBASE_RELEASE_LOCK(base, th_base_lock); | |||
|
|||
if (msec > 0) { | |||
msec = 0; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in the comments on this PR, this change to libevent is... tricky. What does it mean for external libevents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted this
AC_DEFUN([MCA_opal_threads_argobots_CONFIG],[ | ||
AC_CONFIG_FILES([opal/mca/threads/argobots/Makefile]) | ||
|
||
AC_MSG_CHECKING([HAVE_THREAD_PKG_TYPE = $HAVE_THREAD_PKG_TYPE]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This output message seems weird.
Don't you need some actual functional tests here for argobots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed - unnecessary messages
evthread_argobots_lock_alloc(unsigned locktype) | ||
{ | ||
ABT_mutex lock; | ||
if (locktype & EVTHREAD_LOCKTYPE_RECURSIVE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeting level is wrong.
4 space indenting levels, please -- and no tabs.
@jsquyres Thanks for the feedback regarding documentation. I explained our concerns about making the threading library shared initially (performance) inline to your comments. If that explanation is sufficient I can add it to a README in the mca root. I'll also provide a writeup of the motivation for the component when I get a chance. Also, who's the proper person to act as a shepherd for documentation issues? |
@npe9 Yea, please add a README or a large comment somewhere in this PR. I ask because this framework seems to be a bit different than our other frameworks (i.e., provide a component struct and provide a module struct, and all the APIs are available through there). Some explanation of exactly what component authors are supposed to provide would be greatly appreciated. |
i think the AWS failures are legit but just in case |
FYI The README is coming, I'm on paternity leave and cramming for an April 22nd paper deadline. I'll do the README when things slow down. |
bot:ompi:retest |
1 similar comment
bot:ompi:retest |
Is this PR ready to do some performance regression testing on ppc64le? |
@gpaulsen yes. I'm trying to figure out the ubuntu problem we hitting with AWS CI. |
@hppritcha I don't think Ubuntu was the problem (it looks like Ubuntu was killed because another job failed). If you follow the red balls of failure, you see that the --disable-oshmem build failed. It failed during execution of ring_c, during shutdown, because of a segmentation fault:
My guess is that something after the MCA framework is shut down is making a call into the threading framework. Not sure why the --disable-oshmem would change that, so it might just be racey. |
@bwbarrett so far its only been on the ubuntu image that we're seeing this segfault. i can't reproduce on redhat or suse, so trying it in a vm running ubuntu. |
hmm... let's see if it fails just in fortran test again. |
UH was having issues, so restart ompi CI |
52e6bbb
to
6cdb2df
Compare
@rhc54 I excised the libevent change from this PR. |
@jsquyres i think we've cleaned up the issues you found with the PR? Could you review again? We'll squash down the commits - maybe not to one. Also, we plan to add some perf data soon. |
MPI put bibw rate using the fence variant of the test from the rma-mt repo: https://github.com/hpc/rma-mt |
So no noticable impact from the PR. That's good. |
Errr...if there is no identifiable benefit, then what is the point of this change? |
@rhc54 The benefit is adding support for different threading models. The first step is adding a framework. After that I believe we intend to add support for QThreads. |
@rhc54 The potential benefit is better integration of progress threads and threading runtimes (especially OpenMP via Bolt). The first goal though (this PR) is to get the infrastructure in place and "do no harm". The results showing no performance hit on the pthreads code is the start. Once integrated, the next step is to go through and add an opal_yield call instead of usleeps (the pthread mca will still use usleeps though) and work to get libevent working non-preemptively. |
@devreal Consider the case where there are two qthreads or argobots threads that are both mapped to a single pthread. Now, if one of the threads tries to do a blocking send to the other thread, which in turn is trying to do a blocking receive, this program can deadlock unless the MPI library can yield from one thread to another. Note that pthread yield is not sufficient here because both the threads are mapped to the same pthread. This is required by the MPI standard. MPI-3.1 page 485, lines 6-7:
The reason I pointed out that they have independent stacks is to indicate that they really are "threads" (as opposed to OpenMP or Argobots tasks, for example). FWIW, the same problem with occur with Solaris threads, which uses a similar model, I believe. Moreover, the pthread specification doesn't actually say that each pthread is mapped to a kernel thread. It just that most (all?) implementations of pthreads do that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow my comments got registered but not my review. I have to resubmit it.
Sorry about this long post. This may not be the right forum but I think it's important to bring up. tl;dr: IMO, ULTs are not threads in the sense of the MPI standard and should not be treated that way. @pavanbalaji I've been grappling with this for some time now. I believe that the terminology we're using is quite convoluted and unclear. Here is what my understanding is right now and I'm happy to stand corrected: lightweight processes (LWPs) are a kernel construct that allow multiple threads inside a single process to share resources. Threads are scheduled by a scheduler and are pre-emtable. There is no need for a 1:1 mapping between LWPs and threads but some scheduler is responsible for some degree of fairness between threads (it is my understanding that pthreads are always pre-emptable since they carry their own scheduling policy and priority). This is what the MPI standard refers to as "a thread package similar to POSIX threads". It is my understanding that even Solaris threads are pre-emptable [1] (the MxN mapping between LWPs and Solaris threads was apparently abandoned in 2002 [2]). In that scheme, LWPs are a way to control the degree of parallelism by limiting the number of parallel active threads. So, thread support in MPI means protecting shared data structures against concurrent accesses. This is the guarantee that IMO, the term User Level Thread is misleading. What you are describing is a fiber: a lightweight thread that is executed by a regular thread in it's own context (stack) but scheduled cooperatively [3]. The boost.fiber package is an example, not to be confused with boost.thread, which is essentially a wrapper around POSIX/Windows threads. We may disagree on this and I'm happy to adjust my definitions. It is important, however, that ULTs/fibers have different scheduling semantics than POSIX threads. The point here is that by using blocking MPI calls inside different fibers potentially running on the same thread your program is ill-formed. Essentially, it is the same as creating a pthread and issuing both blocking send and receive in sequence. You may do so switching stacks in between. Yes, MPI will block that thread and allow all other threads in the process to happily go about their lifes. But that one thread will remain blocked. That is in accordance with the requirement in the MPI standard you cited. And MPI provides non-blocking communication to avoid this. What this PR does is shift the burden of correct scheduling (as opposed to correct mutual exclusion) into the MPI layer. The consequences are significant: an application will still see I fully agree that mixing ULT/fibers and MPI is an issue that needs to be addressed and there have been other approaches to that end (TAMPI comes to mind, there may be others). I am not convinced that MPI implementations are the right point for tackling this and we should be very cautious opening that can of worms. As I said at the top, this may not be the right forum but I think it's important to have that discussion. If the community decides that broadening the definition of what a thread is in the sense of MPI that's fine. I just wanted to throw in my 2 cents here :) [1] https://docs.oracle.com/cd/E19620-01/805-5080/6j4q7emic/index.html |
I won't weigh in on the question of whether or not to do this - but I am a tad concerned about it being a configure-time decision. It sounds like we are going to now require that organizations not only build and maintain OMPI against various types of compilers, but also multiply that by the number of different threading models?? Did we just triple the number of installations an organization has to maintain? Is there no way this could be done as a runtime decision? |
I share your concern, but threading constructs are so fundamental to performance and so low level that I'm concerned a run-time selection would be catastrophic for performance. From a technical perspective, I'm also not sure how we could implement static initializers (which we sprinkled throughout the code) with a run-time selection, since many of those initializers fire before we even get to MPI_INIT. |
Yeah, I also feared that would be the case - just hoped that maybe I was wrong and we could avoid pissing off all the installers and downstream packagers. |
@devreal I think this discussion is getting out of the context of this PR and we should probably move it outside. I feel that saying that only entities that have associated kernel threads would be considered "threads" is quite restrictive (even pthreads do not have that requirement in the POSIX standard). But at this point we have gotten into opinions, rather than facts, so I'll let the OMPI developers decide on this. |
That was not my argument. You can come up with a preempting user-level thread scheduler that meets the POSIX requirements. The point is the difference in scheduling behavior, which is not just an opinion. Anyway, I agree that this is beyond the scope of the PR. I'm happy to discuss this in a different forum. @bwbarrett @rhc54 I think Cray's MPI implementation does runtime selection of implementations based on |
@devreal the Cray MPI isn't that fancy. I forget the details but I think they have some single threaded optimizations that are on by default. MPI_init_thread by default will return max provided of MPI_THREAD_SERIALIZED I think. This allowed for some of the OSU/IMB results to look better than if MPI_THREAD_MULTIPLE was supported by default. adding this env. variable was a way to support MPI_THREAD_MULTIPLE - but not by default. |
@hppritcha Darn! So much pain just for some OSU benchmark results? Oh well, and I thought there really was some fancyness at work there... |
I will check failures of |
bot:ibm:scale:retest |
9edf6fc
to
3dcdcaf
Compare
Add a framework to support different types of threading models including user space thread packages such as Qthreads and argobot: https://github.com/pmodels/argobots https://github.com/Qthreads/qthreads The default threading model is pthreads. Alternate thread models are specificed at configure time using the --with-threads=X option. The framework is static. The theading model to use is selected at Open MPI configure/build time. mca/threads: implement Argobots threading layer config: fix thread configury - Add double quotations - Change Argobot to Argobots config: implement Argobots check If the poll time is too long, MPI hangs. This quick fix just sets it to 0, but it is not good for the Pthreads version. Need to find a good way to abstract it. Note that even 1 (= 1 millisecond) causes disastrous performance degradation. rework threads MCA framework configury It now works more like the ompi/mca/rte configury, modulo some edge items that are special for threading package linking, etc. qthreads module some argobots cleanup Signed-off-by: Noah Evans <noah.evans@gmail.com> Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov> Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
To suppress Valgrind warnings, opal_tsd_keys_destruct() needs to explicitly release TSD values of the main thread. However, they were not freed if keys are created by non-main threads. This patch fixes it. This patch also optimizes allocation of opal_tsd_key_values by doubling its size when count >= length instead of increasing the size by one. Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
Argobots/Qthreads-aware libevent should be used instead. Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
3dcdcaf
to
d7fba60
Compare
Is this ok to merge, or is there still discussion/changes needed? We would like to get this in well before the 5.0 branch date. |
@hppritcha, I say we merge it this weekend. Any objection? |
sorry wrong PR about merging back to 4.0.x. of course this doesn't go back to 4. |
@hppritcha sounds good. Merge when you are ready. |
Add a framework to support different types of threading models including
user space thread packages such as Qthreads and argobots:
https://github.com/pmodels/argobots
https://github.com/Qthreads/qthreads
The default threading model is pthreads. Alternate thread models are
specificed at configure time using the --with-threads=X option.
The framework is static. The theading model to use is selected at
Open MPI configure/build time.