opal/runtime: Add a hang-up detection feature #3700
Conversation
opal/runtime/opal_progress.c (outdated)

```c
                     void *cbdata)
{
    char prefix[100 + OPAL_MAXHOSTNAMELEN];
    FILE *stream = stderr;
```
Instead of hardcoding `stderr`, could we harness the `opal_stacktrace_output_filename` MCA parameter (`-mca opal_stacktrace_output`) to allow for some flexibility in the destination of the stack output (so we can choose a file, for example)? There is some logic in `opal/util/stacktrace.c` to process that MCA parameter. Maybe we can move that logic to a more commonly accessible area.
Thanks. I'll update the PR to use it.
This has been tried before with little success, mainly because (1) the timeout is application-specific and (2) we prevent multi-threaded applications from using blocking communications to provide progress. OMPI has support for debugging message queues, a mechanism that can be used by an external process (such as padb) to extract information about all pending requests and aggregate this information before exposing it to the users. I think similar mechanisms are already implemented in TV. We should certainly add this to the list of discussions for the developers meeting.
@bosilca I know this is not a complete/sophisticated solution. This feature is disabled by default, and a user who suffers from a hang-up bug should enable it with a proper (application-specific) timeout. Though not all applications can use it, many applications (with a bug) benefit from it, and many of our customers use it when debugging. The advantage of this code is that a user does not need any external tools and can try it with just an MCA parameter. Under a batch job scheduling system, attaching to MPI processes is sometimes difficult, so that is an important point. The message queue mechanism, which I had forgotten, can be useful; I'll look into it. Thanks. I added this to the meeting topics list on the Wiki.
Debugging a communication deadlock of MPI processes is a difficult task. If one process stops making progress for some reason, the peer processes waiting on it also stop making progress, and eventually all processes in the job stall. Finding the causal process is usually not trivial. Sometimes even determining whether the situation is a communication deadlock or just a slowdown is difficult. This commit adds a feature to detect a possible deadlock (hang-up) and output information that may be useful for analyzing it.

Added Feature
=============

Detection
---------

The deadlock detection logic is very simple: if a waiting operation (in `MPI_WAIT` etc.) does not complete within a certain time, we consider it a hang-up. The time limit can be set in seconds with the MCA parameter `opal_progress_timeout`. I know this logic is suboptimal and there are studies on communication deadlock detection/analysis, but it is easy to implement and has proved very useful among Fujitsu MPI customers. By default, `opal_progress_timeout` is 0 and this feature is disabled.

Output
------

If a hang-up situation is detected, the following information is output by each process:
- a message indicating that a possible hang-up was detected
- a stack trace
- data of the waiting request

An output example:

```
[dirac:19756] Possible hang-up (no progress) is detected on [[56375,1],0]
[dirac:19756] [ 0] /home/tkawa/openmpi/lib/libopen-pal.so.0(opal_progress_handle_hangup+0xc0)[0x7fa267a9a540]
[dirac:19756] [ 1] /home/tkawa/openmpi/lib/openmpi/mca_pml_ob1.so(+0x102e8)[0x7fa25c2c42e8]
[dirac:19756] [ 2] /home/tkawa/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x599)[0x7fa25c2c6731]
[dirac:19756] [ 3] /home/tkawa/openmpi/lib/libmpi.so.0(PMPI_Send+0x2a7)[0x7fa26873995e]
[dirac:19756] [ 4] ./deadlock[0x400c7c]
[dirac:19756] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fa2680dcb45]
[dirac:19756] [ 6] ./deadlock[0x400a99]
[dirac:19756] send request (c=0xb6c100 f=UNDEFINED) to rank=0 (jobid=3694592001 vpid=0) on communicator=MPI_COMM_WORLD (c=0x601500 f=0 id=0 my_rank=0) with tag=0 for datatype=MPI_CHAR (c=0x601300 f=34 id=1) x count=131072 in addr=0x7ffcdd9cdb00 [complete=n state=2 type=1 pml_complete=n free_called=n sequence=1 send_mode=4 bytes_packed=131072 recv=(nil) state=0 throttle_sends=n pipeline_depth=0 bytes_delivered=0 rdma_cnt=1 pending=0]
```

Assembling the output from all processes and analyzing the data will help find the cause of the hang-up, but such a script is not ready at this moment.

Termination
-----------

By default, when a hang-up is detected and the information has been output, the process aborts with `exit(1)`. If one MPI process aborts, `orted` terminates all MPI processes. In many cases, however, other processes also detect the hang-up and try to output information. To ensure the information is output by all processes, the existing MCA parameter `opal_abort_delay` can be used. If a positive value is specified, each process sleeps for the given number of seconds to wait for the other processes to output their information. If a negative value is specified, each process sleeps forever, which allows attaching a debugger.
This detection logic can give false positives. If the value of `opal_progress_timeout` is negative, the process does not abort; it only outputs the information and continues to progress when a possible hang-up is detected.

Similar Feature
===============

`orterun` has the `--timeout` option and can also detect a possible hang-up. The advantages of this feature are:

- The timeout of this feature is applied on a per-operation basis.
- Request data can be output.
- It can be used with resource managers other than `orted`.

Code Modification
=================

Detection
---------

`while` loops that call the `opal_progress` function are replaced with the `OPAL_PROGRESS_WHILE` and `OPAL_PROGRESS_BLOCK_WHILE` macros, which handle the timeout. In this commit, only the `while` loops in the following functions are replaced:

- `sync_wait_st`
- `sync_wait_mt`
- `opal_condition_wait`

These functions cover most blocking MPI functions, but not all. Loops in other functions can be replaced easily in the same way.

Output
------

In this commit, output of request data is supported only for ob1 PML requests. You can support data output by setting the `req_dump` callback function of the `ompi_request_t` structure.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Instead of hardcoding `stderr`, harness the `opal_stacktrace_output` MCA parameter to allow for some flexibility in the destination of the hang-up detection message and the stack output. Thanks to Josh Hursey for the suggestion. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Force-pushed from 606fb08 to 5a1341b.
bot:mellanox:retest
I'll try another approach.
Add a hang-up detection feature
I want to discuss this code modification at the upcoming July 2017 OMPI developer's meeting.