
hpcrun with OpenMPI: deadlock profiling MPI program (ARM & x86 machines) #73

@laksono

Description

Platform: ARM
Branch: master and mult-kernel
Compiler: GCC
OpenMPI 3.0.0 (gcc)

AMG2006 and Nekbone work fine without hpctoolkit:

[la5@arm1 test]$ time ./amg2006 -P 1 1 1   -r 10 10 10
...
real    0m7.140s
user    0m14.706s
sys     0m0.248s

But both deadlock when profiled with REALTIME or with perf_events' CYCLES.

The following error messages appear when profiling with REALTIME:

$ /home/la5/pkgs/hpctoolkit-master/bin/hpcrun -dd LINUX_PERF  -e REALTIME@5000 ./amg2006 -P 1 1 1  -r 10 10  10
[arm1.cs.rice.edu:03399] PMIX ERROR: ERROR in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1658
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1759
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1123
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c at line 112
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c at line 392
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/server/pmix_server.c at line 518

After the error messages, the application deadlocks in MPI initialization.
Location of the deadlock, according to gdb:

(gdb) bt
#0  0x000040003e5cc754 in __read_nocancel () from /lib64/libpthread.so.0
#1  0x000040003f469ae0 in rte_init.part () from /projects/pkgs/openmpi-3.0.0/lib/openmpi/mca_ess_singleton.so
#2  0x000040003e8e02ac in orte_init () from /projects/pkgs/openmpi-3.0.0/lib/libopen-rte.so.40
#3  0x000040003e4912f0 in ompi_mpi_init () from /projects/pkgs/openmpi-3.0.0/lib/libmpi.so.40
#4  0x000040003e4b8314 in PMPI_Init () from /projects/pkgs/openmpi-3.0.0/lib/libmpi.so.40
#5  0x000040003e33df58 in MPI_Init (argc=0xfffff2f500ec, argc@entry=0xfffff2f501fc, argv=0xfffff2f500e0, argv@entry=0xfffff2f501f0) at ../../libmonitor/src/mpi_init_c.c:33
#6  0x0000000000401ed8 in main (argc=<optimized out>, argv=<optimized out>) at amg2006.c:1615
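
Since the hang occurs inside PMPI_Init, before any application-specific work, a trivial MPI program run the same way should be enough to check whether the problem is tied to AMG2006/Nekbone at all. A minimal sketch of such a reproducer (hypothetical, not part of the original runs; the file name and build line are assumptions):

/* mpi_init_min.c - hypothetical minimal reproducer, not from the original report.
 * Build:  mpicc mpi_init_min.c -o mpi_init_min
 * Run under hpcrun the same way as amg2006, e.g.:
 *   hpcrun -e REALTIME@5000 ./mpi_init_min
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* The reported deadlock is inside MPI_Init (singleton rte_init), so
     * reaching the printf below means the hang did not reproduce. */
    MPI_Init(&argc, &argv);

    int rank = -1, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d: MPI_Init completed\n", rank, size);

    MPI_Finalize();
    return 0;
}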

Somehow there are two hpcfnbounds processes:

[la5@arm1 test]$ ps
  PID TTY          TIME CMD
 1894 pts/12   00:00:06 amg2006
 1909 pts/12   00:00:00 hpcfnbounds-bin
 1918 pts/12   00:00:00 hpcfnbounds-bin
 1931 pts/12   00:00:00 ps
21150 pts/12   00:00:00 bash

It seems hpcrun causes a deadlock in MPI initialization. The timer still fires (the sample handler appears on the stack below), but the application cannot make progress past MPI_Init:

[la5@arm1 test]$ gdb -p 1894
...
(gdb) bt
#0  0x0000ffff962476e8 in __write_nocancel () from /lib64/libpthread.so.0
#1  0x0000ffff96514fe4 in hpcrun_write_msg_to_log (echo_stderr=false, add_thread_id=true, tag=0xffff96532dc8 "NORM_IP", fmt=0xffff96532da0 "normalizing %p, w load_module %s",
    box=0xfffff3717bd0) at ../../../../src/tool/hpcrun/messages/messages-async.c:242
#2  0x0000ffff96514c00 in hpcrun_pmsg (tag=0xffff96532dc8 "NORM_IP", fmt=0xffff96532da0 "normalizing %p, w load_module %s") at ../../../../src/tool/hpcrun/messages/messages-async.c:154
#3  0x0000ffff965170d4 in hpcrun_normalize_ip (unnormalized_ip=0x401e90 <main>, lm=0xffff9583ae48) at ../../../../src/tool/hpcrun/utilities/ip-normalized.c:72
#4  0x0000ffff9651fcd4 in compute_normalized_ips (cursor=0xfffff3718278) at ../../../../src/tool/hpcrun/unwind/common/libunw_intervals.c:102
#5  0x0000ffff9651fda4 in libunw_find_step (cursor=0xfffff3718278) at ../../../../src/tool/hpcrun/unwind/common/libunw_intervals.c:124
#6  0x0000ffff965201e4 in libunw_unw_step (cursor=0xfffff3718278) at ../../../../src/tool/hpcrun/unwind/common/libunw_intervals.c:251
#7  0x0000ffff96521834 in hpcrun_unw_step (cursor=0xfffff3718278) at ../../../../src/tool/hpcrun/unwind/generic-libunwind/libunw-unwind.c:176
#8  0x0000ffff9651ea6c in hpcrun_generate_backtrace_no_trampoline (bt=0xfffff37193f0, context=0xfffff3719750, skipInner=0) at ../../../../src/tool/hpcrun/unwind/common/backtrace.c:249
#9  0x0000ffff9651ecbc in hpcrun_generate_backtrace (bt=0xfffff37193f0, context=0xfffff3719750, skipInner=0) at ../../../../src/tool/hpcrun/unwind/common/backtrace.c:310
#10 0x0000ffff96502eb0 in help_hpcrun_backtrace2cct (bundle=0xffff9589ffb0, context=0xfffff3719750, metricId=0, metricIncr=..., skipInner=0, isSync=0, data=0x0)
    at ../../../../src/tool/hpcrun/cct_insert_backtrace.c:379
#11 0x0000ffff96502a28 in hpcrun_backtrace2cct (cct=0xffff9589ffb0, context=0xfffff3719750, metricId=0, metricIncr=..., skipInner=0, isSync=0, data=0x0)
    at ../../../../src/tool/hpcrun/cct_insert_backtrace.c:246
#12 0x0000ffff9650695c in hpcrun_sample_callpath (context=0xfffff3719750, metricId=0, metricIncr=..., skipInner=0, isSync=0, data=0x0) at ../../../../src/tool/hpcrun/sample_event.c:238
#13 0x0000ffff9650aae0 in itimer_signal_handler (sig=37, siginfo=0xfffff37196d0, context=0xfffff3719750) at ../../../../src/tool/hpcrun/sample-sources/itimer.c:725
#14 0x0000ffff964b5a90 in monitor_signal_handler (sig=37, info=0xfffff37196d0, context=0xfffff3719750) at ../../libmonitor/src/signal.c:217
#15 <signal handler called>
#16 0x0000ffff96247754 in __read_nocancel () from /lib64/libpthread.so.0
#17 0x0000ffff953a3ae0 in rte_init.part () from /projects/pkgs/openmpi-3.0.0/lib/openmpi/mca_ess_singleton.so
#18 0x0000ffff95ea62ac in orte_init () from /projects/pkgs/openmpi-3.0.0/lib/libopen-rte.so.40
#19 0x0000ffff962f52f0 in ompi_mpi_init () from /projects/pkgs/openmpi-3.0.0/lib/libmpi.so.40
#20 0x0000ffff9631c314 in PMPI_Init () from /projects/pkgs/openmpi-3.0.0/lib/libmpi.so.40
#21 0x0000ffff964b6f58 in MPI_Init (argc=0xfffff371b1ec, argc@entry=0xfffff371b2fc, argv=0xfffff371b1e0, argv@entry=0xfffff371b2f0) at ../../libmonitor/src/mpi_init_c.c:33
#22 0x0000000000401ed8 in main (argc=<optimized out>, argv=<optimized out>) at amg2006.c:1615
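
In this backtrace the timer signal handler is itself sitting in write() (hpcrun writing a NORM_IP debug message, frames #0-#1) on top of the read() it interrupted in rte_init (frames #16-#17); the interrupted read() cannot resume until the handler returns. The following standalone sketch illustrates that pattern (a hypothetical illustration, not hpcrun code; a never-drained pipe is used here just to force the handler's write() to block):

/* handler_blocks_read.c - hypothetical illustration (not hpcrun code) of the
 * pattern in the backtrace above: a timer signal handler blocked in write()
 * prevents the interrupted read() underneath it from ever resuming.
 * A never-drained pipe forces write() to block; in the real backtrace the
 * write is hpcrun_write_msg_to_log's __write_nocancel.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static int log_fd;   /* write end of a pipe nobody reads: stands in for the log */

static void timer_handler(int sig)
{
    (void)sig;
    static char msg[4096];
    memset(msg, 'x', sizeof msg);
    /* Once the pipe buffer (typically 64 KiB) fills up, this write() blocks
     * inside the handler, and the interrupted read() in main() can never
     * resume: the process hangs much like the backtrace above. */
    if (write(log_fd, msg, sizeof msg) < 0)
        _exit(1);
}

int main(void)
{
    int log_pipe[2], wait_pipe[2];
    if (pipe(log_pipe) < 0 || pipe(wait_pipe) < 0)
        return 1;
    log_fd = log_pipe[1];

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = timer_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;           /* restart the read after each sample */
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval it = { { 0, 5000 }, { 0, 5000 } };  /* 5 ms period */
    setitimer(ITIMER_REAL, &it, NULL);

    /* Stands in for rte_init's blocking __read_nocancel: nothing is ever
     * written to wait_pipe, so this read() is interrupted and restarted on
     * every sample -- until the handler itself blocks in write(). */
    char c;
    (void)read(wait_pipe[0], &c, 1);
    fprintf(stderr, "unreachable in practice\n");
    return 0;
}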
