hpcrun with OpenMPI: deadlock profiling MPI program (ARM & x86 machines) #73
Platform: ARM
Branch: master and mult-kernel
Compiler: GCC
MPI: OpenMPI 3.0.0 (gcc)
AMG2006 and Nekbone work fine without HPCToolkit:
[la5@arm1 test]$ time ./amg2006 -P 1 1 1 -r 10 10 10
...
real 0m7.140s
user 0m14.706s
sys 0m0.248s
But both deadlock when profiled with REALTIME or with perf_events' CYCLES.
The error messages below appear when profiling with REALTIME:
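(For the CYCLES case the invocation looks roughly like the following; this is a sketch from memory, and the exact event name and threshold syntax on this build may differ. Only the REALTIME command below is copied from the actual run.)
$ /home/la5/pkgs/hpctoolkit-master/bin/hpcrun -e CYCLES ./amg2006 -P 1 1 1 -r 10 10 10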
$ /home/la5/pkgs/hpctoolkit-master/bin/hpcrun -dd LINUX_PERF -e REALTIME@5000 ./amg2006 -P 1 1 1 -r 10 10 10
[arm1.cs.rice.edu:03399] PMIX ERROR: ERROR in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1658
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1759
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/dstore/pmix_esh.c at line 1123
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c at line 112
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/common/pmix_jobdata.c at line 392
[arm1.cs.rice.edu:03399] PMIX ERROR: OUT-OF-RESOURCE in file ../../../../../../../opal/mca/pmix/pmix2x/pmix/src/server/pmix_server.c at line 518
After the error messages, the application deadlocks in MPI initialization.
Location of the deadlock using gdb:
(gdb) bt
#0 0x000040003e5cc754 in __read_nocancel () from /lib64/libpthread.so.0
#1 0x000040003f469ae0 in rte_init.part () from /projects/pkgs/openmpi-3.0.0/lib/openmpi/mca_ess_singleton.so
#2 0x000040003e8e02ac in orte_init () from /projects/pkgs/openmpi-3.0.0/lib/libopen-rte.so.40
#3 0x000040003e4912f0 in ompi_mpi_init () from /projects/pkgs/openmpi-3.0.0/lib/libmpi.so.40
#4 0x000040003e4b8314 in PMPI_Init () from /projects/pkgs/openmpi-3.0.0/lib/libmpi.so.40
#5 0x000040003e33df58 in MPI_Init (argc=0xfffff2f500ec, argc@entry=0xfffff2f501fc, argv=0xfffff2f500e0, argv@entry=0xfffff2f501f0) at ../../libmonitor/src/mpi_init_c.c:33
#6 0x0000000000401ed8 in main (argc=<optimized out>, argv=<optimized out>) at amg2006.c:1615
Somehow there are two hpcfnbounds processes:
[la5@arm1 test]$ ps
PID TTY TIME CMD
1894 pts/12 00:00:06 amg2006
1909 pts/12 00:00:00 hpcfnbounds-bin
1918 pts/12 00:00:00 hpcfnbounds-bin
1931 pts/12 00:00:00 ps
21150 pts/12 00:00:00 bash
It seems hpcrun causes the deadlock in MPI initialization: the timer still fires, but the application cannot make progress past MPI_Init (see the backtrace below, and the small sketch after it).
[la5@arm1 test]$ gdb -p 1894
...
(gdb) bt
#0 0x0000ffff962476e8 in __write_nocancel () from /lib64/libpthread.so.0
#1 0x0000ffff96514fe4 in hpcrun_write_msg_to_log (echo_stderr=false, add_thread_id=true, tag=0xffff96532dc8 "NORM_IP", fmt=0xffff96532da0 "normalizing %p, w load_module %s",
box=0xfffff3717bd0) at ../../../../src/tool/hpcrun/messages/messages-async.c:242
#2 0x0000ffff96514c00 in hpcrun_pmsg (tag=0xffff96532dc8 "NORM_IP", fmt=0xffff96532da0 "normalizing %p, w load_module %s") at ../../../../src/tool/hpcrun/messages/messages-async.c:154
#3 0x0000ffff965170d4 in hpcrun_normalize_ip (unnormalized_ip=0x401e90 <main>, lm=0xffff9583ae48) at ../../../../src/tool/hpcrun/utilities/ip-normalized.c:72
#4 0x0000ffff9651fcd4 in compute_normalized_ips (cursor=0xfffff3718278) at ../../../../src/tool/hpcrun/unwind/common/libunw_intervals.c:102
#5 0x0000ffff9651fda4 in libunw_find_step (cursor=0xfffff3718278) at ../../../../src/tool/hpcrun/unwind/common/libunw_intervals.c:124
#6 0x0000ffff965201e4 in libunw_unw_step (cursor=0xfffff3718278) at ../../../../src/tool/hpcrun/unwind/common/libunw_intervals.c:251
#7 0x0000ffff96521834 in hpcrun_unw_step (cursor=0xfffff3718278) at ../../../../src/tool/hpcrun/unwind/generic-libunwind/libunw-unwind.c:176
#8 0x0000ffff9651ea6c in hpcrun_generate_backtrace_no_trampoline (bt=0xfffff37193f0, context=0xfffff3719750, skipInner=0) at ../../../../src/tool/hpcrun/unwind/common/backtrace.c:249
#9 0x0000ffff9651ecbc in hpcrun_generate_backtrace (bt=0xfffff37193f0, context=0xfffff3719750, skipInner=0) at ../../../../src/tool/hpcrun/unwind/common/backtrace.c:310
#10 0x0000ffff96502eb0 in help_hpcrun_backtrace2cct (bundle=0xffff9589ffb0, context=0xfffff3719750, metricId=0, metricIncr=..., skipInner=0, isSync=0, data=0x0)
at ../../../../src/tool/hpcrun/cct_insert_backtrace.c:379
#11 0x0000ffff96502a28 in hpcrun_backtrace2cct (cct=0xffff9589ffb0, context=0xfffff3719750, metricId=0, metricIncr=..., skipInner=0, isSync=0, data=0x0)
at ../../../../src/tool/hpcrun/cct_insert_backtrace.c:246
#12 0x0000ffff9650695c in hpcrun_sample_callpath (context=0xfffff3719750, metricId=0, metricIncr=..., skipInner=0, isSync=0, data=0x0) at ../../../../src/tool/hpcrun/sample_event.c:238
#13 0x0000ffff9650aae0 in itimer_signal_handler (sig=37, siginfo=0xfffff37196d0, context=0xfffff3719750) at ../../../../src/tool/hpcrun/sample-sources/itimer.c:725
#14 0x0000ffff964b5a90 in monitor_signal_handler (sig=37, info=0xfffff37196d0, context=0xfffff3719750) at ../../libmonitor/src/signal.c:217
#15 <signal handler called>
#16 0x0000ffff96247754 in __read_nocancel () from /lib64/libpthread.so.0
#17 0x0000ffff953a3ae0 in rte_init.part () from /projects/pkgs/openmpi-3.0.0/lib/openmpi/mca_ess_singleton.so
#18 0x0000ffff95ea62ac in orte_init () from /projects/pkgs/openmpi-3.0.0/lib/libopen-rte.so.40
#19 0x0000ffff962f52f0 in ompi_mpi_init () from /projects/pkgs/openmpi-3.0.0/lib/libmpi.so.40
#20 0x0000ffff9631c314 in PMPI_Init () from /projects/pkgs/openmpi-3.0.0/lib/libmpi.so.40
#21 0x0000ffff964b6f58 in MPI_Init (argc=0xfffff371b1ec, argc@entry=0xfffff371b2fc, argv=0xfffff371b1e0, argv@entry=0xfffff371b2f0) at ../../libmonitor/src/mpi_init_c.c:33
#22 0x0000000000401ed8 in main (argc=<optimized out>, argv=<optimized out>) at amg2006.c:1615
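Frames #0-#1 show the sample handler itself apparently blocked in a write() to the hpcrun log, while the interrupted frame #16 is the blocking read() inside OpenMPI's singleton rte_init. A minimal, self-contained sketch of that shape (this is not hpcrun's code; the deliberately filled "log" pipe is only an assumed way to make the handler's write() block the way __write_nocancel does here) hangs the same way:

/* Minimal sketch, NOT hpcrun's code: a profiling timer whose signal handler
 * does a blocking write() while the main thread sits in a blocking read(),
 * mirroring frames #0-#1 and #16 of the backtrace above.  The "log" here is
 * a pipe that is filled first -- an assumption made only so the handler's
 * write() blocks. */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static int log_fd;   /* write end of the (full) "log" pipe: the handler blocks here */

static void handler(int sig)
{
  (void) sig;
  /* Blocks forever once the pipe buffer is full; the handler never returns,
   * so the interrupted read() in main() can never make progress either. */
  write(log_fd, "sample\n", 7);
}

int main(void)
{
  int logp[2], waitp[2];
  pipe(logp);
  pipe(waitp);
  log_fd = logp[1];

  /* Fill the log pipe so the next blocking write() can make no progress. */
  char junk[4096];
  memset(junk, 'x', sizeof junk);
  fcntl(logp[1], F_SETFL, O_NONBLOCK);
  while (write(logp[1], junk, sizeof junk) > 0)
    ;
  fcntl(logp[1], F_SETFL, 0);   /* back to blocking mode */

  /* 5 ms interval timer, comparable to REALTIME@5000. */
  struct sigaction sa;
  memset(&sa, 0, sizeof sa);
  sa.sa_handler = handler;
  sa.sa_flags = SA_RESTART;
  sigaction(SIGALRM, &sa, NULL);

  struct itimerval it;
  it.it_interval.tv_sec = 0;
  it.it_interval.tv_usec = 5000;
  it.it_value = it.it_interval;
  setitimer(ITIMER_REAL, &it, NULL);

  /* Stands in for the read() inside OpenMPI's singleton rte_init():
   * nothing ever arrives, and after the first sample the handler is
   * stuck in write(), so the process hangs with a stack like the one above. */
  char c;
  read(waitp[0], &c, 1);
  return 0;
}

Attaching gdb to this toy program shows the same picture: write() inside the signal handler stacked on top of read() in main().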