I'm trying to use a custom worker pool (with the goal of using the master process as a worker too, so as not to waste a GPU) but I'm getting a deadlock in this package. Unfortunately I can't reproduce it with a minimal working example, but schematically it looks something like this (although this snippet by itself doesn't seem to trigger it):
```julia
# myscript.jl
using MPI, MPIClusterManagers, Distributed

MPI.Init()
MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL) # or TCP_TRANSPORT_ALL, doesn't matter

pool = WorkerPool(procs())

for i = 1:100
    pmap(pool, 1:10) do j
        # ...
    end
end
```
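For context, a pure-`Distributed` sketch (no MPI involved, with `addprocs` standing in for the MPI-launched workers) of what the pool setup is doing: `WorkerPool(procs())` includes the master process (pid 1) in the pool, whereas `pmap`'s default pool is built from `workers()` and excludes it. The task body here is just an illustrative placeholder:

```julia
using Distributed

addprocs(2)  # stand-in for workers that MPIClusterManagers would launch

pool = WorkerPool(procs())            # pids [1, 2, 3]: master included
default_pool = WorkerPool(workers())  # pids [2, 3]: master excluded

# Each task reports which pid ran it, plus a placeholder result.
results = pmap(pool, 1:10) do j
    (myid(), j^2)
end

squares = sort(last.(results))  # deterministic regardless of scheduling
```

With the custom pool, pid 1 can show up in `first.(results)`; with `default_pool` it never does, which is exactly the GPU-wasting behavior being avoided.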
and then run it with `mpiexec -n 4 julia myscript.jl`. In my real case, it always deadlocks somewhere before `i = 100`. Interrupting a worker, I can retrieve this stack trace:
```
signal (15): Terminated
in expression starting at /global/u1/m/marius/work/pipelineB2/scripts/bk18_fwdsim2_nodust.jl:68
jl_pgcstack_addr_static at /buildworker/worker/package_linux64/build/cli/loader_exe.c:14
ctx_switch at /buildworker/worker/package_linux64/build/src/task.c:398
jl_switch at /buildworker/worker/package_linux64/build/src/task.c:502
try_yieldto at ./task.jl:767
wait at ./task.jl:837
yield at ./task.jl:721
receive_event_loop at /global/homes/m/marius/.julia/packages/MPIClusterManagers/TTxqG/src/mpimanager.jl:430
#20 at ./task.jl:423
unknown function (ip: 0x7ff690147f7f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877
unknown function (ip: (nil))
Allocations: 766929758 (Pool: 766247147; Big: 682611); GC: 1133
```
Any ideas what could be going on? Julia v1.7.2, MPI v0.19.2, MPIClusterManagers v0.2.1, OpenMPI v4.0.5.
marius311 changed the title from "Deadlock with MPI transport and MPIManager" to "Deadlock with MPIManager and custom worker pool" on Feb 13, 2022.