Deadlock with MPIManager and custom worker pool #32

Open
marius311 opened this issue Feb 13, 2022 · 0 comments

marius311 commented Feb 13, 2022

I'm trying to use a custom worker pool (the goal is to use the master process as a worker too, so as not to waste a GPU), but I'm getting a deadlock in this package. Unfortunately I can't produce an MWE, but schematically the code looks something like this (although this snippet by itself doesn't seem to trigger the deadlock):

# myscript.jl
using MPI, MPIClusterManagers, Distributed

MPI.Init()
MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL) # or TCP_TRANSPORT_ALL, doesn't matter

pool = WorkerPool(procs())

for i = 1:100
    pmap(pool, 1:10) do j
        # ...
    end
end

and then run mpiexec -n 4 julia myscript.jl. In my real case, it always deadlocks somewhere before i = 100. After interrupting a worker, I can retrieve this stack trace:

signal (15): Terminated
in expression starting at /global/u1/m/marius/work/pipelineB2/scripts/bk18_fwdsim2_nodust.jl:68
jl_pgcstack_addr_static at /buildworker/worker/package_linux64/build/cli/loader_exe.c:14
ctx_switch at /buildworker/worker/package_linux64/build/src/task.c:398
jl_switch at /buildworker/worker/package_linux64/build/src/task.c:502
try_yieldto at ./task.jl:767
wait at ./task.jl:837
yield at ./task.jl:721
receive_event_loop at /global/homes/m/marius/.julia/packages/MPIClusterManagers/TTxqG/src/mpimanager.jl:430
#20 at ./task.jl:423
unknown function (ip: 0x7ff690147f7f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877
unknown function (ip: (nil))
Allocations: 766929758 (Pool: 766247147; Big: 682611); GC: 1133

Any ideas what could be going on? Versions: Julia v1.7.2, MPI v0.19.2, MPIClusterManagers v0.2.1, OpenMPI v4.0.5.

marius311 changed the title from "Deadlock with MPI transport and MPIManager" to "Deadlock with MPIManager and custom worker pool" on Feb 13, 2022