MPI.jl with cray-mpich on the Cray XC50 is not running #616

Open

wons6554 opened this issue Jul 11, 2022 · 7 comments

wons6554 commented Jul 11, 2022

The admin installed Julia as an environment module in a system directory of our Cray XC50.

$ /opt/prg/.src/julia/julia-1.7.3-full.gnu> cat Make.user
prefix=/opt/prg/julia/1.7.3/GNU/73


$ /opt/prg/.src/julia/julia-1.7.3-full.gnu> cat conf.gnu
module swap PrgEnv-cray PrgEnv-gnu
module load PrgEnv-gnu
module load cmake
make -C deps -j 20 USE_BINARYBUILDER=0
make
make install

I have since been trying to set up MPI.jl in my account, but it fails with the errors below.
Could I get some clue about what the problem is?

user1@login:~:> module list
Currently Loaded Modulefiles:
  1) modules/3.2.11.1                                 13) dmapp/7.1.1-6.0.7.0_34.3__g5a674e0.ari
  2) eproxy/2.0.22-6.0.7.0_37.1__g1ebe45c.ari         14) gni-headers/5.0.12.0-6.0.7.0_24.1__g3b1768f.ari
  3) gcc/7.3.0                                        15) xpmem/2.2.15-6.0.7.1_5.7__g7549d06.ari
  4) craype-network-aries                             16) job/2.2.3-6.0.7.0_44.1__g6c4e934.ari
  5) craype-x86-skylake                               17) dvs/2.7_2.2.112-6.0.7.1_6.1__ge96a422
  6) craype/2.5.15                                    18) alps/6.6.43-6.0.7.0_26.4__ga796da3.ari
  7) cray-mpich/7.7.3                                 19) rca/2.2.18-6.0.7.0_33.3__g2aa4f39.ari
  8) pbs/default                                      20) atp/2.1.3
  9) cray-lprgci/18.07.1                              21) perftools-base/7.0.4
 10) udreg/2.3.2-6.0.7.0_33.18__g5196236.ari          22) PrgEnv-gnu/6.0.4
 11) ugni/6.0.14.0-6.0.7.0_23.1__gea11d3d.ari         23) julia/1.7.3
 12) pmi/5.0.14



user1@login:~:>
user1@login:~:>
user1@login:~:> export JULIA_MPI_PATH=$CRAY_MPICH_DIR
user1@login:~:>
user1@login:~:> echo $JULIA_MPI_PATH
/opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1
user1@login:~:>
user1@login:~:> julia --project -e 'import Pkg; Pkg.add("MPI")'
user1@login:~:>  julia --project -e 'ENV["JULIA_MPI_BINARY"]="system"; using Pkg; Pkg.build("MPI"; verbose=true)'
    Building MPI → `~/.julia/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/d56a80d8cf8b9dc3050116346b3d83432b1912c0/build.log`
  Progress [>                                        ]  0/1
[ Info: using system MPI
┌ Info: Using implementation
│   libmpi = "/opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1/lib/libmpich"
│   mpiexec_cmd = `/opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1/bin/mpiexec`
└   MPI_LIBRARY_VERSION_STRING = "MPI VERSION    : CRAY MPICH version 7.7.3 (ANL base 3.2)\nMPI BUILD INFO : Built Wed Aug 22 15:44:54 2018 (git hash b88a4a20c) MT-G\n"
┌ Info: MPI implementation detected
│   impl = CrayMPICH::MPIImpl = 7
│   version = v"7.7.3"
└   abi = "MPICH"
Precompiling project...
  13 dependencies successfully precompiled in 4 seconds


user1@login:~:>  julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.3 (2022-05-06)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> using MPI

julia> MPI.Init()
[Mon Jul 11 16:22:54 2022] [unknown] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(638).......:  PMI2 init failed: 1

signal (6): Aborted
in expression starting at REPL[2]:1
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
MPID_Abort at /opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1/lib/libmpich.so (unknown line)
MPIR_Handle_fatal_error at /opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1/lib/libmpich.so (unknown line)
MPIR_Err_return_comm at /opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1/lib/libmpich.so (unknown line)
MPI_Init_thread at /opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1/lib/libmpich.so (unknown line)
_init_thread at /home/user1/.julia/packages/MPI/08SPr/src/environment.jl:156 [inlined]
#Init#32 at /home/user1/.julia/packages/MPI/08SPr/src/environment.jl:94
Init at /home/user1/.julia/packages/MPI/08SPr/src/environment.jl:82
unknown function (ip: 0x7f5ded4ab89e)
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
eval at ./boot.jl:373 [inlined]
eval_user_input at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:150
repl_backend_loop at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:246
start_repl_backend at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:231
#run_repl#47 at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:364
run_repl at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:351
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
#936 at ./client.jl:394
jfptr_YY.936_30555 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
jl_f__call_latest at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/builtins.c:757
#invokelatest#2 at ./essentials.jl:716 [inlined]
invokelatest at ./essentials.jl:714 [inlined]
run_main_repl at ./client.jl:379
exec_options at ./client.jl:309
_start at ./client.jl:495
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2722 (Pool: 2710; Big: 12); GC: 0
Aborted

user1@login:~:>
@vchuravy
Member

Can you try writing a C reproducer? I am not sure this is a Julia issue.

What happens if you run this under mpirun/mpiexec?
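
For reference, a minimal C reproducer might look like the sketch below (the file name mpi_hello.c and the exact commands are placeholders; on a Cray system the cc wrapper should link cray-mpich automatically, and the binary would be launched with aprun inside a batch job):

/* mpi_hello.c -- minimal MPI reproducer, independent of Julia.
 * Build (for example):  cc -o mpi_hello mpi_hello.c
 * Run   (for example):  aprun -n 2 ./mpi_hello    (inside a PBS job)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    /* MPI.jl calls MPI_Init_thread; plain MPI_Init goes through the same
     * PMI initialization that aborts in the Julia trace above. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world, I am rank %d of %d\n", rank, size);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}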


wons6554 commented Jul 12, 2022

Here is the result of running the MPI job with 'aprun'; you can see 'Segmentation fault' in the output.
I don't know why the Julia source path '/mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/' is listed in the backtrace;
the installed Julia path is '/mnt/lustre/opt/prg/julia/1.7.3/GNU/73/'.


user1@login:/proj/user1/test:> cat hello_world.mpi.jl

#examples/01-hello.jl
using MPI
MPI.Init()

comm = MPI.COMM_WORLD
print("Hello world, I am rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))\n")
MPI.Barrier(comm)

user1@login:/proj/user1/test:>
user1@login:/proj/user1/test:> cat parallel_job.julia.sh
#!/bin/sh
#PBS -N mpi_test
#PBS -q normal
#PBS -l select=2:ncpus=3:mpiprocs=3:ompthreads=1
#PBS -l walltime=00:20:00
#PBS -j oe
cd $PBS_O_WORKDIR

module unload PrgEnv-intel
module unload PrgEnv-cray
module unload PrgEnv-gnu
module load PrgEnv-gnu
module load cmake
module load julia/1.7.3

aprun -n 6 julia hello_world.mpi.jl


user1@login:/proj/user1/test:>
user1@login:/proj/user1/test:> qsub parallel_job.julia.sh
588693.sdb
user1@login:/proj/user1/test:>
user1@login:/proj/user1/test:> cat mpi_test.o588693
elogin version login.. loading modules

signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3

signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3

signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3

signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3

signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/user1/test/hello_world.mpi.jl:3
unknown function (ip: (nil))
unknown function (ip: (nil))
unknown function (ip: (nil))
unknown function (ip: (nil))
unknown function (ip: (nil))
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
do_call at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:126
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:215
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
eval_stmt_value at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:587
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_interpret_toplevel_thunk at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/interpreter.c:731
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
jl_toplevel_eval_flex at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
jl_toplevel_eval_in at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/toplevel.c:944
eval at ./boot.jl:373 [inlined]
eval at ./boot.jl:373 [inlined]
include_string at ./loading.jl:1196
eval at ./boot.jl:373 [inlined]
eval at ./boot.jl:373 [inlined]
include_string at ./loading.jl:1196
include_string at ./loading.jl:1196
include_string at ./loading.jl:1196
eval at ./boot.jl:373 [inlined]
include_string at ./loading.jl:1196
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_include at ./loading.jl:1253
_include at ./loading.jl:1253
_include at ./loading.jl:1253
_include at ./loading.jl:1253
_include at ./loading.jl:1253
include at ./Base.jl:418
include at ./Base.jl:418
include at ./Base.jl:418
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
include at ./Base.jl:418
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
include at ./Base.jl:418
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
exec_options at ./client.jl:292
exec_options at ./client.jl:292
exec_options at ./client.jl:292
exec_options at ./client.jl:292
exec_options at ./client.jl:292
_start at ./client.jl:495
_start at ./client.jl:495
_start at ./client.jl:495
_start at ./client.jl:495
_start at ./client.jl:495
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jfptr__start_22092 at /mnt/lustre/opt/prg/julia/1.7.3/GNU/73/lib/julia/sys.so (unknown line)
_jl_invoke at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/gf.c:2429
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
jl_apply at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
true_main at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:559
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
jl_repl_entrypoint at /mnt/lustre/opt/prg/.src/julia/julia-1.7.3.gnu/src/jlapi.c:701
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2723 (Pool: 2710; Big: 13); GC: 0
_pmiu_daemon(SIGCHLD): [NID 00381] [c1-0c2s15n1] [Tue Jul 12 10:51:52 2022] PE RANK 4 exit signal Segmentation fault
[NID 00381] 2022-07-12 10:51:52 Apid 1303379: initiated application termination
Application 1303379 exit codes: 139
Application 1303379 resources: utime ~1s, stime ~2s, Rss ~179684, inblocks ~1816, outblocks ~0

8<---------------- PBS Pro Epilogue ------------------
Job id           : 588693.sdb
User             : user1
Group            : normal
Jobname          : mpi_test
Session          : 3349
Resource limits  : arch=XT,ncpus=6,place=scatter,walltime=00:20:00
Resources used   : cpupercent=25,cput=00:00:03,mem=11124kb,ncpus=6,vmem=138176kb,walltime=00:00:18
Queue            : normal
Account          : null
Exit Status      : 139
Directory        : /mnt/lustre/home/user1
ALPS ResId       : 705284
Hostname         : mom1

user1@login:/proj/user1/test:>


w21085 commented Jul 22, 2022

I have also experienced the same issue.
Only one rank works; the other ranks fail when calling MPI.Init().
The backtrace from the coredump is below:

#23 0x00002aaab6fb8abc in julia_exec_options_44569 () at client.jl:292
#24 0x00002aaab6ea47ae in julia__start_44805 () at client.jl:495
#25 0x00002aaab6ea4929 in jfptr.start_44806 ()
   from /mnt/lustre/julia/1.7.3/GNU/83/lib/julia/sys.so
#26 0x00002aaaabdc9576 in _jl_invoke (world=31331, mfunc=<optimized out>, nargs=0, args=0x7fffffff6f60,
    F=0x2aaab8666dc0 <jl_system_image_data+20085824>)
    at /mnt/lustre/julia/1.7.3//GNU/src/gf.c:2247
#27 jl_apply_generic (F=<optimized out>, args=0x7fffffff6f60, nargs=<optimized out>)
    at /mnt/lustre/julia/1.7.3/GNU/src/gf.c:2429
#28 0x00002aaaabe26406 in jl_apply (nargs=1, args=0x7fffffff6f58)
    at /mnt/lustre/julia/1.7.3/GNU/src/julia.h:1788
#29 true_main (argc=<optimized out>, argv=<optimized out>)
    at /mnt/lustre/julia/1.7.3/GNU/src/jlapi.c:559
#30 0x00002aaaabe26dab in jl_repl_entrypoint (argc=<optimized out>, argv=<optimized out>)
    at /mnt/lustre/julia/1.7.3/GNU/src/jlapi.c:701
#31 0x0000000000400839 in main (argc=<optimized out>, argv=<optimized out>)
    at /mnt/lustre/julia/1.7.3/GNU/cli/loader_exe.c:42

my test example is as follows:

using MPI

function do_hello()
        comm = MPI.COMM_WORLD
        println("Hello world, I am $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))\n")
        MPI.Barrier(comm)
end

function main()
        println("Hello world before MPI.Init\n")
        MPI.Init()
        do_hello()
        MPI.Finalize()
end

main()

All ranks print "Hello world before MPI.Init",
but only the first rank prints "Hello world, I am 0 of 3".
A segmentation fault occurs when the other ranks call MPI.Init().

@shahzebsiddiqui

Since this is a Cray machine, you should be compiling with cc rather than the traditional MPI wrapper mpicc. Our Julia expert at NERSC is @JBlaschke. A segfault could be caused by almost anything.

According to the installation docs (https://juliaparallel.org/MPI.jl/v0.10/installation/), what are the values of JULIA_MPICC and JULIA_MPIEXEC? Have you tried setting the launcher to aprun? I am not sure whether that makes sense on your HPC system.


w21085 commented Jul 25, 2022

Thanks for your advice.

I tried building MPI.jl again with the environment below (for both MPI.jl v0.19.2 and v0.18.0):

export JULIA_MPI_BINARY="system"
export JULIA_MPI_PATH=/opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1
export JULIA_MPI_INCLUDE_PATH=/opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1/include
export JULIA_MPI_LIBRARY_PATH=/opt/cray/pe/mpt/7.7.3/gni/mpich-gnu/7.1/lib
export JULIA_MPI_LIBRARY="libmpich"
export JULIA_MPICC=cc
export JULIA_MPIEXEC="aprun"

The build completed without errors.

Then I submitted a job with aprun:

 aprun -n 2 julia ./hello_mpi.jl

Our system still shows the same symptom when the job runs:

Hello world before MPI.Init
Hello world before MPI.Init



signal (11): Segmentation fault
in expression starting at /mnt/lustre/proj/crayuser1/julia/hello_mpi.jl:16
unknown function (ip: (nil))
run_init_hooks at /proj/crayuser1/.julia/packages/MPI/E3Wer/src/environment.jl:49
#Init#30 at /proj/crayuser1/.julia/packages/MPI/E3Wer/src/environment.jl:91
Init at /proj/crayuser1/.julia/packages/MPI/E3Wer/src/environment.jl:86 [inlined]
main at /mnt/lustre/proj/crayuser1/julia/hello_mpi.jl:11
unknown function (ip: 0x2aaac0c12c8f)
Hello world, I am 0 of 2

_jl_invoke at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/gf.c:2429
jl_apply at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/julia.h:1788 [inlined]
do_call at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/interpreter.c:126
eval_value at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/interpreter.c:215
eval_stmt_value at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/interpreter.c:166 [inlined]
eval_body at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/interpreter.c:583
jl_interpret_toplevel_thunk at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/interpreter.c:731
jl_toplevel_eval_flex at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/toplevel.c:885
jl_toplevel_eval_flex at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/toplevel.c:830
jl_toplevel_eval_in at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/toplevel.c:944
eval at ./boot.jl:373 [inlined]
include_string at ./loading.jl:1196
_jl_invoke at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/gf.c:2429
_include at ./loading.jl:1253
include at ./Base.jl:418
_jl_invoke at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/gf.c:2429
exec_options at ./client.jl:292
_start at ./client.jl:495
jfptr__start_44806 at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/83/lib/julia/sys.so (unknown line)
_jl_invoke at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/gf.c:2247 [inlined]
jl_apply_generic at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/gf.c:2429
jl_apply at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/julia.h:1788 [inlined]
true_main at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/jlapi.c:559
jl_repl_entrypoint at /mnt/lustre/opt/apps/julia/1.7.3b/GNU/src/jlapi.c:701
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
_start at julia (unknown line)
Allocations: 2721 (Pool: 2710; Big: 11); GC: 0
_pmiu_daemon(SIGCHLD): [NID 00009] [$$$$] [Mon Jul 25 12:48:31 2022] PE RANK 1 exit signal Segmentation fault
[NID 00009] 2022-07-25 12:48:31 Apid 1358469: initiated application termination
Application 1358469 exit codes: 139

Contributor

JBlaschke commented Jul 25, 2022

Hi @w21085 (and @wons6554 ?). (Thanks @shahzebsiddiqui for bringing this issue to my attention.)

I think there are several things to unpack here: (1) you're building your own version of Julia; (2) you're building MPI.jl. Nothing in Julia itself uses MPI, so I don't think the segmentation fault you see is related to (1). I will therefore address (2) first, as it should work regardless of how you build Julia.

This is how we build MPI.jl at NERSC:

using Pkg

# We use a shared global environment -- you might be doing something similar, but this next line is not necessary for single-user repos.
# Pkg.activate("globalenv", shared=true)

ENV["JULIA_MPI_BINARY"] = "system"
ENV["JULIA_MPI_PATH"]   = ENV["CRAY_MPICH_DIR"]
ENV["JULIA_MPIEXEC"]    = "srun"

Pkg.add("MPI")
Pkg.build("MPI"; verbose=true)

That's it. No secret sauce. No specifying the compiler directly, or pointing to libmpich. Please give this a try (using aprun instead of srun), and then run your test script. @wons6554 I notice you're not tearing down MPI with MPI.Finalize() -- that shouldn't cause a segmentation fault, but you never know with Cray. So here is my test script:

using MPI

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
name = gethostname()

println("Hello world, I am rank $(rank) of $(size) on $(name)")

MPI.Barrier(comm)
MPI.Finalize()

The fact that your test fails without [ap,s]run is a good sign -- the Cray MPICH libraries are reliant on PMI2 from the resource manager, which is invoked via [ap,s]run.
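
(If the C reproducer sketched earlier aborts in MPI_Init the same way when executed directly on a login node, but runs cleanly under aprun inside a batch job, that would point at the PMI2 dependency rather than at Julia or MPI.jl.)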

RE point (1) -- building Julia from source:

The compiler wrappers are not necessary to build MPI.jl. I have built Julia from source (using the Makefile and also Spack) with varying degrees of success. There are circumstances where the compiler wrappers are needed (e.g. to pick up MKL). I think what @shahzebsiddiqui meant was specifying cc and CC there -- here is the Make.user we've got on NERSC's Perlmutter system:

# No BinaryBuilder for dependencies
USE_BINARYBUILDER = 0

# Cannot use cray `ftn` to build libblas and liblapack because:
# ftn-78 crayftn: ERROR in command line
#   The -f option has an invalid argument, "default-integer-8".
# from OpenBLAS => use system BLAS and LAPACK (which makes sense anyway)
USE_SYSTEM_BLAS = 1
LIBBLAS = -lopenblas
LIBBLASNAME = libopenblas
USE_SYSTEM_LAPACK = 1
LIBLAPACK = -llapack
LIBLAPACKNAME = liblapack

# LLVM doesn't compile with with `cc` => use the system one (I think it's the
# right thing to do anyway)
USE_SYSTEM_LLVM = 1

# Force the use of the Cray compiler wrappers
override FC := $(shell which ftn)
override CXX := $(shell which CC)
override CC := $(shell which cc)
# Some things still require the GCC libraries => patching those back into the
# compiler config
LDFLAGS += -L/opt/cray/pe/gcc-libs -Wl,-rpath,/opt/cray/pe/gcc-libs -lstdc++
CXXLDFLAGS += -L/opt/cray/pe/gcc-libs -Wl,-rpath,/opt/cray/pe/gcc-libs -lstdc++

# Rig PKG_CONFIG -- the Make.inc overwrites this, so we're making sure that the
# system configurations are in there
SYS_PKG_CONFIG_PATH := $(PKG_CONFIG_PATH):/opt/cray/xpmem/2.2.40-7.0.1.0_2.3__g1d7a24d.shasta/lib64/pkgconfig/
SYS_PKG_CONFIG_LIBDIR := $(PKG_CONFIG_LIBDIR)
override PKG_CONFIG_PATH = $(JULIAHOME)/usr/lib/pkgconfig:$(SYS_PKG_CONFIG_PATH)
override PKG_CONFIG_LIBDIR = $(JULIAHOME)/usr/lib/pkgconfig:$(SYS_PKG_CONFIG_LIBDIR)

Please note that this is full of jiggery-pokery to get something to work on an unstable platform. The specifics are tailored to the state of Perlmutter from 6 months ago. Use with caution.

Also looking forward to what @vchuravy has to say on these topics.


wons6554 commented Aug 4, 2022

Hi @JBlaschke, thank you so much for your advice.
I have been rebuilding Julia with your Make.user configuration as a reference, and have retried the build many times trying to get it through without errors.

But now I am hitting the following problem while building Julia:


$ cat Make.user
prefix=/opt/prg/julia/1.7.3/GNU/73

USE_BINARYBUILDER = 0

USE_SYSTEM_LLVM = 1

override FC := ftn
override CXX := CC
override CC := cc

LDFLAGS += -L/opt/cray/pe/gcc-libs -Wl,-rpath,/opt/cray/pe/gcc-libs -lstdc++
CXXLDFLAGS += -L/opt/cray/pe/gcc-libs -Wl,-rpath,/opt/cray/pe/gcc-libs -lstdc++

SYS_PKG_CONFIG_PATH := $(PKG_CONFIG_PATH):/opt/cray/xpmem/2.2.15-6.0.7.1_5.7__g7549d06.ari/lib64/pkgconfig/
SYS_PKG_CONFIG_PATH := $(PKG_CONFIG_PATH):/usr/lib64/pkgconfig/

SYS_PKG_CONFIG_LIBDIR := $(CRAY_LD_LIBRARY_PATH)
SYS_PKG_CONFIG_LIBDIR := $(LD_LIBRARY_PATH)

override PKG_CONFIG_PATH = $(JULIAHOME)/usr/lib/pkgconfig:$(SYS_PKG_CONFIG_PATH)
override PKG_CONFIG_LIBDIR = $(JULIAHOME)/usr/lib/pkgconfig:$(SYS_PKG_CONFIG_LIBDIR)

$ make -C deps -j 20
$ make

--snip--snip--
CCLD pcre2grep
/usr/bin/ld: attempted static link of dynamic object `./.libs/libpcre2-8.so'
collect2: error: ld returned 1 exit status
Makefile:1779: recipe for target 'pcre2grep' failed
make[3]: *** [pcre2grep] Error 1
Makefile:1405: recipe for target 'all' failed
make[2]: *** [all] Error 2
/mnt/lustre/opt/prg/.src/julia/julia-1.7.3-full.gnu/deps/pcre.mk:35: recipe for target 'scratch/pcre2-10.36/build-compiled' failed
make[1]: *** [scratch/pcre2-10.36/build-compiled] Error 2
Makefile:60: recipe for target 'julia-deps' failed
make: *** [julia-deps] Error 2
