
v2.x: XRC UDCM openib failures while running Mellanox CI #3890

Closed
artpol84 opened this issue Jul 13, 2017 · 10 comments
@artpol84
Contributor

Background information

Silent Mellanox Jenkins failures were observed recently.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

The failures seem to be observed on the GitHub v2.x branch only.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Regular Mellanox CI build

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server release 7.2 (Maipo)
  • Computer hardware: x86_64
  • Network type: Mellanox mlx5 adapters

Details of the problem

The following command silently fails:

20:54:55 + /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun -np 8 \
-bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh --report-state-on-timeout \
--get-stack-traces --timeout 900 -mca btl_openib_if_include mlx5_0:1 \
-x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm \
-mca pml ob1 -mca btl self,openib \
-mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 \
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
20:54:55 [1499968495.528199] [jenkins03:1355 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.535609] [jenkins03:1354 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.534361] [jenkins03:1359 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.541761] [jenkins03:1356 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.552215] [jenkins03:1360 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.560606] [jenkins03:1361 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.562930] [jenkins03:1353 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:55 [1499968495.567548] [jenkins03:1363 :0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3496.08
20:54:56 + jenkins_cleanup
20:54:56 + echo 'Script exited with code = 1'
20:54:56 Script exited with code = 1
20:54:56 + rm -rf /tmp/tmp.8mj45mghXh
20:54:56 + echo 'rm -rf ... returned 0'
20:54:56 rm -rf ... returned 0

The expected output is:

21:43:05 Hello, world, I am 4 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 6 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 0 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 2 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 7 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 5 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 1 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)
21:43:05 Hello, world, I am 3 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-g8a8b8cb, Unreleased developer copy, 141)

The same command with btl/tcp works fine:

$ /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun --debug-daemons -np 8 \
-bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh --report-state-on-timeout \
--get-stack-traces --timeout 900 -mca btl_openib_if_include mlx5_0:1 \
-x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm \
-mca pml ob1 -mca btl self,tcp \
-mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 \
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
[jenkins03:01400] [[15875,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[jenkins03:01400] [[15875,0],0] orted_cmd: received add_local_procs
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 8
  MPIR_proctable:
    (i, host, exe, pid) = (0, jenkins03, /usr/bin/taskset, 1416)
    (i, host, exe, pid) = (1, jenkins03, /usr/bin/taskset, 1417)
    (i, host, exe, pid) = (2, jenkins03, /usr/bin/taskset, 1419)
    (i, host, exe, pid) = (3, jenkins03, /usr/bin/taskset, 1420)
    (i, host, exe, pid) = (4, jenkins03, /usr/bin/taskset, 1421)
    (i, host, exe, pid) = (5, jenkins03, /usr/bin/taskset, 1423)
    (i, host, exe, pid) = (6, jenkins03, /usr/bin/taskset, 1428)
    (i, host, exe, pid) = (7, jenkins03, /usr/bin/taskset, 1431)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
Hello, world, I am 2 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 4 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 0 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 3 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 7 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 6 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 1 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
Hello, world, I am 5 of 8, (Open MPI v2.1.2a1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2a1, repo rev: v2.1.1-92-gbd04a7d, Unreleased developer copy, 141)
[jenkins03:01400] [[15875,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD
[jenkins03:01400] [[15875,0],0] orted_cmd: received exit cmd
[jenkins03:01400] [[15875,0],0] orted_cmd: all routes and children gone - exiting

Here is a more detailed log (with BTL verbosity on):
openib_failure.txt

The Mellanox Jenkins script has been updated to output the exit status, so in the future this behavior will not cause such confusion.
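
For illustration, a minimal sketch (not the actual Mellanox CI script) of the kind of wrapper that makes the failure visible: run the test, capture mpirun's exit status, and propagate it so the job fails loudly. The binary path is a placeholder and mpirun is assumed to be on PATH.

#!/bin/bash
# Hypothetical CI wrapper around the failing test case.
mpirun -np 8 -mca pml ob1 -mca btl self,openib ./hello_c
rc=$?
echo "mpirun exited with code = $rc"
exit "$rc"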

@artpol84
Contributor Author

@hppritcha @jsquyres @hjelmn @bwbarrett @miked-mellanox @jladd-mlnx @Di0gen

@jsquyres
Member

Ah, so there was an actual failure, it was just silent? Got it. Thanks for tracking it down and making it non-silent for the future.

@artpol84
Contributor Author

With pml/yalla and pml/ucx those tests are fine as well.
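
For reference, a sketch of the equivalent hello_c run over pml/ucx (options taken from the CI command above; install paths shortened for readability):

mpirun -np 8 -bind-to none -mca pml ucx \
    -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm \
    ./hello_c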

@artpol84
Contributor Author

What is the right way to proceed?
Should we wait for the fix, or disable openib testing since it disturbs other PRs?

@jsquyres jsquyres changed the title openib failures while running Mellanox CI XRC UDCM openib failures while running Mellanox CI Jul 14, 2017
@jsquyres jsquyres changed the title XRC UDCM openib failures while running Mellanox CI v2.x: XRC UDCM openib failures while running Mellanox CI Jul 14, 2017
@jsquyres
Member

In reading the description of the bug, I didn't realize it was a real openib problem -- I think most people will have missed the logfile you put at the bottom of the description.

@hjelmn @hppritcha @bharatpotnuri There appears to be a problem with XRC in the openib BTL in the v2.x branch right now. See below for a snippet from the logfile Artem included earlier in the ticket. Who will fix this?

[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:2540:udcm_xrc_send_qp_create] creating xrc send qp
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2540:udcm_xrc_send_qp_create] creating xrc send qp
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2677:udcm_xrc_recv_qp_create] creating xrc receive qp
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:2677:udcm_xrc_recv_qp_create] creating xrc receive qp
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2440:udcm_xrc_send_qp_connect] Connecting send qp: 0x85a688, remote qp: 129875
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:2440:udcm_xrc_send_qp_connect] Connecting send qp: 0x8656c8, remote qp: 129876
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:581:udcm_endpoint_init_self_xrc] successfully created loopback queue pair
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:1926:udcm_finish_connection] finishing connection for endpoint 0x83f650.
[jenkins03][[13077,1],0][btl_openib.c:849:init_ib_proc_nolock] got 1 port_infos 
[jenkins03][[13077,1],0][btl_openib.c:852:init_ib_proc_nolock] got a subnet fe80000000000000[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:581:udcm_endpoint_init_self_xrc] successfully created loopback queue pair
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:1926:udcm_finish_connection] finishing connection for endpoint 0x85a680.
[jenkins03][[13077,1],1][btl_openib.c:1309:mca_btl_openib_del_procs] in del_procs 1, setting another endpoint to null
[jenkins03:02165] mca: bml: Using openib btl for send to [[13077,1],0] on node jenkins03

[jenkins03][[13077,1],0][btl_openib.c:855:init_ib_proc_nolock] Got a matching subnet!
[jenkins03][[13077,1],0][btl_openib.c:1309:mca_btl_openib_del_procs] in del_procs 0, setting another endpoint to null
[jenkins03][[13077,1],1][btl_openib_endpoint.c:406:mca_btl_openib_endpoint_destruct] Unregistered XRC Recv QP:129876

[jenkins03][[13077,1],0][btl_openib_endpoint.c:406:mca_btl_openib_endpoint_destruct] Unregistered XRC Recv QP:129875

[jenkins03:02164] mca: bml: Using openib btl for send to [[13077,1],1] on node jenkins03
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:796:udcm_module_start_connect] endpoint 0x83f620 (lid 3, ep index 0)
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:2360:udcm_xrc_start_connect] The IB addr: sid fe80000000000000 lid 101 with status 3, subscribing to this address
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:2826:udcm_xrc_send_request] sending xrc request for endpoint 0x83f620
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:2305:udcm_set_message_timeout] activating timeout for message 0x9831d0
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:1670:udcm_new_message] created message 0x9831d0 with type 105
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:2843:udcm_xrc_send_request] Sending XConnect2 with qp: 129876
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:1975:udcm_process_messages] WC: wr_id: 0x0000000536870912, status: 0, opcode: 0x80, byte_len: 60, imm_data: 0x00000000, qp_num: 0x0001fb47, src_qp: 0x0001fb48, wc_flags: 0x0, slid: 0x0003
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2020:udcm_process_messages] received message. type: 105, lcl_ep = 0x8af020, rem_ep = 0x83f620, src qpn = 129864, length = 96, local buffer # = 0
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:1771:udcm_send_ack] sending ack for message 0x9831d0 on ep 0x8af020
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:1975:udcm_process_messages] WC: wr_id: 0x0000000536870912, status: 0, opcode: 0x80, byte_len: 60, imm_data: 0x00000000, qp_num: 0x0001fb48, src_qp: 0x0001fb47, wc_flags: 0x0, slid: 0x0003
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:1788:udcm_handle_ack] got ack for message 0x9831d0 from slid 0x0003 qp 0x0001fb47
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:1796:udcm_handle_ack] found matching message
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:2174:udcm_message_callback] running message thread
[jenkins03][[13077,1],1][connect/btl_openib_connect_udcm.c:2219:udcm_message_callback] exiting message thread
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:796:udcm_module_start_connect] endpoint 0x8af020 (lid 3, ep index 1)
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2360:udcm_xrc_start_connect] The IB addr: sid fe80000000000000 lid 101 with status 3, subscribing to this address
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2826:udcm_xrc_send_request] sending xrc request for endpoint 0x8af020
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2305:udcm_set_message_timeout] activating timeout for message 0x973310
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:1670:udcm_new_message] created message 0x973310 with type 105
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2843:udcm_xrc_send_request] Sending XConnect2 with qp: 129875
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2174:udcm_message_callback] running message thread
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2635:udcm_xrc_recv_qp_connect] Connecting Recv QP

[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2640:udcm_xrc_recv_qp_connect] Failed to register qp_num: 129876, get error: Invalid argument (22)
. Replying with RNR
[jenkins03][[13077,1],0][connect/btl_openib_connect_udcm.c:2980:udcm_xrc_handle_xconnect] rejecting request for reason -3

artpol84 added a commit to mellanox-hpc/jenkins_scripts that referenced this issue Jul 15, 2017
@artpol84
Contributor Author

artpol84 commented Jul 15, 2017

I have temporarily disabled openib for the v2.x branch to allow PRs to be tested.
Please let me know ASAP when the problem is fixed so I can re-enable it.
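
One way a CI script can skip just the openib BTL while the bug is open is MCA component negation; this is a sketch, not the exact Jenkins change:

# "^openib" removes the openib component from the BTL selection list.
mpirun -np 8 -mca pml ob1 -mca btl ^openib ./hello_c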

@hppritcha
Member

Looks like we need to cherry-pick 56bdcd0.
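
A sketch of what that cherry-pick onto the release branch could look like (remote and branch names are assumed):

git fetch origin
git checkout v2.x
git cherry-pick -x 56bdcd0   # -x records "(cherry picked from commit ...)" in the commit message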

hppritcha pushed a commit to hppritcha/ompi that referenced this issue Jul 17, 2017
Before dynamic add_procs support was committed to master, we called
add_procs with every proc in the job. The XRC code in the openib BTL
was taking advantage of this and setting the number of work queue
entries (WQE) based on all the procs on a remote node. Since that is
no longer the case, we cannot simply increment the sd_wqe field on the
queue pair. To fix the issue, a new field has been added to the XRC
queue pair structure to keep track of the total number of WQEs on the
queue pair. If a new endpoint is added that increases the number of
WQEs and the XRC queue pair is already connected, the code will
attempt to modify the number of WQEs on the queue pair. A failure is
ignored because all that happens is that the number of active send
work requests on an XRC queue pair will be more limited.

related to open-mpi#1721
fixes open-mpi#3890

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 56bdcd0)
@hppritcha
Member

hppritcha commented Jul 25, 2017

Per discussions at the devel core meeting on 7/25/17, a PR will be opened to disable XRC support in the openib BTL. Mellanox will try to determine why the test fails on their configuration over the next week. If no progress is made by next Tuesday, this PR will get merged into master.

@artpol84
Contributor Author

@hppritcha, I checked and we won't be able to get to this in the near future.
No need to wait any longer.

hppritcha added a commit to hppritcha/ompi that referenced this issue Jul 26, 2017
Change the default of the XRC configure enable option to disabled. If a user wants
to give it a try, they have to explicitly ask for it.

Modify the configury help message to indicate it is not enabled by default.

Related to open-mpi#3890
Fixes open-mpi#3969

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
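
For illustration, a sketch of how a user would explicitly opt back in after this change; the exact option name lives in config/opal_check_openfabrics.m4, and --enable-openib-connectx-xrc is assumed here:

# Explicitly request XRC support in the openib BTL (assumed flag name):
./configure --prefix=$HOME/ompi-install --enable-openib-connectx-xrc
make -j 8 all install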
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 2, 2017
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 4, 2017
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 4, 2017
(each with the same commit message as the Jul 26 entry above)
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 4, 2017
Disable XRC support for OpenIB BTL

Related to open-mpi#3890
Fixes open-mpi#3969

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 8223d4c)

Conflicts:
	NEWS
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 4, 2017
disable XRC in OpenIB BTL due to lack of support.

Related to open-mpi#3890
Fixes open-mpi#3969

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 8223d4c)
(cherry picked from commit c22a7c7)

Conflicts:
	NEWS
	config/opal_check_openfabrics.m4
hppritcha added a commit to hppritcha/ompi that referenced this issue Aug 4, 2017
disable XRC in OpenIB BTL due to lack of support.

Related to open-mpi#3890
Fixes open-mpi#3969

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 8223d4c)

Conflicts:
	NEWS
@jsquyres
Member

jsquyres commented Oct 1, 2018

I think this was addressed long ago.

@jsquyres jsquyres closed this as completed Oct 1, 2018