HCOLL is causing XRC problems in v2.x #4087

Closed
artpol84 opened this issue Aug 15, 2017 · 7 comments

@artpol84
Contributor

artpol84 commented Aug 15, 2017

This problem was originally treated as a btl/openib issue: #3890.
However, a more detailed investigation indicates that it is an effect of the coll/hcoll component: #4082

Without hcoll it runs ok:

$ bash -x ./run.sh                                                                                                                                                      
+ ./mpirun -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,openib -mca coll '^hcoll' -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 /hpc/home/USERS/artemp/scrap/OMPI/ompi/examples/hello_c
Hello, world, I am 2 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 7 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 4 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 6 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 3 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 5 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 0 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 1 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
+ exit 0

Enabling hcoll introduces the problem:

$ bash -x ./run.sh                                                                                                                                                      
+ ./mpirun -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,openib -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 /hpc/home/USERS/artemp/scrap/OMPI/ompi/examples/hello_c
[1502826187.186194] [jenkins03:21885:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.188457] [jenkins03:21888:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.192728] [jenkins03:21886:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.194526] [jenkins03:21884:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.200721] [jenkins03:21887:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.206316] [jenkins03:21890:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.209630] [jenkins03:21889:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.215512] [jenkins03:21891:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
+ echo 255
255
@artpol84
Contributor Author

@hjelmn @hppritcha
One side effect of this issue that I've noticed is that OMPI fails silently in this case, leaving a regular user no chance to realize that it failed at all. Even we thought everything was fine at first, because the output was clean.
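
Since the failing run above prints only MXM warnings and the exit status (255) is the sole sign of failure, a launch script can at least make that status visible. The following is a minimal sketch, not part of the original report: `run_checked` is a hypothetical helper name, and the stand-in commands at the bottom should be replaced by the real mpirun line. It does not address the missing error message in OMPI itself.

```shell
#!/bin/bash
# Hypothetical helper (not from the original thread): run a command and
# print a clear message on stderr when it exits with a non-zero status.
run_checked() {
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "ERROR: '$*' exited with status $rc" >&2
    fi
    # Pass the original status on so callers can still react to it.
    return "$rc"
}

# Stand-in commands for illustration; substitute the actual mpirun invocation.
run_checked true
run_checked sh -c 'exit 255' || echo "job failed"
```

With a wrapper like this, a run that dies without any error output still ends with an explicit line naming the command and its exit status.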

@artpol84
Contributor Author

@jsquyres
Member

Per #4082 (comment), @vspetrov is claiming the problem is definitely in the openib BTL.

Is there an XRC problem in both hcoll and openib?

@artpol84
Contributor Author

@jsquyres I think that @vspetrov demonstrated that hcoll is just triggering the problem in openib: he mimicked its activity with Allreduce while hcoll was disabled.
So, as I understand it, this is not an hcoll bug, but hcoll creates the conditions under which XRC fails.

@jsquyres
Member

So do you want to close all those "Revert..." PRs?

@artpol84
Contributor Author

Yes. I guess I have answered the question from this week's telecon: only v2.x shows this kind of problem with openib; all other branches are OK (according to our Jenkins, at least).

@artpol84
Contributor Author

I think we can close this issue as well
