
flux MCA plugin that implements fence_nb() as a blocking interface causes deadlock in UCX teardown #11938

Closed
garlick opened this issue Sep 21, 2023 · 12 comments

garlick commented Sep 21, 2023

Problem: a 3-node MPI hello world (init/barrier/finalize) sometimes hangs in MPI_Finalize()

Environment:

  • Open MPI v4.1.2, package: Open MPI gyllen@rzwhamo6 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021
  • managed with spack/environment modules
  • corona cluster at LLNL running TOSS 4 (RHEL 8.8)
  • primary resource manager and launcher of MPI programs is Flux

Stack traces show

  • rank 0 is stuck in opal_common_ucx_wait_all_requests ()
  • rank 1,2 are stuck in opal_pmix.fence_nb()

It appears that UCX relies on the nonblocking nature of fence_nb() and ucp_disconnect_nb() to let the PMI barrier and the sending of UCX disconnect requests progress in parallel. But the flux MCA plugin implements fence_nb() as a blocking call. One theory: disconnect messages are sometimes queued rather than sent immediately, and once a rank enters the blocking fence_nb() nothing drives progress, so the queued disconnects are never sent, the fence never completes, and the job deadlocks.

Probably the right solution for users is to use the flux pmix plugin by running with -o pmi=pmix; this is confirmed to resolve the problem. However, #8380 effectively converted a segfault due to calling a NULL fence_nb() into a semi-reproducible hang, which is arguably not an improvement. Perhaps it would be better to revert it and have UCX treat the lack of a fence_nb() as a fatal runtime error.

Further details: flux-framework/flux-core#5460

Edit: however that was just a theory! Maybe someone who knows ompi/ucx code could confirm or deny?


rhc54 commented Sep 21, 2023

@garlick Perhaps you can refresh my memory. Doesn't your flux pmix plugin use PMI-1 calls behind the PMIx calls to actually implement things? Or has that changed?

Just trying to understand how using that PMI-1 based plugin would solve the fence_nb problem - do you have a way to execute the fence in a non-blocking manner in that plugin? If so, then maybe the correct answer would be to bring that implementation to the OMPI flux MCA plugin.

I do agree, however, that having a non-blocking fence actually be a blocking function is probably not a good thing to do.


garlick commented Sep 21, 2023

When you say "flux pmix plugin" are you talking about the flux-pmix plugin to Flux or the flux MCA plugin to the pmix framework in ompi, e.g. "MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.0)"?

If the former: no, that plugin embeds your openpmix server, so ompi gets the full PMIx interface, including the non-blocking fence. When we use that, this problem does not manifest.

If the latter: yes, exactly, it dlopens Flux's libpmi.so and makes calls like PMI_Init() etc. There is no quick way to make it non-blocking - it would have to embed the PMI-1 wire protocol client (the guts of our libpmi.so) and then integrate its file descriptor with ompi's event loop. That's more work than I'm up for, given that this plugin has already been removed upstream.

In this issue I was mainly interested in seeing if there is any quick fix we could apply to avoid this deadlock, which is annoying users. A clear, human-readable message explaining that UCX requires a non-blocking fence that the currently loaded pmix plugin does not provide, followed by an abort, would be better.

OR, if the UCX disconnects could be forced to complete before the barrier is called, that would be really nice. But I won't hold my breath on that one.


rhc54 commented Sep 22, 2023

Yes, thanks - that does help tickle the old brain cells. The problem is that there are a number of user-controlled behaviors that cause non-blocking fences to be executed - it isn't just a UCX finalize issue. Quite frankly, you've just been getting lucky that users haven't hit this more broadly 😄 Attempting to defensively program around the limitation everywhere one might encounter it is probably too big a change for OMPI to pursue.

Only a couple of things you could really do here:

  • change the fake fence_nb entry in the flux MCA plugin to report "cannot be used" and abort. The only problem here is that if you don't hit this until finalize, the user could lose a lot of execution time only to abort at the very end. Not very appealing, I suspect.
  • simply remove the flux MCA plugin. In this case, OMPI will politely exit with an error message at MPI_Init if the user forgets to tell flux to use its PMIx support (I gather that is your -o pmi=pmix option), thus ensuring that this problem isn't hit.

HTH


garlick commented Sep 22, 2023

simply remove the flux MCA plugin. In this case, OMPI will politely exit with an error message at MPI_Init if the user forgets to tell flux to use its PMIx support (I gather that is your -o pmi=pmix option), thus ensuring that this problem isn't hit.

Will it do that or will it start a bunch of singletons?


rhc54 commented Sep 22, 2023

Hmmm...I'm not entirely sure. To be safe, you could replace the existing MCA plugin with one that looks for the flux indicators (usually the component looks for some flux-specific envars to determine "I am the one!") and just reports "you need to add -o pmi=pmix to your cmd line" before returning an error. The PMIx component should be at a higher priority (or you can lower the flux one if necessary), so if PMIx is seen, that will get picked. Then this becomes the fallback and can just advise to use the PMIx support; the returned error will cause OMPI to abort in MPI_Init.


garlick commented Sep 22, 2023

OTOH, in v5.x there will be no flux schizo plugin, so maybe it is better to have consistent behavior going forward. I'll check in with our team and make sure nobody objects, but I'm inclined to remove it and let ompi work as designed, without launcher-specific hacks.

At least bunch-o-singletons is an MPI failure mode well understood by our support staff.

jsquyres (Member) commented

@garlick Is this resolved?


garlick commented Sep 24, 2023

I need to double check with the team this week to see if it's OK, but I was going to propose that we remove the flux plugins in the 4.x branches. (It's already gone in 5.x)

jsquyres (Member) commented

@garlick Is Flux only used at LLNL?


garlick commented Sep 25, 2023

No.

jsquyres (Member) commented

@garlick Then it would be difficult for us to remove the flux plugins in the middle of a release series.


garlick commented Sep 25, 2023

Ah OK. Well that de-complicates things then. This can be closed if you like.
