Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Catch clean up error to prevent crash #3729

Merged
merged 1 commit into from
Jun 8, 2021
Merged

Conversation

liuzhe-lz
Copy link
Contributor

@liuzhe-lz liuzhe-lz commented Jun 7, 2021

Named pipe can sometimes fall into extremely sophisticated states and there is very little benefit to handle them. So when strange things happen, just wait the OS to recycle resources.
The only potential problem is, if the user stops and resumes an experiment in one process, then resuming might fail. I believe this use case is too rare to worth a fix.

@liuzhe-lz liuzhe-lz requested a review from ultmaster June 7, 2021 08:13
@ultmaster
Copy link
Contributor

This PR doesn't seem to fix a problem, but to ignore a problem.

Does it mean that the exception is very rare? In that case, why don't we just raise an exception?

@liuzhe-lz
Copy link
Contributor Author

liuzhe-lz commented Jun 7, 2021

This PR doesn't seem to fix a problem, but to ignore a problem.

Does it mean that the exception is very rare? In that case, why don't we just raise an exception?

Yes, this PR ignores a problem which only happens on experiment exiting. Since the experiment is already done, it's not harmful to have some invalid internal state.
The root cause is that NNI manager and Python script share the pipe and therefore its state depends on NNI manager's exit timing. To fix this, it needs to check the pipe's status before closing (which is not easy), and it must deal with atomic problems for a correct fix.

@liuzhe-lz liuzhe-lz requested a review from QuanluZhang June 7, 2021 10:04
@ultmaster ultmaster merged commit c4d449c into microsoft:master Jun 8, 2021
@liuzhe-lz liuzhe-lz deleted the r-exit branch June 17, 2021 03:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants