Replace AsyncProcess exit handler by weakref.finalize #4184

Merged · 5 commits merged into dask:master on Oct 30, 2020

Conversation

@pentschev (Member)

This is an attempt at solving #4181. I believe this is a more reliable approach, given we can ensure the process will not get destroyed before it's garbage collected.
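
For illustration, here is a minimal sketch of the per-object weakref.finalize pattern the PR moves toward. The class and method names (AsyncProcessLike, _terminate, _finalizer) are hypothetical and not the actual distributed.process.AsyncProcess implementation; the point is only that each wrapper registers its own finalizer, so cleanup is tied to the object's lifetime rather than to a module-level exit handler.

import weakref
from multiprocessing import Process


class AsyncProcessLike:
    """Illustrative wrapper, not the real distributed.process.AsyncProcess."""

    def __init__(self, target):
        self._process = Process(target=target)
        # The callback must not hold a reference to `self`; otherwise the
        # finalizer would keep the wrapper alive forever. Passing the raw
        # process object as an argument avoids that.
        self._finalizer = weakref.finalize(self, self._terminate, self._process)

    @staticmethod
    def _terminate(process):
        # Called when the wrapper is garbage collected, or at interpreter
        # exit if the wrapper is still alive at that point.
        if process.is_alive():
            process.terminate()
            process.join(timeout=3)

    def start(self):
        self._process.start()

    def close(self):
        # Explicit close still works; invoking the finalizer marks it dead,
        # so the cleanup never runs twice.
        self._finalizer()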

@quasiben (Member)

This does resolve a long-standing cleanup issue for UCX, and we think this is a cleaner approach to halting workers. @jcrist, if you have time, can you look this over?

@quasiben (Member)

@fjetter your thoughts here would also be helpful

@jcrist (Member) left a comment

A few quick comments. No thoughts on whether this is the right approach over other methods - generally I'd hope we could explicitly manage closing these processes rather than requiring a finalizer to implicitly handle them.

Inline review comments on distributed/process.py (outdated, resolved)

@pentschev (Member, Author)

> A few quick comments. No thoughts on whether this is the right approach over other methods - generally I'd hope we could explicitly manage closing these processes rather than requiring a finalizer to implicitly handle them.

I think the main problem we have today is that using exit handlers as globals, as is the case in Distributed, seems incorrect, as the order will not be respected. I haven't looked at all exit handlers, but the few I looked at all have the same problem. A couple of examples:

def _close_global_client():
    """
    Force close of global client. This cleans up when a client
    wasn't close explicitly, e.g. interactive sessions.
    """
    c = _get_global_client()
    if c is not None:
        c._should_close_loop = False
        with suppress(TimeoutError, RuntimeError):
            c.close(timeout=3)


atexit.register(_close_global_client)

@atexit.register
def close_clusters():
    for cluster in list(SpecCluster._instances):
        with suppress(gen.TimeoutError, TimeoutError):
            if cluster.status != Status.closed:
                cluster.close(timeout=10)

I don't know whether this is the right approach either; I was hoping to suggest it and begin a discussion on potential pitfalls and alternatives. At the moment I don't see a much better approach.
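
To make the ordering concern concrete, a small standalone example (not distributed code) showing that atexit handlers run in last-in, first-out order of registration, which is fixed at import time and independent of when the objects they clean up were created:

import atexit

# Both handlers are registered at "import time", before any objects exist.
atexit.register(lambda: print("close global client"))  # registered first, runs last
atexit.register(lambda: print("close clusters"))       # registered second, runs first

# Output at interpreter exit:
#   close clusters
#   close global client
#
# The order is decided entirely by registration (module import) order, not by
# the lifetime or creation order of the clients and clusters being cleaned up.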

@jcrist (Member) left a comment

Provided tests pass, this looks good to me.

@fjetter (Member) left a comment

I agree with @jcrist and would prefer a world where we would not need to clean up with finalizers but instead would be able to close everything intentionally.

However, as far as finalizers go, I think this approach is slightly better than the one before, since it deals with each object/process individually instead of in bulk. In particular, as already mentioned, it respects the order in which the objects were created; before, the order was determined at module import time. Not sure if this makes any difference, but this approach also seems simpler to reason about than the module-level finalizer.
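
For comparison, a toy example (again, not distributed code) of the per-object behaviour described above: finalizers created with weakref.finalize that are still alive at interpreter exit are invoked in reverse order of creation, so cleanup mirrors the order in which the objects were set up.

import weakref


class Resource:
    def __init__(self, name):
        self.name = name
        # `print` holds no reference to `self`, so the finalizer does not
        # keep the Resource alive.
        weakref.finalize(self, print, f"cleaning up {name}")


a = Resource("created first")
b = Resource("created second")

# If both objects survive until interpreter exit, the output is:
#   cleaning up created second
#   cleaning up created first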

@fjetter (Member) commented on Oct 26, 2020

There are a lot of failures on Travis/py3.6. Most seem unrelated, but I'm not entirely sure.

@pentschev (Member, Author)

I'm also not sure whether the 3.6 failures are related to this; it seems they aren't, but I'm happy to look into it more if there's any suspicion this PR is the cause.

@quasiben (Member)

I restarted the 3.6 CI job

@pentschev (Member, Author)

Seems like there's still one test failing with a timeout: test_broken_worker_during_computation. Looking at builds from other PRs (https://travis-ci.org/github/dask/distributed/jobs/738345888 and https://travis-ci.org/github/dask/distributed/jobs/738412961), the same test is failing, so I believe it's not a side effect of the changes here.

@fjetter (Member) commented on Oct 29, 2020

The test_broken_worker_during_computation failure is already reported in #4173

@pentschev (Member, Author)

Thanks @fjetter, I hadn't seen that. In that case, I think this PR is safe.

@jrbourbeau (Member) left a comment

Thanks @pentschev! (and @jcrist @fjetter for reviewing)

@jrbourbeau merged commit 8612473 into dask:master on Oct 30, 2020
@jrbourbeau pushed a commit that referenced this pull request on Oct 30, 2020

@pentschev (Member, Author)

Thanks everyone for reviews and merging! :)

@gforsyth pushed a commit to gforsyth/distributed that referenced this pull request on Oct 31, 2020
@pentschev deleted the weakref-finalizer-asyncprocess branch on November 11, 2020.