Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow worker extensions to piggy-back on heartbeat #5957

Merged
merged 21 commits into from
Mar 29, 2022

Conversation

mrocklin
Copy link
Member

To enable better diagnostics, it would be useful to allow worker
extensions to piggy-back on the standard heartbeat. This adds an
optional "heartbeat" method to extensions, and, if present, calls a
custom method that gets sent to the scheduler and processed by an
extension of the same name.

This also starts to store the extensions on the worker in a named
dictionary. Previously this was a list, but I'm not sure that it was
actually used anywhere. This is a breaking change without deprecation,
but in a space that I suspect no one will care about. I'm happy to
provide a fallback if desired.

To enable better diagnostics, it would be useful to allow worker
extensions to piggy-back on the standard heartbeat.  This adds an
optional "heartbeat" method to extensions, and, if present, calls a
custom method that gets sent to the scheduler and processed by an
extension of the same name.

This also starts to store the extensions on the worker in a named
dictionary.  Previously this was a list, but I'm not sure that it was
actually used anywhere.  This is a breaking change without deprecation,
but in a space that I suspect no one will care about.  I'm happy to
provide a fallback if desired.
@github-actions
Copy link
Contributor

github-actions bot commented Mar 17, 2022

Unit Test Results

       12 files  +       2         12 suites  +2   5h 56m 47s ⏱️ + 39m 28s
  2 675 tests +       1    2 593 ✔️ ±       0    81 💤 ±    0  0 ±0  1 🔥 +1 
15 954 runs  +2 670  15 095 ✔️ +2 517  857 💤 +151  1 +1  1 🔥 +1 

For more details on these errors, see this check.

Results for commit 2a3c35d. ± Comparison against base commit e8c0669.

♻️ This comment has been updated with latest results.

@mrocklin
Copy link
Member Author

This now also adds named extensions in the worker and scheduler

@mrocklin
Copy link
Member Author

@fjetter are you ok with this change? I'm changing the DEFAULT_EXTENSIONS collections from lists to named dictionaries. Then secondarily I also add a mechanism to have extensions send messages up to their counterparts along with the regular heartbeat.

@mrocklin
Copy link
Member Author

Merging in 24 hours if there are not further comments

@mrocklin mrocklin mentioned this pull request Mar 22, 2022
Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love how well this mechanism is hidden. For this to work, one needs to define a pair of extensions on both scheduler and worker side that share the same name.

So far, we also do not implement any hooks for extensions but only for plugins. I don't like the idea of introducing hooks for both extensions and plugins, particularly if we have things that are both an extension and a plugin (e.g. stealing)

distributed/stealing.py Outdated Show resolved Hide resolved
distributed/tests/test_worker.py Outdated Show resolved Hide resolved
Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
@mrocklin
Copy link
Member Author

I don't love how well this mechanism is hidden. For this to work, one needs to define a pair of extensions on both scheduler and worker side that share the same name. Correct.

So far, we also do not implement any hooks for extensions but only for plugins. I don't like the idea of introducing hooks for both extensions and plugins, particularly if we have things that are both an extension and a plugin (e.g. stealing).

I agree that this is creating an implicit and hidden contract, which isn't great. That being said, I don't currently see a much better option, other than making things very explicit and published (which seems premature to me, I don't yet have enough confidence that this is the one true way). Thoughts?

@fjetter
Copy link
Member

fjetter commented Mar 23, 2022

That being said, I don't currently see a much better option, other than making things very explicit and published (which seems premature to me, I don't yet have enough confidence that this is the one true way). Thoughts?

Why do you want to use the heartbeat for this in the first place? We do have use cases for extensions to submit data regularly but haven't used the heartbeat before.
For instance, the Semaphore requires regular heartbeat-like messages to refresh leases, see

async def _refresh_leases(self):
if self.refresh_leases and self._leases:
logger.debug(
"%s refreshing leases for %s with IDs %s",
self.id,
self.name,
self._leases,
)
await retry_operation(
self.scheduler.semaphore_refresh_leases,
lease_ids=list(self._leases),
name=self.name,
operation="semaphore refresh leases: id=%s, lease_ids=%s, name=%s"
% (self.id, list(self._leases), self.name),
)

This originally used clients which was a scaling problem (many clients do not scale well right now) and we moved to reusing the worker comm pool in #4195
We discussed using the BatchedSend instead but I decided against this for complexity reasons back then. However, if performance is of concern that may probably be the most straight forward approach.

The one property I see in the heartbeat that is unique is its "scale interval to number of workers" property. Is this what you require?

@mrocklin
Copy link
Member Author

The one property I see in the heartbeat that is unique is its "scale interval to number of workers" property. Is this what you require?

Require is a strong word, but yes, that's the gesture that I'm trying to retain. At the many hundred worker scale sending heartbeat like messages becomes a problem. We should probably avoid it generally. If semaphores are doing this then we should consider shifting them over to a system like this. Are there other cases that you can think of?

@fjetter
Copy link
Member

fjetter commented Mar 23, 2022

Are there other cases that you can think of?

The client itself is not scaling well at all. I never debugged this properly but am suspecting heartbeats as one of the reasons. You have many clients when using tasks-in-tasks or other worker_client features. I've seen users being bothered by this without knowing it, e.g. when using prefect. I can also see the scheduler_info thing in the client... maybe more?

from the top of my head, I can't think of anything else. Briefly looking over the extensions, nothing jumps out at me

@mrocklin
Copy link
Member Author

I'll try to take a look at the semaphore extension today and see if it makes sense to bring that over to something like this. I imagine that this hasn't been as much of an issue just because semaphores are less commonly used. For things like shuffle I would expect this to come up more often.

(heads-up, I'm at a conference today and may be less responsive than normal)

@mrocklin
Copy link
Member Author

I suggest that, as follow-up work, we consider replacing all extensions with plugins. Thoughts?

@fjetter
Copy link
Member

fjetter commented Mar 25, 2022

I suggest that, as follow-up work, we consider replacing all extensions with plugins. Thoughts?

Sounds good. This PR should then use a plugin heartbeat and not an extension heartbeat, shouldn't it? This would also offer a natural place to document this

@mrocklin
Copy link
Member Author

Sounds good. This PR should then use a plugin heartbeat and not an extension heartbeat, shouldn't it? This would also offer a natural place to document this

It sounds like you're suggesting moving the heartbeat functionality from extensions to plugins. Is that correct? If so, would you also expect systems like shuffling to use plugins rather than extensions? Semaphores too?

In principle I think that it would be good to unify on one extension mechanism. This would be a big change though.

@mrocklin
Copy link
Member Author

Sounds good. This PR should then use a plugin heartbeat and not an extension heartbeat, shouldn't it? This would also offer a natural place to document this

Actually, it looks like plugins don't any heartbeat mechanism today.

I'm still thinking that this should go in, and then sometime in the future we should think about trying to unify plugins and heartbeats (and if that makes sense). That seems like a distraction to me right now though.

@fjetter
Copy link
Member

fjetter commented Mar 28, 2022

It sounds like you're suggesting moving the heartbeat functionality from extensions to plugins. Is that correct? If so, would you also expect systems like shuffling to use plugins rather than extensions?

This is how I understood your proposal. At this stage, I figured this would be an easy change.

Semaphores too?

In principle I think that it would be good to unify on one extension mechanism. This would be a big change though.

Moving everything is most definitely not in scope for this PR and I didn't want to suggest you doing all that work now.

Actually, it looks like plugins don't any heartbeat mechanism today.

No it doesn't. nothing has a proper heartbeat mechanism and I figured this would be an easy way to make both sides happy.

@mrocklin
Copy link
Member Author

Follow-up issue on extensions: #6011

@mrocklin
Copy link
Member Author

I spoke with @fjetter in a side conversation. He said that he was comfortable enough with this approach given that there isn't a trivial approach to heartbeats elsewhere, and given that this is mostly just being used for diagnostics. I plan to look over tests again and then merge.

@mrocklin mrocklin merged commit 9fbf6d4 into dask:main Mar 29, 2022
@mrocklin mrocklin deleted the heartbeat-extensions branch March 29, 2022 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants