-
-
Notifications
You must be signed in to change notification settings - Fork 718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow worker extensions to piggy-back on heartbeat #5957
Conversation
To enable better diagnostics, it would be useful to allow worker extensions to piggy-back on the standard heartbeat. This adds an optional "heartbeat" method to extensions, and, if present, calls a custom method that gets sent to the scheduler and processed by an extension of the same name. This also starts to store the extensions on the worker in a named dictionary. Previously this was a list, but I'm not sure that it was actually used anywhere. This is a breaking change without deprecation, but in a space that I suspect no one will care about. I'm happy to provide a fallback if desired.
Unit Test Results 12 files + 2 12 suites +2 5h 56m 47s ⏱️ + 39m 28s For more details on these errors, see this check. Results for commit 2a3c35d. ± Comparison against base commit e8c0669. ♻️ This comment has been updated with latest results. |
This now also adds named extensions in the worker and scheduler |
Tests are failing. I can't reproduce locally. This is just blind praying that it fixes the problem. It should be innocuous.
…d into heartbeat-extensions
This was the result of a bad merge conflict
@fjetter are you ok with this change? I'm changing the DEFAULT_EXTENSIONS collections from lists to named dictionaries. Then secondarily I also add a mechanism to have extensions send messages up to their counterparts along with the regular heartbeat. |
Merging in 24 hours if there are not further comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love how well this mechanism is hidden. For this to work, one needs to define a pair of extensions on both scheduler and worker side that share the same name.
So far, we also do not implement any hooks for extensions but only for plugins. I don't like the idea of introducing hooks for both extensions and plugins, particularly if we have things that are both an extension and a plugin (e.g. stealing
)
Co-authored-by: Florian Jetter <fjetter@users.noreply.github.com>
I agree that this is creating an implicit and hidden contract, which isn't great. That being said, I don't currently see a much better option, other than making things very explicit and published (which seems premature to me, I don't yet have enough confidence that this is the one true way). Thoughts? |
…d into heartbeat-extensions
Why do you want to use the heartbeat for this in the first place? We do have use cases for extensions to submit data regularly but haven't used the heartbeat before. distributed/distributed/semaphore.py Lines 432 to 446 in 4866615
This originally used clients which was a scaling problem (many clients do not scale well right now) and we moved to reusing the worker comm pool in #4195 We discussed using the BatchedSend instead but I decided against this for complexity reasons back then. However, if performance is of concern that may probably be the most straight forward approach.
The one property I see in the heartbeat that is unique is its "scale interval to number of workers" property. Is this what you require? |
Require is a strong word, but yes, that's the gesture that I'm trying to retain. At the many hundred worker scale sending heartbeat like messages becomes a problem. We should probably avoid it generally. If semaphores are doing this then we should consider shifting them over to a system like this. Are there other cases that you can think of? |
The client itself is not scaling well at all. I never debugged this properly but am suspecting heartbeats as one of the reasons. You have many clients when using tasks-in-tasks or other worker_client features. I've seen users being bothered by this without knowing it, e.g. when using prefect. I can also see the scheduler_info thing in the client... maybe more? from the top of my head, I can't think of anything else. Briefly looking over the extensions, nothing jumps out at me |
I'll try to take a look at the semaphore extension today and see if it makes sense to bring that over to something like this. I imagine that this hasn't been as much of an issue just because semaphores are less commonly used. For things like shuffle I would expect this to come up more often. (heads-up, I'm at a conference today and may be less responsive than normal) |
I suggest that, as follow-up work, we consider replacing all extensions with plugins. Thoughts? |
Sounds good. This PR should then use a plugin heartbeat and not an extension heartbeat, shouldn't it? This would also offer a natural place to document this |
It sounds like you're suggesting moving the heartbeat functionality from extensions to plugins. Is that correct? If so, would you also expect systems like shuffling to use plugins rather than extensions? Semaphores too? In principle I think that it would be good to unify on one extension mechanism. This would be a big change though. |
Actually, it looks like plugins don't any heartbeat mechanism today. I'm still thinking that this should go in, and then sometime in the future we should think about trying to unify plugins and heartbeats (and if that makes sense). That seems like a distraction to me right now though. |
This is how I understood your proposal. At this stage, I figured this would be an easy change.
Moving everything is most definitely not in scope for this PR and I didn't want to suggest you doing all that work now.
No it doesn't. nothing has a proper heartbeat mechanism and I figured this would be an easy way to make both sides happy. |
Follow-up issue on extensions: #6011 |
I spoke with @fjetter in a side conversation. He said that he was comfortable enough with this approach given that there isn't a trivial approach to heartbeats elsewhere, and given that this is mostly just being used for diagnostics. I plan to look over tests again and then merge. |
To enable better diagnostics, it would be useful to allow worker
extensions to piggy-back on the standard heartbeat. This adds an
optional "heartbeat" method to extensions, and, if present, calls a
custom method that gets sent to the scheduler and processed by an
extension of the same name.
This also starts to store the extensions on the worker in a named
dictionary. Previously this was a list, but I'm not sure that it was
actually used anywhere. This is a breaking change without deprecation,
but in a space that I suspect no one will care about. I'm happy to
provide a fallback if desired.