More comprehensive WorkerState task counters #7167
tasks.add_metric(["ready"], len(self.server.state.ready))
tasks.add_metric(["waiting"], self.server.state.waiting_for_data_count)
for k, n in ws.task_counts.items():
    tasks.add_metric([k], n)
Renamed stored -> memory and added many more measures
What's the difference between things in data and actors in "memory" state? (I noticed that actors weren't included in the previous metric and are included in the new metric.)
The "real" instance of an actor lives in WorkerState.actors instead of data. Its proxies on remote workers live in data. Actors are an edge case I would not spend time fine-tuning anything towards anyway.
# Actors can be in any state other than {fetch, flight, missing}
n_actors_in_memory = sum(
    self.tasks[key].state == "memory" for key in self.actors
)
This is O(n) but actors are such a niche feature that we should not care in 99% of the cases
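To illustrate the cost being discussed, here is a minimal sketch of the same counting pattern outside of the worker. The `Task` class, the `count_actors_in_memory` helper, and the sample keys are hypothetical stand-ins, not the real `WorkerState` internals; the point is only that the loop visits each actor key once, so the cost scales with the number of actors rather than the number of tasks.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a worker task; only the state is modelled."""

    key: str
    state: str


def count_actors_in_memory(tasks: dict[str, Task], actors: dict[str, object]) -> int:
    # One dict lookup per actor key: O(len(actors)), independent of len(tasks).
    # Summing booleans counts the True values.
    return sum(tasks[key].state == "memory" for key in actors)


# Made-up sample data: two actors, one of which is in memory,
# plus one ordinary task that is not an actor.
tasks = {
    "a1": Task("a1", "memory"),
    "a2": Task("a2", "executing"),
    "x": Task("x", "memory"),
}
actors = {"a1": object(), "a2": object()}

n_actors_in_memory = count_actors_in_memory(tasks, actors)
```

With this sample data only `a1` is counted; `x` is in memory but is not an actor, so it never enters the loop.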
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0, 15 suites ±0, 6h 5m 41s ⏱️ (-12m 50s)
For more details on these failures, see this check.
Results for commit 900e8b2. ± Comparison against base commit 6afce9c.
♻️ This comment has been updated with latest results.
Ready for review and merge. All test failures seem unrelated.
LGTM, thanks @crusaderky!
    "flight": len(self.in_flight_tasks),
}
# released | error
out["other"] = other = len(self.tasks) - sum(out.values())
Out of scope: I am wondering whether having the count for error tasks available would be worth the additional effort of maintaining yet another set. This feels like useful information in Prometheus, but might also be less relevant since we track erred tasks (after retries) on the scheduler as well. What do you think?
It felt a bit useless, from the point of view of performance metrics. An erred task is just sitting there until it's either released or its exception is retrieved.
Fair point, I guess a counter of tasks transitioning into this state would be the interesting bit here rather than any point-in-time count.
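A rough sketch of the distinction drawn in this thread, under stated assumptions: `TinyWorkerState`, its `transition` and `forget` methods, and the `transitions` counter are all hypothetical and heavily simplified, not part of distributed. It shows a point-in-time count where untracked states are folded into "other" as a remainder, next to a monotonic counter of transitions into each state.

```python
from __future__ import annotations

from collections import Counter


class TinyWorkerState:
    """Hypothetical, heavily simplified model; not distributed's WorkerState."""

    # States that get their own point-in-time count in this sketch.
    TRACKED = ("memory", "executing", "flight")

    def __init__(self) -> None:
        self.tasks: dict[str, str] = {}  # key -> current state
        self.transitions: Counter[str] = Counter()  # state -> times entered

    def transition(self, key: str, new_state: str) -> None:
        self.tasks[key] = new_state
        self.transitions[new_state] += 1  # never decremented

    def forget(self, key: str) -> None:
        del self.tasks[key]

    @property
    def task_counts(self) -> dict[str, int]:
        out = {state: 0 for state in self.TRACKED}
        for state in self.tasks.values():
            if state in out:
                out[state] += 1
        # Everything not tracked above (e.g. released or error) is the remainder.
        out["other"] = len(self.tasks) - sum(out.values())
        return out


ws = TinyWorkerState()
ws.transition("a", "executing")
ws.transition("a", "error")
ws.forget("a")  # the erred task is released and disappears from the snapshot
ws.transition("b", "executing")
ws.transition("b", "memory")
ws.transition("c", "executing")
ws.transition("c", "error")  # still present, but only visible as "other"

snapshot = ws.task_counts
errors_seen = ws.transitions["error"]
```

In this toy run the snapshot reports one task in memory and one in "other", while the transition counter records that two tasks entered the error state, including the one that has since been forgotten.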
fetch
state.stored has been renamed to memory
CC @ntabris