
More comprehensive WorkerState task counters #7167

Merged (2 commits, Oct 24, 2022)

Conversation

@crusaderky (Collaborator) commented Oct 20, 2022

  • Expose a more comprehensive and accurate view of the number of tasks in each state on the worker to the scheduler (through the heartbeat) and to Prometheus. Crucially, this PR gives us a measure of network saturation and co-assignment by showing how many tasks are queued up in fetch state.
  • Count the number of tasks in fetch state in O(1) (see the sketch after this list).
  • Closes Inconsistent Worker.waiting_for_data_count #6319
  • Disallow directly writing to Worker.data. I am quite frankly baffled this was possible before. I don't think we should have a deprecation cycle for this?
  • The stored label of the Prometheus tasks metric has been renamed to memory. CC @ntabris
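
The gist of the O(1) bookkeeping is that a per-state counter is adjusted on every task transition instead of scanning all tasks on demand. Below is a minimal, self-contained sketch of that pattern; the class and attribute names are illustrative only and do not match the actual WorkerState implementation.

    from collections import defaultdict


    class TaskCounterSketch:
        """Toy model of per-state task bookkeeping; not the real WorkerState."""

        def __init__(self):
            self.task_states = {}            # key -> current state
            self.counts = defaultdict(int)   # state -> number of tasks in it

        def transition(self, key, new_state):
            # Adjust two counters instead of rescanning all tasks: O(1) per transition
            old_state = self.task_states.get(key)
            if old_state is not None:
                self.counts[old_state] -= 1
            self.task_states[key] = new_state
            self.counts[new_state] += 1

        def task_counts(self):
            # Cheap enough to ship with every heartbeat and expose to Prometheus
            return {state: n for state, n in self.counts.items() if n}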

tasks.add_metric(["ready"], len(self.server.state.ready))
tasks.add_metric(["waiting"], self.server.state.waiting_for_data_count)
for k, n in ws.task_counts.items():
    tasks.add_metric([k], n)
@crusaderky (Collaborator, Author) commented Oct 20, 2022:

Renamed stored -> memory and added many more measures
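
For context, the add_metric calls in the hunk above follow prometheus_client's custom-collector pattern: a single gauge family with a state label gets one sample per task state. Below is a minimal, self-contained sketch of that pattern; the metric name, collector class, and stand-in state object are assumptions for illustration, not the actual distributed code.

    from prometheus_client.core import REGISTRY, GaugeMetricFamily


    class WorkerStateStub:
        # Stand-in for the real worker state; the numbers are made up
        task_counts = {"executing": 3, "fetch": 12, "memory": 40, "flight": 2}


    class TaskStateCollector:
        def __init__(self, state):
            self.state = state

        def collect(self):
            # One gauge family, one sample per task state, labelled by state
            tasks = GaugeMetricFamily(
                "dask_worker_tasks",
                "Number of tasks at this worker",
                labels=["state"],
            )
            for state, n in self.state.task_counts.items():
                tasks.add_metric([state], n)
            yield tasks


    REGISTRY.register(TaskStateCollector(WorkerStateStub()))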

Contributor commented:

What's the difference between things in data and actors in the "memory" state? (I noticed that actors weren't included in the previous metric but are included in the new one.)

@crusaderky (Collaborator, Author) replied:

The "real" instance of an actor lives in WorkerState.actors instead of data. Its proxies on remote workers live in data. Actors are an edge case I would not spend time fine-tuning anything towards anyway.

# Actors can be in any state other than {fetch, flight, missing}
n_actors_in_memory = sum(
    self.tasks[key].state == "memory" for key in self.actors
)
@crusaderky (Collaborator, Author) commented:

This is O(n), but actors are such a niche feature that we should not care in 99% of cases.
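
If the scan ever did become a concern, the same O(1) bookkeeping used for the other counters could in principle be applied to actors as well, e.g. by adjusting an integer whenever an actor's task enters or leaves the memory state. The hook below is purely a hypothetical sketch of that idea, not something this PR adds; _n_actors_in_memory and the method name are invented for illustration.

    def _record_actor_transition(self, ts, new_state):
        # Hypothetical O(1) alternative: adjust a counter on each transition
        # instead of rescanning self.actors whenever task_counts is computed.
        if ts.key in self.actors:
            if ts.state == "memory":
                self._n_actors_in_memory -= 1
            if new_state == "memory":
                self._n_actors_in_memory += 1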

github-actions bot commented Oct 20, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0    15 suites ±0    6h 5m 41s ⏱️ (−12m 50s)
3 153 tests +3:    3 062 passed ✔️ +4,    85 skipped 💤 ±0,    6 failed −1
23 328 runs +21:    22 394 passed ✔️ +22,    913 skipped 💤 ±0,    21 failed −1

For more details on these failures, see this check.

Results for commit 900e8b2. ± Comparison against base commit 6afce9c.

♻️ This comment has been updated with latest results.

@crusaderky self-assigned this Oct 20, 2022
@crusaderky marked this pull request as ready for review on October 20, 2022, 21:48
@crusaderky (Collaborator, Author) commented:

Ready for review and merge. All test failures seem unrelated.

@hendrikmakait (Member) left a comment:

LGTM, thanks @crusaderky!

"flight": len(self.in_flight_tasks),
}
# released | error
out["other"] = other = len(self.tasks) - sum(out.values())
Member commented:

Out of scope: I am wondering whether having the count for error tasks available would be worth the additional effort of maintaining yet another set. This feels like useful information in Prometheus, but might also be less relevant since we track erred tasks (after retries) on the scheduler as well. What do you think?

@crusaderky (Collaborator, Author) replied:

It felt a bit useless, from the point of view of performance metrics. An erred task is just sitting there until it's either released or its exception is retrieved.

Member replied:

Fair point, I guess a counter of tasks transitioning into this state would be the interesting bit here rather than any point-in-time count.
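
As a rough illustration of that idea, a monotonically increasing Prometheus counter bumped on each transition into the erred state would let dashboards plot an error rate rather than a point-in-time gauge. The sketch below is hypothetical; the metric name and the transition hook are assumptions, not existing distributed code.

    from prometheus_client import Counter

    # Hypothetical metric: total transitions into the erred state on this worker
    task_erred_total = Counter(
        "dask_worker_task_erred_total",
        "Total number of task transitions into the erred state",
    )


    def on_transition(key, start_state, final_state):
        # Assumed to be called from a transition hook on the worker
        if final_state == "error":
            task_erred_total.inc()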
