stabilize worker_total_busy_duration #6899

Owen-CH-Leung · 2024-10-11T04:54:29Z

Motivation

Stabilize worker_total_busy_duration so it can be used outside of --cfg tokio_unstable

Solution

Move the API impl out of tokio_unstable flag

Ref: #6546


rcoh · 2024-10-11T13:30:36Z

tokio/src/runtime/metrics/histogram.rs

+/// Whether the histogram used to aggregate a metric uses a linear or
+/// logarithmic scale.
+#[derive(Debug, Copy, Clone, Eq, PartialEq)]
+#[non_exhaustive]
+#[allow(unreachable_pub)]
+pub enum HistogramScale {
+    /// Linear bucket scale
+    Linear,
+
+    /// Logarithmic bucket scale
+    #[allow(dead_code)]


Is this change required for the mean time stabilization, or is it just a coincidental change?

I believe this change is required.

On master there is a second WorkerMetrics struct in tokio/src/runtime/metrics/mock.rs, which does not use the Histogram struct defined in tokio/src/runtime/metrics/histogram.rs.

In order to stabilize the API, I had to remove this mock WorkerMetrics and point to the "real" WorkerMetrics in tokio/src/runtime/metrics/worker.rs instead.

The "real" WorkerMetrics has a field poll_count_histogram of type Option<Histogram>, so tokio/src/runtime/metrics/histogram.rs has to be compiled as well. There, you can see that Histogram and HistogramBuilder both refer to HistogramScale, which is hidden behind tokio_unstable. I think this wouldn't compile otherwise.

(I might be wrong though so feel free to correct me)

I need to read through this PR in more detail, so bear with me, but we shouldn't be making this stable and public if it isn't actually needed to stabilize worker_total_busy_duration (and I'm mostly sure that it isn't).

In general, all the #[allow(dead_code)] attributes required for this change give me the impression that we're exposing something that is only really used in tokio_unstable, so we should find a way to expose it only under tokio_unstable.

Thanks for your comment!

I think the core issue here is that by exposing worker_total_busy_duration, we are also exposing the "real" WorkerMetrics at tokio/src/runtime/metrics/worker.rs, which in turn pulls in HistogramScale.

Currently on the master branch, when the tokio_unstable flag is not passed, the WorkerMetrics in tokio/src/runtime/metrics/mock.rs is compiled (it is just an empty struct). When the flag is passed, the WorkerMetrics in tokio/src/runtime/metrics/worker.rs is compiled instead:

https://github.com/tokio-rs/tokio/blob/master/tokio/src/runtime/metrics/mod.rs#L27-L40
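The arrangement described above can be sketched in isolation. This is an illustration only, not the contents of mod.rs: the module names `real` and `mock`, the single field, and the `selected_size` helper are assumptions made for the example. Recent compilers may warn that `tokio_unstable` is an unexpected cfg name when the snippet is built outside Tokio.

```rust
// Sketch only: `real` and `mock` stand in for worker.rs and mock.rs.
// Their contents are invented for illustration.
#[cfg(tokio_unstable)]
mod real {
    use std::sync::atomic::AtomicU64;

    /// Stand-in for the full metrics struct.
    #[derive(Debug, Default)]
    pub struct WorkerMetrics {
        pub busy_duration_total: AtomicU64,
    }
}

#[cfg(not(tokio_unstable))]
mod mock {
    /// Stand-in for the empty struct used in stable builds.
    #[derive(Debug, Default)]
    pub struct WorkerMetrics {}
}

// Exactly one of these imports is compiled, depending on the cfg flag.
#[cfg(tokio_unstable)]
use real::WorkerMetrics;
#[cfg(not(tokio_unstable))]
use mock::WorkerMetrics;

/// Size in bytes of whichever `WorkerMetrics` was selected.
fn selected_size() -> usize {
    std::mem::size_of::<WorkerMetrics>()
}

fn main() {
    println!("WorkerMetrics occupies {} bytes", selected_size());
}
```

Built without `--cfg tokio_unstable`, the alias resolves to the empty struct, which is why the stable build had no storage for `busy_duration_total` before this PR.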

it's not actually public, it's still behind config_unstable. (Just verified via the autogenerated docs)

Yeah, because HistogramScale is still behind cfg_unstable_metrics:

https://github.com/tokio-rs/tokio/pull/6899/files#diff-15ebfca6124144b3cffd9845fa3ce5596c342ce94d84dd406ee84f856db4c66cR23-R25

OK, now I believe I understand.

I would suggest that instead of compiling with the entire worker metrics in order to access only the busy_duration_total field, we should gate all the fields that we won't be using on cfg_unstable_metrics. Otherwise users will still be paying the price for metrics which they don't have access to - and that is something that we would really like to avoid.

Have a look at the runtime Builder for an example of how we have fields and implementations that touch them gated on a cfg flag:

tokio/tokio/src/runtime/builder.rs

Lines 125 to 126 in 512e9de

#[cfg(tokio_unstable)]
pub(super) unhandled_panic: UnhandledPanic,

As a general rule, if you need #[allow(dead_code)] then there is probably something that should be gated on a cfg flag instead.
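A minimal sketch of that rule, assuming a much smaller struct than the real WorkerMetrics: the field `park_count` and both accessor names are chosen for illustration, and plain atomics replace Tokio's internal metric types. The gated field and the method that touches it carry the same cfg, so a stable build neither stores the counter nor needs `#[allow(dead_code)]`.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::time::Duration;

/// Sketch of a metrics struct with one always-compiled field and one
/// field that exists only under `--cfg tokio_unstable`.
#[derive(Debug, Default)]
struct WorkerMetrics {
    /// Stable: total busy time in nanoseconds.
    busy_duration_total: AtomicU64,

    /// Unstable: compiled out entirely in stable builds.
    #[cfg(tokio_unstable)]
    park_count: AtomicU64,
}

impl WorkerMetrics {
    /// Stable accessor, always available.
    fn total_busy_duration(&self) -> Duration {
        Duration::from_nanos(self.busy_duration_total.load(Relaxed))
    }

    /// Code touching the gated field carries the same gate.
    #[cfg(tokio_unstable)]
    fn park_count(&self) -> u64 {
        self.park_count.load(Relaxed)
    }
}

fn main() {
    let metrics = WorkerMetrics::default();
    metrics.busy_duration_total.store(1_500, Relaxed);
    println!("busy: {:?}", metrics.total_busy_duration());
    println!("size: {} bytes", std::mem::size_of::<WorkerMetrics>());
}
```

In a stable build the struct in this sketch is a single `AtomicU64`, which is the "don't pay for metrics you can't read" property being asked for.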

Thanks @hds. I've reverted the changes to histogram.rs. The real Histogram, HistogramBuilder and HistogramBatch are now gated behind the unstable flag, and I've created mocked versions of them so that stable builds compile.

For WorkerMetrics, since we aim to stabilize more metrics, I'd suggest exposing all fields instead of stabilizing only busy_duration_total and putting the other fields behind unstable.

Let me know what you think

@Owen-CH-Leung the problem with stabilizing all of WorkerMetrics is that when tokio_unstable is not enabled, the runtime will pay the price for all those metrics, but there will be no way to access them. For this reason I think that it would be better to only stabilize what we're exposing.

Thanks @hds. I've hidden most of the fields of WorkerMetrics behind the unstable flag, except for busy_duration_total, queue_depth and thread_id. The latter two fields are needed for set_queue_depth and set_thread_id to work properly.

I've also enriched the mock MetricsBatch with a minimal implementation of batch::MetricsBatch so that the worker_total_busy_duration API functions properly in a stable build.

Let me know your thoughts!

tokio/src/runtime/metrics/runtime.rs

tokio/src/runtime/scheduler/mod.rs

tokio/src/runtime/scheduler/multi_thread/handle/metrics.rs

hds

Please bear with me as I haven't finished the review, but my first impression is that this PR is making more things public than is strictly necessary for the stabilization of worker_total_busy_duration and we should try to avoid that.

This is especially true since #6897 is also touching the histogram (although not taking it out of tokio_unstable) and that would be a breaking change if this PR is released first.



rcoh · 2024-10-21T14:02:28Z

tokio/src/runtime/scheduler/current_thread/mod.rs

@@ -533,7 +533,7 @@ impl Handle {
        self.shared.inject.len()
    }

-    #[allow(dead_code)]
+    // #[allow(dead_code)]


Suggested change

// #[allow(dead_code)]

rcoh · 2024-10-21T14:02:36Z

tokio/src/runtime/scheduler/multi_thread/handle/metrics.rs

@@ -18,7 +18,7 @@ impl Handle {
        self.shared.injection_queue_depth()
    }

-    #[allow(dead_code)]
+    // #[allow(dead_code)]


Suggested change

// #[allow(dead_code)]

hds · 2024-10-21T14:06:25Z

tokio/src/runtime/metrics/worker.rs

-            .as_ref()
-            .map(|histogram_builder| histogram_builder.build());
-        worker_metrics
+    cfg_unstable_metrics! {


It would be better if we grouped all the unstable functions together at the bottom of the impl block, instead of spreading them out.

hds · 2024-10-21T14:07:28Z

tokio/src/runtime/metrics/worker.rs

@@ -15,40 +18,60 @@ use std::thread::ThreadId;
 #[derive(Debug, Default)]
 #[repr(align(128))]
 pub(crate) struct WorkerMetrics {
+    #[cfg(tokio_unstable)]
+    #[cfg_attr(docsrs, doc(cfg(tokio_unstable)))]


Is this necessary? Since this isn't a public method, it won't appear in the documentation.

hds · 2024-10-21T14:19:49Z

tokio/src/runtime/metrics/mock.rs

    }

-    pub(crate) fn submit(&mut self, _to: &WorkerMetrics, _mean_poll_time: u64) {}
+    pub(crate) fn submit(&mut self, worker: &WorkerMetrics, _mean_poll_time: u64) {


Do I understand correctly that this function duplicates part of the submit function in batch::MetricsBatch?

I think this is a problematic way of gradually stabilizing metrics, as it opens the possibility of having diverging implementations if a change is made to the "real" MetricsBatch by someone who doesn't realise that there is another one.

This is additionally confusing because this effectively becomes the "stable" implementation, but it lives in a module called mock.

I would propose that we instead split the metrics::MetricsBatch implementation into stable (always compiles) and unstable (gated by cfg option), the same way we've done elsewhere in this PR. The same as with another comment, we would group all the unstable functions into a single cfg_unstable_metrics! block.

Indeed, splitting metrics::MetricsBatch is a much more viable way of stabilising. I've adopted your suggestion and split it into stable and unstable parts (and grouped the unstable functions into a single unstable block). Thanks a lot for reviewing!


hds

The implementation is looking good now. Just a few organizational things so that we do conditional compilation the same as the rest of the code base.

hds · 2024-11-13T13:52:53Z

tokio/src/runtime/metrics/batch.rs

        worker
            .busy_duration_total
            .store(self.busy_duration_total, Relaxed);


Let's move the stabilized items up to the top of the function.

hds · 2024-11-13T14:38:15Z

tokio/src/runtime/metrics/batch.rs

    pub(crate) fn unparked(&mut self) {
-        self.park_unpark_count += 1;
+        #[cfg(tokio_unstable)]
+        {
+            self.park_unpark_count += 1;
+        }
    }


In the rest of the Tokio code base, we do this a different way, so let's stick to that convention. Instead of gating functionality within a function, we have a separate empty function definition when the cfg flag isn't enabled. So this function would become:

cfg_unstable_metrics! {
    /// The worker was unparked.
    pub(crate) fn unparked(&mut self) {
        self.park_unpark_count += 1;
    }
}

cfg_not_unstable_metrics! {
    /// The worker was unparked.
    pub(crate) fn unparked(&mut self) {}
}

Please do the same here. Keep a single cfg_unstable_metrics block (and a single cfg_not_unstable_metrics block) for all the functions that require this behavior, so that they're grouped together.

For the more complex functions above that have a mix of stabilized and unstabilized implementation, split the unstabilized part out into a separate function with an impl in each of the macro blocks (see example in the comment on submit).
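For readers unfamiliar with the convention described above, here is a self-contained sketch of how such a macro pair can work. The two macro definitions are simplified stand-ins written for this example (Tokio's own definitions live in its internal macros module and may differ), and the struct is reduced to two counters.

```rust
// Simplified stand-ins for the internal macros: each one attaches a cfg
// attribute to every item passed to it.
macro_rules! cfg_unstable_metrics {
    ($($item:item)*) => {
        $( #[cfg(tokio_unstable)] $item )*
    };
}

macro_rules! cfg_not_unstable_metrics {
    ($($item:item)*) => {
        $( #[cfg(not(tokio_unstable))] $item )*
    };
}

#[derive(Debug, Default)]
struct MetricsBatch {
    /// Stable counter, always present.
    busy_duration_total: u64,

    /// Unstable counter, present only under `--cfg tokio_unstable`.
    #[cfg(tokio_unstable)]
    park_unpark_count: u64,
}

impl MetricsBatch {
    /// Stable method: compiled in every build.
    fn add_busy(&mut self, nanos: u64) {
        self.busy_duration_total += nanos;
    }

    cfg_unstable_metrics! {
        /// The worker was unparked.
        fn unparked(&mut self) {
            self.park_unpark_count += 1;
        }
    }

    cfg_not_unstable_metrics! {
        /// The worker was unparked (no-op in stable builds).
        fn unparked(&mut self) {}
    }
}

fn main() {
    let mut batch = MetricsBatch::default();
    batch.add_busy(250);
    batch.unparked(); // callers need no cfg of their own
    println!("{:?}", batch);
}
```

The point of the empty twin is that call sites stay identical in both builds; only the definition changes.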

hds · 2024-11-13T14:43:32Z

tokio/src/runtime/metrics/batch.rs

-            .store(self.steal_operations, Relaxed);
-        worker.poll_count.store(self.poll_count, Relaxed);
-
+    pub(crate) fn submit(&mut self, worker: &WorkerMetrics, _mean_poll_time: u64) {


As per the comment on unparked, let's follow the convention elsewhere in the Tokio codebase where we need different implementations for stabilized vs. unstabilized functionality. Here that would mean the following:

pub(crate) fn submit(&mut self, worker: &WorkerMetrics, mean_poll_time: u64) {
    worker
        .busy_duration_total
        .store(self.busy_duration_total, Relaxed);
    self.submit_unstable(worker, mean_poll_time);
}

cfg_not_unstable_metrics! {
    #[inline(always)]
    fn submit_unstable(&mut self, _worker: &WorkerMetrics, _mean_poll_time: u64) {}
}

cfg_unstable_metrics! {
    #[inline(always)]
    fn submit_unstable(&mut self, worker: &WorkerMetrics, mean_poll_time: u64) {
        worker.mean_poll_time.store(mean_poll_time, Relaxed);
        worker.park_count.store(self.park_count, Relaxed);
        worker
            .park_unpark_count
            .store(self.park_unpark_count, Relaxed);
        worker.noop_count.store(self.noop_count, Relaxed);
        worker.steal_count.store(self.steal_count, Relaxed);
        worker
            .steal_operations
            .store(self.steal_operations, Relaxed);
        worker.poll_count.store(self.poll_count, Relaxed);
        worker
            .local_schedule_count
            .store(self.local_schedule_count, Relaxed);
        worker.overflow_count.store(self.overflow_count, Relaxed);

        if let Some(poll_timer) = &self.poll_timer {
            let dst = worker.poll_count_histogram.as_ref().unwrap();
            poll_timer.poll_counts.submit(dst);
        }
    }
}

Use the same cfg_unstable_metrics and cfg_not_unstable_metrics blocks as for the other functions that need them (so there is only one of each in impl MetricsBatch).
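A reduced, runnable version of the split suggested above may make the shape easier to see. It is a sketch under stated assumptions: plain `#[cfg]` attributes are used in place of the macro blocks so the snippet stands alone, the `mean_poll_time` parameter is dropped, and the two structs keep only one stable and one unstable counter each.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Shared, atomically readable metrics (reduced to two counters).
#[derive(Debug, Default)]
struct WorkerMetrics {
    busy_duration_total: AtomicU64,
    #[cfg(tokio_unstable)]
    park_count: AtomicU64,
}

/// Worker-local batch that is periodically published.
#[derive(Debug, Default)]
struct MetricsBatch {
    busy_duration_total: u64,
    #[cfg(tokio_unstable)]
    park_count: u64,
}

impl MetricsBatch {
    /// Always compiled: publishes the stable metric, then delegates.
    fn submit(&mut self, worker: &WorkerMetrics) {
        worker
            .busy_duration_total
            .store(self.busy_duration_total, Relaxed);
        self.submit_unstable(worker);
    }

    /// Stable builds: an empty body that inlines away.
    #[cfg(not(tokio_unstable))]
    #[inline(always)]
    fn submit_unstable(&mut self, _worker: &WorkerMetrics) {}

    /// Unstable builds: publishes the gated counters.
    #[cfg(tokio_unstable)]
    #[inline(always)]
    fn submit_unstable(&mut self, worker: &WorkerMetrics) {
        worker.park_count.store(self.park_count, Relaxed);
    }
}

fn main() {
    let worker = WorkerMetrics::default();
    let mut batch = MetricsBatch { busy_duration_total: 42, ..Default::default() };
    batch.submit(&worker);
    println!("published busy nanos: {}", worker.busy_duration_total.load(Relaxed));
}
```

Keeping the stable store in the one always-compiled `submit` means there is a single implementation of it to maintain, which addresses the earlier concern about diverging copies.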

hds · 2024-11-13T14:44:23Z

tokio/src/runtime/metrics/batch.rs

 }

 impl MetricsBatch {
-    pub(crate) fn new(worker_metrics: &WorkerMetrics) -> MetricsBatch {
+    pub(crate) fn new(_worker_metrics: &WorkerMetrics) -> MetricsBatch {


Split this into 2 implementations each gated by cfg_(not_)unstable_metrics!.

hds · 2024-11-13T14:45:35Z

tokio/src/runtime/metrics/mod.rs

+    mod worker;
+    pub(crate) use worker::WorkerMetrics;


Please move the modules and imports that appear in both macro blocks out above them; we don't need to gate them at all.

hds · 2024-11-13T14:47:48Z

tokio/src/runtime/metrics/worker.rs

-            .as_ref()
-            .map(|histogram_builder| histogram_builder.build());
-        worker_metrics
+    cfg_not_unstable_metrics! {


Let's move this down to be right above the cfg_unstable_metrics! block so that we keep the conditionally compiled implementations together.

Owen-CH-Leung · 2024-11-23T03:40:08Z

@hds Thanks for your detailed review!

I've revised the PR to gate code with the cfg_(not_)unstable_metrics! macros, and also reorganized the code in mod.rs and worker.rs. Let me know if you have any other feedback.

rcoh

Overall this looks good to me. I think incremental stabilization is inherently annoying to do—I think it might be better if we invested in some primitives. I think we should also endeavor to standardize the way functions are made to be no-ops.

rcoh · 2025-01-03T20:04:59Z

tokio/src/runtime/metrics/batch.rs

    /// Number of tasks that were scheduled locally on this worker.
    local_schedule_count: u64,

+    #[cfg(tokio_unstable)]
    /// Number of tasks moved to the global queue to make space in the local


consider reorganizing to put the stable fields on the top to make it clearer

rcoh · 2025-01-03T20:08:50Z

tokio/src/runtime/metrics/batch.rs

+    cfg_not_unstable_metrics! {
+        /// Start polling an individual task
+        pub(crate) fn start_poll(&mut self) {}
+    }

-        if let Some(poll_timer) = &mut self.poll_timer {
-            poll_timer.poll_started_at = Instant::now();
+    cfg_unstable_metrics! {
+        /// Stop polling an individual task
+        pub(crate) fn end_poll(&mut self) {
+            #[cfg(tokio_unstable)]
+            if let Some(poll_timer) = &mut self.poll_timer {
+                let elapsed = duration_as_u64(poll_timer.poll_started_at.elapsed());
+                poll_timer.poll_counts.measure(elapsed, 1);
+            }
        }
    }


reading through this, I wonder if we should make a macro specifically for this pattern, something like:

cfg_metrics! { stable: { ... }, unstable: { ... } }

If we are going to have a lot of A/B code depending on whether metrics are stable or not, this could help avoid bugs.
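One way such a two-armed macro could look is sketched below. The name `cfg_metrics` and the `stable:`/`unstable:` arm syntax come from the suggestion above. The expansion is an assumption written for this example and is not necessarily what the PR ended up adding.

```rust
// Hypothetical expansion: items in the `stable` arm are compiled without
// `--cfg tokio_unstable`, items in the `unstable` arm are compiled with it.
macro_rules! cfg_metrics {
    (stable: { $($stable:item)* }, unstable: { $($unstable:item)* }) => {
        $( #[cfg(not(tokio_unstable))] $stable )*
        $( #[cfg(tokio_unstable)] $unstable )*
    };
}

#[derive(Debug, Default)]
struct MetricsBatch {
    /// Stable counter, always present.
    busy_duration_total: u64,

    /// Unstable counter, present only under `--cfg tokio_unstable`.
    #[cfg(tokio_unstable)]
    steal_count: u64,
}

impl MetricsBatch {
    cfg_metrics! {
        stable: {
            /// No-op twin used in stable builds.
            fn incr_steal_count(&mut self, _by: u16) {}
        },
        unstable: {
            /// Real implementation used in unstable builds.
            fn incr_steal_count(&mut self, by: u16) {
                self.steal_count += by as u64;
            }
        }
    }
}

fn main() {
    let mut batch = MetricsBatch::default();
    batch.incr_steal_count(3);
    println!("{:?}", batch);
}
```

Placing both variants of a function next to each other in one invocation makes it harder to update one signature and forget the other.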

rcoh · 2025-01-03T20:10:24Z

tokio/src/runtime/metrics/batch.rs

+            #[cfg(tokio_unstable)] {
+                self.steal_count += _by as u64;
+            }


I think we should probably pick a pattern—either cfg-ing the body of the function or the function entirely instead of using both

Thanks @rcoh. I've added a new macro and put all the A/B code into it. I've also reorganized the fields so that the stable fields come first. Could you please review again? Thanks!

rcoh · 2025-01-13T14:29:12Z

tokio/src/runtime/metrics/batch.rs

+                pub(crate) fn incr_steal_count(&mut self, _by: u16) {
+                    self.steal_count += _by as u64;
+                }


Suggested change

-    pub(crate) fn incr_steal_count(&mut self, _by: u16) {
-        self.steal_count += _by as u64;
-    }
+    pub(crate) fn incr_steal_count(&mut self, by: u16) {
+        self.steal_count += by as u64;
+    }

Thanks & revised

rcoh · 2025-01-13T14:30:38Z

tokio/src/runtime/metrics/worker.rs

@@ -15,40 +18,50 @@ use std::thread::ThreadId;
 #[derive(Debug, Default)]
 #[repr(align(128))]
 pub(crate) struct WorkerMetrics {


same—please move stable fields to top

Thanks. Moved stable fields to top

rcoh · 2025-01-13T14:31:21Z

tokio/src/runtime/metrics/worker.rs

    /// Number of tasks currently in the local queue. Used only by the
    /// current-thread scheduler.
    pub(crate) queue_depth: MetricAtomicUsize,


is this intentionally stabilized?

I believe queue_depth is already stabilised on master and is used by the current-thread scheduler:

https://github.com/tokio-rs/tokio/blob/master/tokio/src/runtime/scheduler/current_thread/mod.rs#L341


stabilize worker total busy duration, bring WorkerMetrics, MetricsBat…

8f1fcb4

…ch and Histogram out of unstable flag

github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Oct 11, 2024

Fix rustfmt ci job

9b47cf9

Darksonn requested a review from hds October 11, 2024 07:45

Darksonn added A-tokio Area: The main tokio crate M-metrics Module: tokio/runtime/metrics labels Oct 11, 2024

Owen-CH-Leung added 2 commits October 11, 2024 15:52

Fix various failing CI jobs by adding cfg(target_has_atomic = "64") t…

e2f4f33

…o WorkerMetrics

Use cfg_64bit_metrics instead

b6974ba

rcoh reviewed Oct 11, 2024

Owen-CH-Leung changed the title ~~stabilize worker total busy duration, bring WorkerMetrics, MetricsBat…~~ stabilize worker_total_busy_duration Oct 11, 2024

Fix formatting and remove brackets

86f019f

Owen-CH-Leung marked this pull request as ready for review October 11, 2024 15:33

hds requested changes Oct 12, 2024

Owen-CH-Leung added 2 commits October 16, 2024 22:34

Creat Mock Histogram, HistogramBatch and HistogramBuilder. Revert cha…

64f626d

…nges on histogram.rs and guard it behind unstable flag

Hide queue_depth and thread_id behind unstable flag

489003c

Darksonn mentioned this pull request Oct 16, 2024

Flaky test: single_ack_eliciting_packet_triggers_ack_after_delay quinn-rs/quinn#2014

Open

Owen-CH-Leung added 5 commits October 19, 2024 12:00

Mark most fields of WorkerMetrics as unstable, except for busy_durati…

8a134a2

…on_total, queue_depth and thread_id

Merge branch 'master' into stabilize_worker_total_busy_duration

4bc00cf

Remove allow dead_code, merge master & fix spellcheck

7eb6b97

Merge branch 'master' into stabilize_worker_total_busy_duration

7239af5

Add back worker_total_busy_duration test

57f6b9b

rcoh reviewed Oct 21, 2024

hds requested changes Oct 21, 2024

Remove mock metricBatch, split MetricBatch implementation into stable…

c8b2c7d

… & unstable

hds requested changes Nov 13, 2024

Owen-CH-Leung added 2 commits November 23, 2024 10:07

Merge branch 'master' into stabilize_worker_total_busy_duration

d7333cf

Gate code based on cfg_unstable_metrics and cfg_not_unstable_metrics

14543de

Merge branch 'master' into stabilize_worker_total_busy_duration

245d75c

Owen-CH-Leung requested review from hds and rcoh December 30, 2024 17:17

rcoh approved these changes Jan 3, 2025

Owen-CH-Leung added 2 commits January 11, 2025 11:26

Merge branch 'master' into stabilize_worker_total_busy_duration

e0d0b84

add new macro cfg_metrics_variant, refactor stable/unstable code

b821f92

rcoh reviewed Jan 13, 2025

Owen-CH-Leung added 2 commits January 14, 2025 00:07

refactor field ordering, add more stable/unstable code using cfg_metr…

64f029a

…ics_variant

Merge branch 'master' into stabilize_worker_total_busy_duration

73dfd89

Owen-CH-Leung requested a review from rcoh February 2, 2025 07:05
