Add `TimerHeapMetrics` #4187

iRevive · 2024-11-27T08:52:53Z

I'm unfamiliar with the TimerHeap internals, so perhaps the direction needs to be corrected.

armanbilge

thanks so much for starting this follow-up :)

armanbilge · 2024-11-27T19:38:32Z

core/jvm/src/main/scala/cats/effect/unsafe/metrics/WorkStealingPoolMetrics.scala

+    /**
+     * Returns the next due to fire.
+     */
+    def nextTimerDue(): Option[Long]


We should figure out what this Long represents. Currently it is an absolute time in nano seconds relative to the (arbitrary) nanoTime offset. I am not sure how useful that is.

Would it make sense to convert to a unix epoch time? I think we'd have to give up some precision.

Or, we could return the number of nanoseconds until the next timer fires — so a relative time to the present, not absolute.

the number of nanoseconds until the next timer fires — so a relative time to the present, not absolute.

This option looks more versatile and suitable.

armanbilge · 2024-11-27T19:47:28Z

core/jvm/src/main/scala/cats/effect/unsafe/TimerHeap.scala

+  def nextTimerDue(): Option[Long] =
+    heap
+      .lift(1)
+      .filter(t => !t.isDeleted() && t.triggerTime != Long.MinValue)
+      .map(_.triggerTime)


Once we decide on the API I can help with the implementation. I assume that this method may be called from non-owning threads, in which case we need to write the implementation defensively against race conditions.

I also think we should just return the time at the root of the heap, even if it's deleted/canceled. Returning None is confusing when there are more active timers besides the root. Besides, the timer was active at some point, deletion/cancelation is a race condition against metrics retrieval, and it's still a lower bound on the actual next timer due. So I think it's still relevant statistically speaking.

iRevive · 2024-11-29T11:32:46Z

From the OpenTelemetry perspective, I can think of the following metrics:

timerheap.scheduled.count - total number of timers that have been scheduled (enqueued)
timerheap.executed.count - total number of executed timers
timerheap.cancelled.count - total number of canceled timers - canceledCounter.get() or removedCanceledCounter
timerheap.outstanding.count - current number of pending timers - size
timerheap.next.execution.time - the due of the next timer - nextTimerDue

We may need to introduce new counters to keep track of scheduled and executed timers.

armanbilge · 2024-12-10T19:10:47Z

@iRevive I just pushed changes to TimerHeap to expose the metrics you listed, passing the baton back to you 😁

armanbilge · 2024-12-10T19:13:01Z

core/jvm/src/main/scala/cats/effect/unsafe/TimerHeap.scala

+  /**
+   * Returns the next due to fire.
+   */
+  def nextTimerDue(): Option[Long] = {
+    val root = heap(1)
+    if (root ne null) {
+      val now = System.nanoTime()
+      val when = root.triggerTime - now
+      Some(when)
+    } else None
+  }


Caveats:

May return a negative number if the next timer is overdue to fire.

May overflow if the next timer is too far in the future.

Thoughts?

May return a negative number if the next timer is overdue to fire.

It's ok, I guess.

If the negative number prevails over time (each time we collect metrics), it means that timers are under heavy pressure, right?

May overflow if the next timer is too far in the future.

We can use math.multiplyExact and return None in the case of overflow. But it's not ideal either, I think.

If the negative number prevails over time (each time we collect metrics), it means that timers are under heavy pressure, right?

Well, not necessarily. Assuming that we are seeing the latest data, it could indicate the worker thread is overwhelmed by (long-running) tasks and not able to check/trigger timers frequently enough. It would be measuring essentially the same thing as the starvation checker.

However, TimerHeap is not a thread-safe data structure so it's also possible that we might be seeing stale data i.e. a timer that was already triggered on the owner thread, but the changes to the heap were not yet published to the thread that is querying metrics.

We can use math.multiplyExact and return None in the case of overflow. But it's not ideal either, I think.

Actually we would need subtractExact. But I agree returning None is not ideal, but on the other hand this situation is pretty unusual. Basically it means the next timer is scheduled 292 years in the future or something like that. That's very nearly like having no timer scheduled I suppose ...

I think we can ignore the overflow case. :D Anyone scheduling timers that far in the future deserves whatever they get from the metrics. Also I'm pretty sure the implementation itself would probably handle it incorrectly anyway since, at some point prior-or-equal to the relative overflow value, it's going to overflow the comparison math and start looking like an expired timer rather than a far-future one. So you basically can't have a functional timer that far in the future.

Also the negative number feels completely fine to me. It's actually an interesting signal in and of itself (as you two discussed), modulo the publication variability Arman already noted.

Edit: Oh this was discussed more below. Ignore me tyty

iRevive · 2024-12-12T10:22:13Z

@armanbilge I added the discussed methods to the TimerHeapMetrics but didn't touch the overflow case.

Should we keep it as is, perhaps? And change the calculation logic if users complain?

armanbilge · 2024-12-13T19:55:06Z

but didn't touch the overflow case.

Actually, I think it is okay and will not overflow in practice. This is because there is already some special handling to prevent overflows when doing comparisons within the heap. After thinking about it more, I realize that it will also prevent overflows when computing this metric.

cats-effect/core/jvm/src/main/scala/cats/effect/unsafe/TimerHeap.scala

Lines 394 to 402 in 9ce05f2

    
             /** 
        
              * Computes the trigger time in an overflow-safe manner. The trigger time is essentially `now 
        
              * + delay`. However, we must constrain all trigger times in the heap to be within 
        
              * `Long.MaxValue` of each other (otherwise there will be overflow when comparing in `cpr`). 
        
              * Thus, if `delay` is so big, we'll reduce it to the greatest allowable (in `overflowFree`). 
        
              * 
        
              * From the public domain JSR-166 `ScheduledThreadPoolExecutor` (`triggerTime` method). 
        
              */ 
        
             private[this] def computeTriggerTime(now: Long, delay: Long): Long = {

armanbilge added this to the v3.6.0 milestone Nov 27, 2024

armanbilge reviewed Nov 27, 2024

View reviewed changes

iRevive mentioned this pull request Dec 2, 2024

Add IORuntimeMetrics typelevel/otel4s#861

Merged

armanbilge reviewed Dec 10, 2024

View reviewed changes

iRevive and others added 3 commits December 11, 2024 17:39

Add TimerHeapMetrics

c7acc74

Track total counts in TimerHeap

d57c72d

Add TimerHeapMetrics

5048008

iRevive force-pushed the timer-heap-metrics branch from 18dd0df to 5048008 Compare December 11, 2024 15:43

iRevive marked this pull request as ready for review December 11, 2024 15:44

armanbilge requested a review from djspiewak December 18, 2024 21:00

djspiewak approved these changes Dec 19, 2024

View reviewed changes

djspiewak merged commit aa5fbbd into typelevel:series/3.x Dec 19, 2024
29 of 33 checks passed

armanbilge added the 🍄 enhancement label Dec 20, 2024

iRevive mentioned this pull request Dec 22, 2024

Make runtime metrics publicly available #3270

Closed

armanbilge mentioned this pull request Dec 23, 2024

Register new metrics as MBeans #4213

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `TimerHeapMetrics` #4187

Add `TimerHeapMetrics` #4187

iRevive commented Nov 27, 2024 •

edited

Loading

armanbilge left a comment

armanbilge Nov 27, 2024

iRevive Nov 29, 2024

armanbilge Nov 27, 2024

iRevive commented Nov 29, 2024 •

edited

Loading

armanbilge commented Dec 10, 2024

armanbilge Dec 10, 2024

iRevive Dec 10, 2024

armanbilge Dec 10, 2024 •

edited

Loading

djspiewak Dec 18, 2024 •

edited

Loading

iRevive commented Dec 12, 2024 •

edited

Loading

armanbilge commented Dec 13, 2024

Add TimerHeapMetrics #4187

Add TimerHeapMetrics #4187

Conversation

iRevive commented Nov 27, 2024 • edited Loading

armanbilge left a comment

Choose a reason for hiding this comment

armanbilge Nov 27, 2024

Choose a reason for hiding this comment

iRevive Nov 29, 2024

Choose a reason for hiding this comment

armanbilge Nov 27, 2024

Choose a reason for hiding this comment

iRevive commented Nov 29, 2024 • edited Loading

armanbilge commented Dec 10, 2024

armanbilge Dec 10, 2024

Choose a reason for hiding this comment

iRevive Dec 10, 2024

Choose a reason for hiding this comment

armanbilge Dec 10, 2024 • edited Loading

Choose a reason for hiding this comment

djspiewak Dec 18, 2024 • edited Loading

Choose a reason for hiding this comment

iRevive commented Dec 12, 2024 • edited Loading

armanbilge commented Dec 13, 2024

Add `TimerHeapMetrics` #4187

Add `TimerHeapMetrics` #4187

iRevive commented Nov 27, 2024 •

edited

Loading

iRevive commented Nov 29, 2024 •

edited

Loading

armanbilge Dec 10, 2024 •

edited

Loading

djspiewak Dec 18, 2024 •

edited

Loading

iRevive commented Dec 12, 2024 •

edited

Loading