Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TimerHeapMetrics #4187

Merged
merged 3 commits into from
Dec 19, 2024
Merged

Conversation

iRevive
Copy link
Contributor

@iRevive iRevive commented Nov 27, 2024

A follow-up for #3317 (comment).

I'm unfamiliar with the TimerHeap internals, so perhaps the direction needs to be corrected.

@armanbilge armanbilge added this to the v3.6.0 milestone Nov 27, 2024
Copy link
Member

@armanbilge armanbilge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks so much for starting this follow-up :)

/**
* Returns the next due to fire.
*/
def nextTimerDue(): Option[Long]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should figure out what this Long represents. Currently it is an absolute time in nano seconds relative to the (arbitrary) nanoTime offset. I am not sure how useful that is.

  • Would it make sense to convert to a unix epoch time? I think we'd have to give up some precision.
  • Or, we could return the number of nanoseconds until the next timer fires — so a relative time to the present, not absolute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the number of nanoseconds until the next timer fires — so a relative time to the present, not absolute.

This option looks more versatile and suitable.

Comment on lines 252 to 256
def nextTimerDue(): Option[Long] =
heap
.lift(1)
.filter(t => !t.isDeleted() && t.triggerTime != Long.MinValue)
.map(_.triggerTime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we decide on the API I can help with the implementation. I assume that this method may be called from non-owning threads, in which case we need to write the implementation defensively against race conditions.

I also think we should just return the time at the root of the heap, even if it's deleted/canceled. Returning None is confusing when there are more active timers besides the root. Besides, the timer was active at some point, deletion/cancelation is a race condition against metrics retrieval, and it's still a lower bound on the actual next timer due. So I think it's still relevant statistically speaking.

@iRevive
Copy link
Contributor Author

iRevive commented Nov 29, 2024

From the OpenTelemetry perspective, I can think of the following metrics:

  • timerheap.scheduled.count - total number of timers that have been scheduled (enqueued)
  • timerheap.executed.count - total number of executed timers
  • timerheap.cancelled.count - total number of canceled timers - canceledCounter.get() or removedCanceledCounter
  • timerheap.outstanding.count - current number of pending timers - size
  • timerheap.next.execution.time - the due of the next timer - nextTimerDue

We may need to introduce new counters to keep track of scheduled and executed timers.

@armanbilge
Copy link
Member

@iRevive I just pushed changes to TimerHeap to expose the metrics you listed, passing the baton back to you 😁

Comment on lines +286 to +296
/**
* Returns the next due to fire.
*/
def nextTimerDue(): Option[Long] = {
val root = heap(1)
if (root ne null) {
val now = System.nanoTime()
val when = root.triggerTime - now
Some(when)
} else None
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveats:

  1. May return a negative number if the next timer is overdue to fire.
  2. May overflow if the next timer is too far in the future.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May return a negative number if the next timer is overdue to fire.

It's ok, I guess.

If the negative number prevails over time (each time we collect metrics), it means that timers are under heavy pressure, right?

May overflow if the next timer is too far in the future.

We can use math.multiplyExact and return None in the case of overflow. But it's not ideal either, I think.

Copy link
Member

@armanbilge armanbilge Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the negative number prevails over time (each time we collect metrics), it means that timers are under heavy pressure, right?

Well, not necessarily. Assuming that we are seeing the latest data, it could indicate the worker thread is overwhelmed by (long-running) tasks and not able to check/trigger timers frequently enough. It would be measuring essentially the same thing as the starvation checker.

However, TimerHeap is not a thread-safe data structure so it's also possible that we might be seeing stale data i.e. a timer that was already triggered on the owner thread, but the changes to the heap were not yet published to the thread that is querying metrics.


We can use math.multiplyExact and return None in the case of overflow. But it's not ideal either, I think.

Actually we would need subtractExact. But I agree returning None is not ideal, but on the other hand this situation is pretty unusual. Basically it means the next timer is scheduled 292 years in the future or something like that. That's very nearly like having no timer scheduled I suppose ...

Copy link
Member

@djspiewak djspiewak Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can ignore the overflow case. :D Anyone scheduling timers that far in the future deserves whatever they get from the metrics. Also I'm pretty sure the implementation itself would probably handle it incorrectly anyway since, at some point prior-or-equal to the relative overflow value, it's going to overflow the comparison math and start looking like an expired timer rather than a far-future one. So you basically can't have a functional timer that far in the future.

Also the negative number feels completely fine to me. It's actually an interesting signal in and of itself (as you two discussed), modulo the publication variability Arman already noted.

Edit: Oh this was discussed more below. Ignore me tyty

@iRevive iRevive marked this pull request as ready for review December 11, 2024 15:44
@iRevive
Copy link
Contributor Author

iRevive commented Dec 12, 2024

@armanbilge I added the discussed methods to the TimerHeapMetrics but didn't touch the overflow case.

Should we keep it as is, perhaps? And change the calculation logic if users complain?

@armanbilge
Copy link
Member

but didn't touch the overflow case.

Actually, I think it is okay and will not overflow in practice. This is because there is already some special handling to prevent overflows when doing comparisons within the heap. After thinking about it more, I realize that it will also prevent overflows when computing this metric.

/**
* Computes the trigger time in an overflow-safe manner. The trigger time is essentially `now
* + delay`. However, we must constrain all trigger times in the heap to be within
* `Long.MaxValue` of each other (otherwise there will be overflow when comparing in `cpr`).
* Thus, if `delay` is so big, we'll reduce it to the greatest allowable (in `overflowFree`).
*
* From the public domain JSR-166 `ScheduledThreadPoolExecutor` (`triggerTime` method).
*/
private[this] def computeTriggerTime(now: Long, delay: Long): Long = {

@armanbilge armanbilge requested a review from djspiewak December 18, 2024 21:00
@djspiewak djspiewak merged commit aa5fbbd into typelevel:series/3.x Dec 19, 2024
29 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants