-
Notifications
You must be signed in to change notification settings - Fork 530
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TimerHeapMetrics
#4187
Add TimerHeapMetrics
#4187
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks so much for starting this follow-up :)
/** | ||
* Returns the next due to fire. | ||
*/ | ||
def nextTimerDue(): Option[Long] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should figure out what this Long
represents. Currently it is an absolute time in nano seconds relative to the (arbitrary) nanoTime
offset. I am not sure how useful that is.
- Would it make sense to convert to a unix epoch time? I think we'd have to give up some precision.
- Or, we could return the number of nanoseconds until the next timer fires — so a relative time to the present, not absolute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the number of nanoseconds until the next timer fires — so a relative time to the present, not absolute.
This option looks more versatile and suitable.
def nextTimerDue(): Option[Long] = | ||
heap | ||
.lift(1) | ||
.filter(t => !t.isDeleted() && t.triggerTime != Long.MinValue) | ||
.map(_.triggerTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we decide on the API I can help with the implementation. I assume that this method may be called from non-owning threads, in which case we need to write the implementation defensively against race conditions.
I also think we should just return the time at the root of the heap, even if it's deleted/canceled. Returning None
is confusing when there are more active timers besides the root. Besides, the timer was active at some point, deletion/cancelation is a race condition against metrics retrieval, and it's still a lower bound on the actual next timer due. So I think it's still relevant statistically speaking.
From the OpenTelemetry perspective, I can think of the following metrics:
We may need to introduce new counters to keep track of |
@iRevive I just pushed changes to |
/** | ||
* Returns the next due to fire. | ||
*/ | ||
def nextTimerDue(): Option[Long] = { | ||
val root = heap(1) | ||
if (root ne null) { | ||
val now = System.nanoTime() | ||
val when = root.triggerTime - now | ||
Some(when) | ||
} else None | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caveats:
- May return a negative number if the next timer is overdue to fire.
- May overflow if the next timer is too far in the future.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May return a negative number if the next timer is overdue to fire.
It's ok, I guess.
If the negative number prevails over time (each time we collect metrics), it means that timers are under heavy pressure, right?
May overflow if the next timer is too far in the future.
We can use math.multiplyExact
and return None
in the case of overflow. But it's not ideal either, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the negative number prevails over time (each time we collect metrics), it means that timers are under heavy pressure, right?
Well, not necessarily. Assuming that we are seeing the latest data, it could indicate the worker thread is overwhelmed by (long-running) tasks and not able to check/trigger timers frequently enough. It would be measuring essentially the same thing as the starvation checker.
However, TimerHeap
is not a thread-safe data structure so it's also possible that we might be seeing stale data i.e. a timer that was already triggered on the owner thread, but the changes to the heap were not yet published to the thread that is querying metrics.
We can use
math.multiplyExact
and returnNone
in the case of overflow. But it's not ideal either, I think.
Actually we would need subtractExact
. But I agree returning None
is not ideal, but on the other hand this situation is pretty unusual. Basically it means the next timer is scheduled 292 years in the future or something like that. That's very nearly like having no timer scheduled I suppose ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can ignore the overflow case. :D Anyone scheduling timers that far in the future deserves whatever they get from the metrics. Also I'm pretty sure the implementation itself would probably handle it incorrectly anyway since, at some point prior-or-equal to the relative overflow value, it's going to overflow the comparison math and start looking like an expired timer rather than a far-future one. So you basically can't have a functional timer that far in the future.
Also the negative number feels completely fine to me. It's actually an interesting signal in and of itself (as you two discussed), modulo the publication variability Arman already noted.
Edit: Oh this was discussed more below. Ignore me tyty
18dd0df
to
5048008
Compare
@armanbilge I added the discussed methods to the Should we keep it as is, perhaps? And change the calculation logic if users complain? |
Actually, I think it is okay and will not overflow in practice. This is because there is already some special handling to prevent overflows when doing comparisons within the heap. After thinking about it more, I realize that it will also prevent overflows when computing this metric. cats-effect/core/jvm/src/main/scala/cats/effect/unsafe/TimerHeap.scala Lines 394 to 402 in 9ce05f2
|
A follow-up for #3317 (comment).
I'm unfamiliar with the
TimerHeap
internals, so perhaps the direction needs to be corrected.