-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ThreadRegistry#register behavior to ensure correct Prom metrics #4300
Fix ThreadRegistry#register behavior to ensure correct Prom metrics #4300
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great Catch
This is a bad bug affecting also 4.16. we will discuss cutting an hotfix release also for that branch
stats/bookkeeper-stats-api/src/main/java/org/apache/bookkeeper/stats/ThreadRegistry.java
Outdated
Show resolved
Hide resolved
Notably, this PR is particularly relevant after 4.16 because See apache/pulsar#21766, #4179, #3837 |
Great catch @michaeljmarshall |
checkstyle error
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice Catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 Great work
in order to let the tests pass I suggest to remove the assertions and simply use "log.error". In production you will see ERRORs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michaeljmarshall tests are failing because most of them spin up N bookies in the same JVM, and since ThreadRegistry is static they overlap each other.
I suggest we replace the assertion with a log.error
@eolivelli @nicoloboschi Agreed, I think we can have an issue to track subsequent refactorings to non-static. |
...rc/main/java/org/apache/bookkeeper/stats/prometheus/ThreadScopedDataSketchesStatsLogger.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I added two minor suggestions
...vider/src/main/java/org/apache/bookkeeper/stats/prometheus/ThreadScopedLongAdderCounter.java
Outdated
Show resolved
Hide resolved
9a60d93
to
1ac6fc1
Compare
Hi @michaeljmarshall Could you add a test to protect this change? |
…pache#4300) * Make tests fail * Fix ThreadRegistry#register to ensure correct Prom metrics * Change style to match BK standards * Fix tests --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com> --------- (cherry picked from commit 1d82b92) Conflicts: bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/BookieImpl.java bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/SyncThread.java
…pache#4300) * Make tests fail * Fix ThreadRegistry#register to ensure correct Prom metrics * Change style to match BK standards * Fix tests --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
Descriptions of the changes in this PR:
There is an issue in the
ThreadRegistry
that causes incomplete Prometheus metrics reporting. It is trivial to reproduce:bookkeeper_server_ADD_ENTRY_count
does not increment at a similar rate to the journal's add entry rate.The bug can be understood abstractly by considering the following:
threadPoolMap
inThreadRegistry
goes from thread id toThreadPoolThread
.ThreadPoolThread
is defined by the thread pool and the configured id.0
.ThreadRegistry.register(journalThreadName, 0);
method multiple times.threadPoolMap
ends up with multiple identical values for different keys.ThreadScopedDataSketchesStatsLogger#getStatsLogger
stores the identical values in theprovider.opStats
map and ultimately overwrites map values, which leads to an under-reporting of metrics.Motivation
Correct bookkeeper metrics reporting by ensuring that threads are properly registered in the
ThreadRegistry
.Changes
I replaced the static map in
ThreadRegistry#register
with a staticThreadLocal
instance. While the original work in #2839, which was part of #2835, specifically indicated it did not want to use thread local storage (TLS) due to its complexity, I think it is worth it because of the added flexibility. Specifically, it is much simpler to handle cleanup of unnecessary metadata by using thread locals. Note that the thread id, previously used as the key in the static map, is not unique to the thread after the thread goes away, which poses problems if any threads recycle. While BK doesn't generally recycle these threads, the test framework does, so it helps to make tests pass while still being able to make useful assertions about state.An interesting design detail that I noticed and did not change is documented in the ThreadRegistry. Specifically, it is that threads that go away will continue to report metrics. I think this is fine, but it's worth calling out.