Fix VirtualThread pinning with RealSpan and zipkin2.reporter by switching from synchronized to ReentrantLock #1427
@codefromthecrypt please correct me if I am doing something silly here, but it looks like operations over `pendingSpans` do not depend on `state` and could be synchronized over a separate lock.
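A rough sketch of what this comment proposes (all names here are stand-ins for illustration, not Brave's real `PendingSpans`/`MutableSpan` internals): guard the pending-spans map with its own `ReentrantLock`, separate from whatever guards `state`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the proposal: a dedicated ReentrantLock for the
// pendingSpans map, independent of the monitor that guards `state`.
// Types are illustrative stand-ins, not Brave's actual classes.
public class SeparateLockSketch {
  private final ReentrantLock pendingSpansLock = new ReentrantLock();
  private final Map<Long, String> pendingSpans = new HashMap<>();

  String getOrCreate(long traceId) {
    pendingSpansLock.lock(); // blocking here does not pin a virtual thread
    try {
      return pendingSpans.computeIfAbsent(traceId, id -> "span-" + id);
    } finally {
      pendingSpansLock.unlock();
    }
  }

  public static void main(String[] args) {
    SeparateLockSketch spans = new SeparateLockSketch();
    System.out.println(spans.getOrCreate(1L)); // prints span-1
  }
}
```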
---
`state` is a `MutableSpan` looked up by trace context. The problem with having to remove all `synchronized` is that we now have a much more complex situation. I am assuming it is really required, and I still have a hard time believing it is.

Before, the object was its own lock, guarded by itself. Now we are externalizing it, even though the object (`MutableSpan`) can be mutated via various callers.

So, I think we need to focus on what the suggested pattern is for migrating from `synchronized` to `ReentrantLock`, given that with `synchronized` we can use the object itself as the lock. Be careful not to expose API surface when thinking this through!

One way is to make sure that any locked access goes through `PendingSpan` (internal) and to move the lock there. At least there, it has a 1:1 relationship with what it is guarding (a mutable span).
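A minimal sketch of that last suggestion, under the assumption that all mutation can be funneled through the internal holder (class and method names here are hypothetical, not Brave's actual internals): the lock lives inside the `PendingSpan`-like holder, stays 1:1 with the `MutableSpan` it guards, and never escapes into public API.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical stand-in for Brave's MutableSpan: a plain mutable object.
final class MutableSpanStub {
  String name;
}

// Internal holder: the lock is 1:1 with the state it guards and never
// escapes this class, so no locking API surface is exposed to callers.
final class PendingSpanStub {
  private final ReentrantLock lock = new ReentrantLock();
  private final MutableSpanStub state = new MutableSpanStub();

  // All mutation funnels through here, replacing synchronized (state) { ... }.
  void withSpan(Consumer<MutableSpanStub> mutation) {
    lock.lock(); // unlike a monitor, blocking here does not pin a virtual thread
    try {
      mutation.accept(state);
    } finally {
      lock.unlock();
    }
  }

  String name() {
    lock.lock();
    try {
      return state.name;
    } finally {
      lock.unlock();
    }
  }
}

public class PendingSpanLockSketch {
  public static void main(String[] args) {
    PendingSpanStub pending = new PendingSpanStub();
    pending.withSpan(s -> s.name = "get /users");
    System.out.println(pending.name()); // prints get /users
  }
}
```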
---
Got it, thanks @codefromthecrypt. To elaborate a bit on the issue: `synchronized` over `state` is fine and a non-issue with virtual threads (it is super short, and I didn't change it except here), but the `pendingSpans.xxx` methods have unpredictable latency and may hold the `synchronized` monitor for a long time, which is the issue.

That, I think, is one of the problems: we cannot reliably guard `MutableSpan` unless all accesses to it are guarded as well (not necessarily internally). But thanks a lot for highlighting the `context` <-> `MutableSpan` link, I will look further.

Sadly, `ReentrantLock` does not help us much here, since a brute-force replacement of `synchronized(state)` would be equivalent to something like `state.lock.lock(); ...` - exactly the problem you mentioned: the API surface will be harmed.
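For context on why `synchronized` is the concern at all: on JDKs before the fix in JEP 491, a virtual thread that blocks while holding a monitor pins its carrier thread. A minimal standalone repro (nothing Brave-specific; requires JDK 21+):

```java
// Run with -Djdk.tracePinnedThreads=full to see the pinning report
// that motivated this PR. Requires JDK 21 or newer.
public class PinningDemo {
  static final Object LOCK = new Object();

  public static void main(String[] args) throws InterruptedException {
    Thread vt = Thread.ofVirtual().start(() -> {
      synchronized (LOCK) {   // monitor held...
        try {
          Thread.sleep(100);  // ...while blocking: pins the carrier thread
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    vt.join();
    System.out.println("done");
  }
}
```

The same sleep inside a `ReentrantLock`-guarded region would let the virtual thread unmount instead of pinning its carrier.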
---
Why don't we solve for the issue, then, and leave the rest alone?

So, the only reason `state` is held for a long time (microseconds) is when it is being sized for encoding in the async reporter, which is in another project, or when someone is doing a large amount of work tagging (a bug).

One option could be that, when a configuration is present, we use copy mode, which will be faster than microseconds (faster than sizing the span, as you don't need to check the encoded size of strings, etc).

Also, the reported stack trace is inherently inefficient, as it copies the span from brave's type to zipkin's type first (`ZipkinSpanHandler`). There's been a brave-native encoder out for years, but for some reason it wasn't queued for micrometer until I personally added it in 1.3 (https://github.com/micrometer-metrics/tracing/releases/tag/v1.3.0-M2).

Finally, I really think we owe it to ourselves to define what expensive blocking is, so we can measure it. We are assuming this is expensive just because some message ended up in the log of a suboptimal setup.
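On the last point, a crude way to turn "expensive blocking" into a number rather than a log message (purely illustrative, not Brave code): time how long the critical section actually holds its lock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: measure how long a critical section holds its lock,
// so "expensive" becomes a measured duration instead of an assumption.
public class HoldTimeSketch {
  private static final ReentrantLock LOCK = new ReentrantLock();

  // Returns roughly how long the lock was held, in nanoseconds.
  static long timedCriticalSection(Runnable work) {
    LOCK.lock();
    long start = System.nanoTime();
    try {
      work.run();
    } finally {
      LOCK.unlock();
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) {
    long heldNanos = timedCriticalSection(() -> {
      try {
        Thread.sleep(5); // simulated slow work inside the lock
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    System.out.println("held ~" + heldNanos / 1_000_000 + " ms");
  }
}
```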
---
Tracing in micrometer is very slow due to all the frameworks used, to the point where the overhead versus other options is milliseconds worse, not microseconds. Staring at this issue as if it holistically changes things is highly suspicious and a questionable reason to introduce a lot of risk. I think the most narrow solution to the problem, including muting log messages, is a decent thing to consider.

e.g. measure this and you'll see hello world requests with milliseconds of extra overhead: https://github.com/openzipkin/brave-example/tree/master/webflux6-micrometer