Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix VirtualThread pinning with RealSpan and zipkin2.reporter by switching from synchronized to ReentrantLock #1427

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 20 additions & 4 deletions brave/src/main/java/brave/RealSpan.java
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright 2013-2020 The OpenZipkin Authors
* Copyright 2013-2024 The OpenZipkin Authors
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
* in compliance with the License. You may obtain a copy of the License at
Expand All @@ -13,12 +13,15 @@
*/
package brave;

import java.util.concurrent.locks.ReentrantLock;

import brave.handler.MutableSpan;
import brave.internal.recorder.PendingSpans;
import brave.propagation.TraceContext;

/** This wraps the public api and guards access to a mutable span. */
final class RealSpan extends Span {
final ReentrantLock lock = new ReentrantLock();
final TraceContext context;
final PendingSpans pendingSpans;
final MutableSpan state;
Expand Down Expand Up @@ -139,17 +142,30 @@ final class RealSpan extends Span {
}

@Override public void finish(long timestamp) {
synchronized (state) {
lock.lock();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codefromthecrypt please correct me if I am doing something silly here, but it looks like operations over pendingSpans do not depend on state and could be synchronized over a separate lock.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state is a MutableSpan looked up by tracecontext. The problem with us having to remove all synchronized is now we have a much more complex sitation. I mean I am assuming it is really very required and still have a hard time believing it is.

Before the object was its own lock, guarded by itself. Now we are externalizing it even though the object (MutableSpan) can be mutated via various callers.

So, I think we need to focus on what is the pattern suggested for migrating to ReentrantLock from synchronized, when in synchronized we can use the object itself as the lock. Be careful to not expose API surface when thinking this through!

One way is to make sure that any locked access is using PendingSpan (internal) and moving the lock there. At least there, it has 1-1 relationship to what it is guarding (a mutable span)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way is to make sure that any locked access is using PendingSpan (internal) and moving the lock there. At least there, it has 1-1 relationship to what it is guarding (a mutable span)

Got it, thanks @codefromthecrypt . To elaborate a bit on the issue, synchronized over state is fine and non-issue with virtual thread (it is super short and I didn't change it expect here) but pendingSpans.xxx methods have unpredictable latency and may hold synchronized for a looong time - this is an issue.

Before the object was its own lock, guarded by itself. Now we are externalizing it even though the object (MutableSpan) can be mutated via various callers.

That's I think is one of the problems: we cannot reliably guard MutableSpan unless all accesses to it are guarded as well (not necessarily internally). But thanks a lot for highlighting context <-> mutablespan link, I will look further.

So, I think we need to focus on what is the pattern suggested for migrating to ReentrantLock from synchronized, when in synchronized we can use the object itself as the lock. Be careful to not expose API surface when thinking this through!

ReentrantLock sadly does not sadly help us much here since synchronized(state) in brute force would be equivalent to something like state.lock.lock(); .... - exactly the problem you mentioned, API surface will be harmed.

Copy link
Member

@codefromthecrypt codefromthecrypt Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we solve for the issue then and leave the rest alone?

So, the only reason why state is held for a long time (microseconds) is when it is being sized for encoding in the async reporter, which is in another project, or someone is doing large amount of work tagging (bug).

One option could be that when a configuration is present, use copy mode which will be faster than microseconds (faster than sizing the span as you don't need to check encoded size of strings etc).

Also, the reported stack trace is inherently inefficient as it is copying the span from brave's type to zipkin's type first (ZipkinSpanHandler). There's a brave-native encoder out for years, but for some reason wasn't queued for micrometer until I personally added it in 1.3 (`https://github.com/micrometer-metrics/tracing/releases/tag/v1.3.0-M2

    java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(Unknown Source)
    java.base/java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
    zipkin2.reporter.ByteBoundedQueue.offer(ByteBoundedQueue.java:51)
    zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.report(AsyncReporter.java:259)
    zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:44)
    zipkin2.reporter.brave.ConvertingSpanReporter.report(ConvertingSpanReporter.java:29)
    zipkin2.reporter.brave.ZipkinSpanHandler.end(ZipkinSpanHandler.java:148)
    brave.internal.handler.NoopAwareSpanHandler$CompositeSpanHandler.end(NoopAwareSpanHandler.java:107)
    brave.internal.handler.NoopAwareSpanHandler.end(NoopAwareSpanHandler.java:57)

Finally, I really think we owe it to define what is expensive blocking and so we can measure it. We are assuming this is expensive just because some message ended up in the log of a suboptimal setup.

Copy link
Member

@codefromthecrypt codefromthecrypt Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracing in micrometer is very slow due to all the frameworks used to the point where overhead vs other options is milliseconds worse, not microseconds. Staring at this issue as if it holistically changes things is highly suspicious and a questionable reason to introduce a lot of risk. I think the most narrow solution to the problem, including muting log messages is a decent thing to consider.

e.g. measure this and you'll see hello world requests with millis extra overhead. https://github.com/openzipkin/brave-example/tree/master/webflux6-micrometer

try {
pendingSpans.finish(context, timestamp);
} finally {
lock.unlock();
}
}

@Override public void abandon() {
pendingSpans.abandon(context);
lock.lock();
try {
pendingSpans.abandon(context);
} finally {
lock.unlock();
}
}

@Override public void flush() {
pendingSpans.flush(context);
lock.lock();
try {
pendingSpans.flush(context);
} finally {
lock.unlock();
}
}

@Override public String toString() {
Expand Down
7 changes: 6 additions & 1 deletion brave/src/main/java/brave/internal/Platform.java
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@
import java.security.SecureRandom;
import java.util.Enumeration;
import java.util.Random;
import java.util.concurrent.locks.ReentrantLock;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
Expand All @@ -35,6 +36,7 @@
public abstract class Platform implements Clock {
private static final Platform PLATFORM = findPlatform();

private final ReentrantLock lock = new ReentrantLock();
volatile String linkLocalIp;

/** Returns the same value as {@link System#nanoTime()}. */
Expand All @@ -47,10 +49,13 @@ public abstract class Platform implements Clock {
@Nullable public String linkLocalIp() {
// uses synchronized variant of double-checked locking as getting the endpoint can be expensive
if (linkLocalIp != null) return linkLocalIp;
synchronized (this) {
lock.lock();
try {
if (linkLocalIp == null) {
linkLocalIp = produceLinkLocalIp();
}
} finally {
lock.unlock();
}
return linkLocalIp;
}
Expand Down
10 changes: 7 additions & 3 deletions brave/src/main/java/brave/internal/extra/Extra.java
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright 2013-2020 The OpenZipkin Authors
* Copyright 2013-2024 The OpenZipkin Authors
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
* in compliance with the License. You may obtain a copy of the License at
Expand All @@ -17,6 +17,7 @@
import brave.propagation.TraceContext;
import brave.propagation.TraceContextOrSamplingFlags;
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

/**
* Holds extended state in {@link TraceContext#extra()} or {@link TraceContextOrSamplingFlags#extra()}.
Expand All @@ -39,7 +40,7 @@ public abstract class Extra<E extends Extra<E, F>, F extends ExtraFactory<E, F>>
* Updates like {@link #tryToClaim(long, long)} lock on this object, as should any non-atomic
* {@link #state} updates.
*/
protected final Object lock = new Object();
protected final ReentrantLock lock = new ReentrantLock();
/**
* Lock on {@link #lock} when comparing existing values for a state update that happens after
* {@link #mergeStateKeepingOursOnConflict(Extra)}.
Expand Down Expand Up @@ -74,13 +75,16 @@ protected Extra(F factory) {

/** Fields are extracted before a context is created. We need to lazy set the context */
final boolean tryToClaim(long traceId, long spanId) {
synchronized (lock) {
lock.lock();
try {
if (this.traceId == 0L) {
this.traceId = traceId;
this.spanId = spanId;
return true;
}
return this.traceId == traceId && this.spanId == spanId;
} finally {
lock.unlock();
}
}

Expand Down
5 changes: 4 additions & 1 deletion brave/src/main/java/brave/internal/extra/MapExtra.java
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,8 @@ protected boolean put(K key, @Nullable V value) {
return false;
}

synchronized (lock) {
lock.lock();
try {
Object[] prior = state();

// double-check lost race in dynamic case
Expand All @@ -105,6 +106,8 @@ protected boolean put(K key, @Nullable V value) {
newState[i + 1] = value;
this.state = newState;
return true;
} finally {
lock.unlock();
}
}

Expand Down