-
Notifications
You must be signed in to change notification settings - Fork 356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JdkConnector: JerseyInvocation stuck after DestinationConnectionPool thrown an IllegalStateException #4810
Comments
In the stacktrace there is: |
Sure, it's a thin wrapper around a real provider to ensure that the client OpenTracing span is closed even when response filter is not executed (ex. failed DNS resolution). Never found a better way to do that with JAX-RS. It doubt it could be the cause of this failure but I will re-review it. I had no time to dig into this issue yet, but my gut feeling is that something could be wrong in the locking, or expected happens-befores, performed by DestinationConnectionPool. However, I have no reproducer yet and didn't had time to deep dive into the implementation. /**
* Workaround JAX-RS 2.1 client filter limitations regarding OpenTracing instrumentation.
*
* <p>
* JAX-RS filters suffer from a design issue: on error case response filter are not invoked, see
* https://github.com/eclipse-ee4j/jaxrs-api/issues/684. It prevents OpenTracing instrumentation from closing the span.
* This provider works-around this issue. It ensures that spans are finished and error attached to the span.
*/
final class RectifyOpentracingSpanProvider implements ConnectorProvider {
private final ConnectorProvider delegate;
RectifyOpentracingSpanProvider( ConnectorProvider delegate ) {
this.delegate = delegate;
}
@Override
public Connector getConnector( Client client, Configuration runtimeConfig ) {
var realConnector = delegate.getConnector( client, runtimeConfig );
return new RectifyOpentracingConnector( realConnector );
}
static class RectifyOpentracingConnector implements Connector {
private final Connector delegate;
RectifyOpentracingConnector( Connector delegate ) {
this.delegate = delegate;
}
@Override
public ClientResponse apply( ClientRequest request ) {
try {
return delegate.apply( request );
} catch( ProcessingException pe ) {
var cause = pe.getCause( );
SpanWrapper spanWrapper = CastUtils.cast( request.getProperty( PROPERTY_NAME ), SpanWrapper.class );
if( spanWrapper != null && !spanWrapper.isFinished( ) ) {
var span = spanWrapper.get( );
Tags.ERROR.set( span, true );
if( cause != null ) {
span.log( Map.of(
"event", Tags.ERROR.getKey( ),
"error.object", cause ) );
}
spanWrapper.finish( );
}
throw pe;
}
}
@Override
public Future<?> apply( ClientRequest request, AsyncConnectorCallback delegateCallback ) {
return delegate.apply( request, new AsyncConnectorCallback( ) {
@Override
public void response( ClientResponse response ) {
delegateCallback.response( response );
}
@Override
public void failure( Throwable cause ) {
SpanWrapper spanWrapper = CastUtils.cast( request.getProperty( PROPERTY_NAME ), SpanWrapper.class );
if( spanWrapper != null && !spanWrapper.isFinished( ) ) {
var span = spanWrapper.get( );
Tags.ERROR.set( span, true );
span.log( Map.of(
"event", Tags.ERROR.getKey( ),
"error.object", cause ) );
spanWrapper.finish( );
}
delegateCallback.failure( cause);
}
} );
}
@Override
public String getName( ) {
return delegate.getName( );
}
@Override
public void close( ) {
delegate.close( );
}
}
} |
This issue is very tricky and requires some discussion. Based in the stack traces, it is using the synchronous method of the JdkConnector. The problem is that inside of it, it manages the response in asynchronous way, an it waits for the response of it. This is where the thread is stuck in your stack trace. I have tried to make it really synchronous. Removing the Future inside of the JdkConnector#apply, but this does not help, because later the request is managed by HttpConnectionPool anyway in an asynchronous way. This HttpConnectionPool is taking care of the running connections. They are identified by URI, so 2 different requests using the same URI could reuse the same connection, making it to perform better. I think in this issue there is a race condition when processing more than 1 request with the same URI. There are 3 possible "solutions": |
Thanks for the feedback, it matches my intuition.
Connection reuse is a must-have to me. Creating a new connection with TLS handshake for each call is way too expansive. As a fun fact, we are migrating away from HttpUrlConnectorProvider because we discovered that TLS streams got terminated on TLS alerts under heavy load. Effective behavior was closer to
While it makes sense and would likely be better than being stuck forever, I have to evaluate if it enables application to better react to such issue. Figuring out an adequate timeout could also be an issue. More generally, what production grade connector would the Jersey team recommend using? I had like to avoid "heavy" dependencies, especially if the connector doesn't support HTTP/2. |
We just deployed a few more applications with jdk-connector enabled and this race condition seems much more common than I expected. It took down more than 5 pods, with small load, in less than 12h. This seems a showstopper and jdk-connector should not be used until this get fixed. |
@cykl I am trying to reproduce the issue with the next test, but I had no 'bad luck' so far. Could you check with a thread dump if you have more threads hanged?. This is the test, one hangs all the requests and the other is more random.
|
All production occurrences I have investigated had only one thread blocked. But, this can easily be explained by the architecture of the applications. They are batch jobs parallelizing HTTP calls using https://github.com/pivovarit/parallel-collectors. When a call blocks, current batch will timeout within seconds / minutes and abort or get stuck giving almost no chance for another call to block. |
When I enabled SSL in the test I found a deadlock that is relatively easy to reproduce. Maybe this is related to your issue or maybe not, but at least I have something to work on. I attach the stacks of the blocked threads. It is happening when the test is shutting down:
|
👍 I do confirm that all our calls issued by jdk-connector use TLS. |
I was able to reproduce the issue, without TLS, using following terrible code. As consistency issue is likely related to a connection being closed or replaced, I set connection idle timeout to a very low value to ensure a high connection churn. We use a 60s timeout in our applications. The good news is that it that it doesn't seem that hard to reproduce (it usually occurs after ~60s). The not so good news is that I observed other dubious behaviors:
Because of 2, I don't have a good predicate to detect that the issue we are looking for occured, and you must manually check the error output for the presence of "Request not found" or set a break point. Note: Using OkHttp's MockWebServer & ParallelCollectors was the quickest way for me to write such code, my apologizies for the headhache it might cause. I didn't spend time to port it into Jersey codebase as I am currently unable to write some kind of real unit test. package com.greencomnetworks.jersey.client;
import com.google.common.util.concurrent.ThreadFactoryBuilder;
import com.pivovarit.collectors.ParallelCollectors;
import java.io.IOException;
import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;
import java.util.stream.IntStream;
import javax.ws.rs.ProcessingException;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import okhttp3.mockwebserver.Dispatcher;
import okhttp3.mockwebserver.MockResponse;
import okhttp3.mockwebserver.MockWebServer;
import okhttp3.mockwebserver.RecordedRequest;
import org.glassfish.jersey.client.ClientConfig;
import org.glassfish.jersey.jdk.connector.JdkConnectorProperties;
import org.glassfish.jersey.jdk.connector.JdkConnectorProvider;
import org.junit.jupiter.api.Test;
public class Reproducer {
@Test
void reproducer( ) throws IOException, ExecutionException, InterruptedException, TimeoutException {
var parallelism = 16;
var cnxTimeout = Duration.ofMillis( 50 );
try (var server = new MockWebServer( )) {
server.setDispatcher( new Dispatcher( ) {
public MockResponse dispatch( RecordedRequest recordedRequest ) {
return new MockResponse( );
}
} );
Supplier<Client> clientFactory = ( ) -> {
return ClientBuilder.newBuilder( )
.withConfig( new ClientConfig( )
.property( JdkConnectorProperties.MAX_CONNECTIONS_PER_DESTINATION, parallelism )
.property( JdkConnectorProperties.CONNECTION_IDLE_TIMEOUT,
( int ) cnxTimeout.toMillis( ) )
.connectorProvider( new JdkConnectorProvider( ) ) )
.build( );
};
var executor = Executors.newFixedThreadPool( parallelism, new ThreadFactoryBuilder( )
.setNameFormat( "client-%d" )
.build( ) );
Client client = null;
for( int iteration = 0; iteration < Integer.MAX_VALUE; iteration++ ) {
if( client == null ) {
client = clientFactory.get( );
}
var target = client.target( server.url( "/" ).toString( ) );
try {
IntStream.range( 0, 512 )
.mapToObj( Integer::toString )
.collect( ParallelCollectors.parallel(
i -> {
try {
target.request( ).get( );
} catch( ProcessingException e ) {
if( e.getMessage( ).contains( "Current state: CLOSED" ) ) {
System.out.println( e.getMessage( ) ); // Not what we are looking for
} else {
throw e;
}
}
return i;
},
executor,
parallelism ) )
.get( 15, TimeUnit.SECONDS );
// Might help to trigger the race condition as it likely happen on cnx close
var sleepDuration =
( long ) ( ThreadLocalRandom.current( ).nextLong( cnxTimeout.toMillis( ) ) * 1.1 );
Thread.sleep( sleepDuration );
} catch( TimeoutException e ) {
System.err.printf(
"Maybe reproduced #4810 at iteration %d, check logs for \"Request not found\"\n",
iteration );
Thread.getAllStackTraces( ).forEach( ( thread, elements ) -> {
if( thread.getName( ).startsWith( "client-" ) && elements.length > 10 ) {
System.out.println( thread );
Arrays.stream( elements )
.forEach( element -> System.out.printf( "\t%s\n", element ) );
}
} );
client.close( );
client = null;
}
}
}
}
} |
I am reproducing it too. I have to trigger more requests than threads we have in the fixed thread pool.
In the iteration 762 I'm having the issue. |
I found there are items in DestinationConnectionPool#pendingRequests when the threads were blocked forever. This matches with what you pointed in the issue. That field was a HashMap and it is accessed concurrently. After modifying it to ConcurrentHashMap my tests are passing. @cykl would you help me to test my changes with your reproducer, please?. You need to do the next:
|
I just ran my reproducer a few times using 2.34 and your branch. I can confirm that your fix seems to make the "Request not found" issue vanish. However, I still observe many blocked client threads. No stack trace or error message is printed; client threads are just waiting for the future to complete. There is likely another synchronization issue somewhere in the connector. I was observing this issue with my reproducer using 2.34 but never saw it in the wild. Note that I added Thread[client-12,5,main]
java.base@11.0.10/jdk.internal.misc.Unsafe.park(Native Method)
java.base@11.0.10/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java.base@11.0.10/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796)
java.base@11.0.10/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
java.base@11.0.10/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823)
java.base@11.0.10/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998)
app//org.glassfish.jersey.jdk.connector.internal.JdkConnector.apply(JdkConnector.java:72)
app//org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:297)
app//org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$0(JerseyInvocation.java:662)
app//org.glassfish.jersey.client.JerseyInvocation$$Lambda$507/0x0000000800313040.call(Unknown Source)
app//org.glassfish.jersey.client.JerseyInvocation.call(JerseyInvocation.java:697)
app//org.glassfish.jersey.client.JerseyInvocation.lambda$runInScope$3(JerseyInvocation.java:691)
app//org.glassfish.jersey.client.JerseyInvocation$$Lambda$510/0x0000000800340440.call(Unknown Source)
app//org.glassfish.jersey.internal.Errors.process(Errors.java:292)
app//org.glassfish.jersey.internal.Errors.process(Errors.java:274)
app//org.glassfish.jersey.internal.Errors.process(Errors.java:205)
app//org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:390)
app//org.glassfish.jersey.client.JerseyInvocation.runInScope(JerseyInvocation.java:691)
app//org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:661)
app//org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:413)
app//org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:313)
app//com.greencomnetworks.jersey.client.Reproducer2.lambda$reproducer$1(Reproducer2.java:70)
app//com.greencomnetworks.jersey.client.Reproducer2$$Lambda$340/0x000000080019e040.apply(Unknown Source)
app//com.pivovarit.collectors.AsyncParallelCollector.lambda$null$1(AsyncParallelCollector.java:62)
app//com.pivovarit.collectors.AsyncParallelCollector$$Lambda$348/0x000000080019c440.get(Unknown Source)
app//com.pivovarit.collectors.Dispatcher.lambda$completionTask$2(Dispatcher.java:90)
app//com.pivovarit.collectors.Dispatcher$$Lambda$349/0x000000080019b840.run(Unknown Source)
java.base@11.0.10/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
java.base@11.0.10/java.util.concurrent.FutureTask.run(FutureTask.java:264)
app//com.pivovarit.collectors.Dispatcher.lambda$withFinally$4(Dispatcher.java:117)
app//com.pivovarit.collectors.Dispatcher$$Lambda$352/0x000000080019a840.run(Unknown Source)
java.base@11.0.10/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@11.0.10/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@11.0.10/java.lang.Thread.run(Thread.java:834) |
Thank you for your tests. Yes, it seems there is something else to fix. I'm going to leave the test running for hours and lets see if it gets stuck at some point. |
One of our application got a thread stuck waiting for a future that will never complete after what seems to be an internal corruption in jdk-connector. Exception occurred at 03AM and thread was still stuck at 09AM.
Client is configured with a jdk-connector, MAX_CONNECTIONS_PER_DESTINATION = 32, CONNECTION_IDLE_TIMEOUT, 60_0000. I don't have proper metrics, but client was fairly busy when it happened.
This application was using 2.32. I just skimmed through git log -p and didn't see any change that would suggest that a newer version could fix the issue.
The text was updated successfully, but these errors were encountered: