
[BUG] SpanData validation failed for validator org.opensearch.test.telemetry.tracing.validators.AllSpansAreEndedProperly flakiness #11310

Open
reta opened this issue Nov 22, 2023 · 11 comments

Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), Telemetry:Tracing (PRs or issues specific to telemetry tracing framework)

reta (Collaborator) commented Nov 22, 2023

Describe the bug

It seems like we have some failure scenarios that are not covered by the tracing instrumentation framework, which randomly fail the validation checks:

java.lang.AssertionError: SpanData validation failed for validator org.opensearch.test.telemetry.tracing.validators.AllSpansAreEndedProperly/nMockSpanData{spanID='85028fb97ab9d4ee', parentSpanID='6bb0d2126212ce2e', traceID='75ff3affa0cc1c0c', spanName='dispatchedShardOperationOnPrimary', startEpochNanos=1801185791106, endEpochNanos=0, hasEnded=false, attributes={index=restart_while_indexing, refresh_policy=false, thread.name=opensearch[node_t0][write][T#1], bulk_request_items=1, shard_id=0, node_id=m8PJAwK_RzCo2kkho3PcxA}, stackTrace=[java.base/java.lang.Thread.getStackTrace(Thread.java:2450), org.opensearch.test.telemetry.tracing.StrictCheckSpanProcessor.toMockSpanData(StrictCheckSpanProcessor.java:63), org.opensearch.test.telemetry.tracing.StrictCheckSpanProcessor.onStart(StrictCheckSpanProcessor.java:34), org.opensearch.test.telemetry.tracing.MockTracingTelemetry.createSpan(MockTracingTelemetry.java:35), org.opensearch.telemetry.tracing.DefaultTracer.createSpan(DefaultTracer.java:94), org.opensearch.telemetry.tracing.DefaultTracer.startSpan(DefaultTracer.java:55), org.opensearch.telemetry.tracing.WrappedTracer.startSpan(WrappedTracer.java:43), org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:231), org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908), org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52), java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144), java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642), java.base/java.lang.Thread.run(Thread.java:1583)]}
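For context, the validator essentially asserts that every span recorded during the test has been ended. Below is a minimal sketch of that kind of check, with the span type reduced to the one flag it cares about; the real logic lives in org.opensearch.test.telemetry.tracing.validators.AllSpansAreEndedProperly and may differ in detail.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative stand-in for the recorded span data; only the field this check needs is modelled.
interface SpanData {
    String spanId();
    boolean hasEnded();
}

final class AllSpansEndedCheck {
    // Fails the test if any recorded span was started but never ended,
    // which is exactly what the failure above reports (hasEnded=false).
    static void assertAllEnded(List<SpanData> spans) {
        List<String> notEnded = spans.stream()
            .filter(span -> !span.hasEnded())
            .map(SpanData::spanId)
            .collect(Collectors.toList());
        if (!notEnded.isEmpty()) {
            throw new AssertionError("SpanData validation failed, un-ended spans: " + notEnded);
        }
    }
}
```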

To Reproduce

./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.ClusterDisruptionIT" -Dtests.seed=22054582CD9D3248

Expected behavior
The checks must always pass

Plugins
Standard


Host/Environment (please complete the following information):

  • CI


reta added the bug and untriaged labels on Nov 22, 2023
reta (Collaborator, Author) commented Nov 22, 2023

@Gaganjuneja FYI

Gaganjuneja (Contributor) commented:

I will take a look. Thanks!

Gaganjuneja (Contributor) commented:

@mgodwan @rayshrey Need help here. This is flaky for the test case testRestartNodeWhileIndexing and the span name dispatchedShardOperationOnPrimary. How can we make sure that the span is ended when the node restarts?

sarthakaggarwal97 (Contributor) commented:

@Gaganjuneja

It does not look like the test itself is the issue. In my experiments, I observed that the moment I turn off the mapping updates in the test (by keeping the fields constant in the source), the test starts to pass consistently.
The test is hard to reproduce, and I was unable to do so with a seed.

This seems to be a very narrow corner case, where we are unable to close the span when a mapping update happens during a node restart. The span stays in memory without any execution thread associated with it.

There are two options we can follow from here:

  1. Get an API from the request tracing framework that lets us exclude rare cases such as node restarts from the span checks (preferred; a rough sketch of what this could look like follows this list).
  2. Implement a way to drain the spans from the cluster state for such cases.
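For illustration only, option 1 could take a shape like the following; none of these class or method names exist in the framework today, this is just a sketch of a test-only escape hatch keyed on the restarted node's id.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical test-only hook; neither this class nor these methods exist in the tracing framework.
final class SpanValidationExclusions {
    // Node ids whose un-ended spans should be ignored by the validators,
    // e.g. because the test restarted that node on purpose.
    private static final Set<String> RESTARTED_NODES = ConcurrentHashMap.newKeySet();

    static void markNodeRestarted(String nodeId) {
        RESTARTED_NODES.add(nodeId);
    }

    // A validator would consult this with the span's node_id attribute before failing the test.
    static boolean shouldIgnore(String spanNodeId) {
        return RESTARTED_NODES.contains(spanNodeId);
    }
}
```

The obvious risk, raised below, is that such an escape hatch could be used to silence real leaks rather than fix them.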

Gaganjuneja (Contributor) commented:

@sarthakaggarwal97

Get an API from the request tracing framework that lets us exclude rare cases such as node restarts from the span checks (preferred).

We can do that, but it's very hard to make sure it is used cautiously and not exploited just to make the test pass while the underlying memory leak remains. @reta your thoughts here?

reta (Collaborator, Author) commented Feb 5, 2024

Thanks @sarthakaggarwal97

This seems to be a very narrow corner case, where we are unable to close the span when a mapping update happens during a node restart. The span stays in memory without any execution thread associated with it.

This is concerning behaviour: to me, such edge cases point to failure flows that are not instrumented. It would be great to understand why the restart (which should basically be a clean stop/start sequence) keeps some components in a "running" state (unaware of the restart). We should not be looking for shortcuts here.

sarthakaggarwal97 (Contributor) commented Feb 6, 2024

@reta @Gaganjuneja

Let me elaborate on the issue with two span cases, one that succeeded and one that failed. In both cases a mapping update was required. I would like to get your input on how we can proceed.

Happy Case

Consider that during active indexing a new span 834f656ab29fa68d was created. The span was traced through dispatchedShardOperationOnPrimary -> performOnPrimary -> executeBulkItemRequest:

1> [2024-02-05T23:27:53,465][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in dispatchedShardOperationOnPrimary op: 834f656ab29fa68d
1> [2024-02-05T23:27:53,465][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in performOnPrimary 834f656ab29fa68d at doc id144
1> [2024-02-05T23:27:53,465][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in executeBulkItemRequest:834f656ab29fa68d ---> 144

It realizes that the mappings need to be updated:

1> [2024-02-05T23:27:53,465][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in MAPPING_UPDATE_REQUIRED 834f656ab29fa68d

The flow reaches the onResponse of the mapping update:

1> [2024-02-05T23:27:53,506][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in MAPPING_UPDATE_REQUIRED onResponse 834f656ab29fa68d
1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in waitForMappingUpdate onResponse 834f656ab29fa68d

After the mapping update, the executor is attached and executeBulkItemRequest continues:

1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Item Done Listener: 834f656ab29fa68d
  1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Executor invoked: 834f656ab29fa68d
  1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Executor invoked: 834f656ab29fa68d
  1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in executeBulkItemRequest:834f656ab29fa68d ---> 144
  1> [2024-02-05T23:27:53,510][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] 834f656ab29fa68d Before notifyOnce
  1> [2024-02-05T23:27:53,510][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] 834f656ab29fa68d listener -->org.opensearch.telemetry.tracing.listener.TraceableActionListener@3d5c9507
  1> [2024-02-05T23:27:53,510][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Before Listener invoked 834f656ab29fa68d --- Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT] --- 
  1> [2024-02-05T23:27:53,510][INFO ][o.o.t.t.l.TraceableActionListener] [node_t1] Closing span in onResponse: 834f656ab29fa68d --- Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,510][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] Span waiting for lock to be ended 834f656ab29fa68d  by Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,510][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] Span 834f656ab29fa68d ended by Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,510][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] 834f656ab29fa68d has endedThread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,511][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Listener invoked 834f656ab29fa68d --- Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT] --- interrupted false--- java.base@21.0.2/java.lang.Thread.dumpThreads(Native Method)java.base@21.0.2/java.lang.Thread.getAllStackTraces(Thread.java:2521)app//org.opensearch.action.bulk.TransportShardBulkAction.lambda$performOnPrimary$5(TransportShardBulkAction.java:499)app//org.opensearch.action.bulk.TransportShardBulkAction$$Lambda/0x0000000800b317c8.run(Unknown Source)app//org.opensearch.core.action.ActionListener$5.onResponse(ActionListener.java:275)app//org.opensearch.core.action.ActionListener$7.onResponse(ActionListener.java:324)app//org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:295)app//org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:378)app//org.opensearch.action.bulk.TransportShardBulkAction$2.finishRequest(TransportShardBulkAction.java:573)app//org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:535)app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913)app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)java.base@21.0.2/java.lang.Thread.runWith(Thread.java:1596)java.base@21.0.2/java.lang.Thread.run(Thread.java:1583)
  1> [2024-02-05T23:27:53,511][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] 834f656ab29fa68d After notifyOnce
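The "Closing span in onResponse" and "Span ... ended" lines above come from an ActionListener wrapper that ends the span once the response (or failure) arrives. Below is a simplified sketch of that pattern; it is not the actual org.opensearch.telemetry.tracing.listener.TraceableActionListener code, and the interfaces are trimmed down for illustration.

```java
// Trimmed-down stand-ins for the real OpenSearch interfaces, for illustration only.
interface Span {
    void setError(Exception e);
    void endSpan();
}

interface ActionListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

// Wraps the original listener so the span is always ended when a callback fires,
// which is the behaviour visible in the happy-case logs above.
final class SpanEndingListener<T> implements ActionListener<T> {
    private final ActionListener<T> delegate;
    private final Span span;

    SpanEndingListener(ActionListener<T> delegate, Span span) {
        this.delegate = delegate;
        this.span = span;
    }

    @Override
    public void onResponse(T response) {
        try {
            delegate.onResponse(response);
        } finally {
            span.endSpan();
        }
    }

    @Override
    public void onFailure(Exception e) {
        try {
            delegate.onFailure(e);
        } finally {
            span.setError(e);
            span.endSpan();
        }
    }
}
```

In the sad case below, the node restarts before any callback fires, so neither branch runs and the span is never ended.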

Sad Case

Following the span from the exception:

MockSpanData{spanID='f0fcea06161f2245', parentSpanID='16086d76dc781c86', traceID='feef9ff717b884cc', spanName='dispatchedShardOperationOnPrimary', startEpochNanos=24261684302000, endEpochNanos=0, hasEnded=false, attributes={index=restart_while_indexing, refresh_policy=false, thread.name=opensearch[node_t1][write][T#3], bulk_request_items=1, shard_id=0, node_id=Lf0XLkkwScubTL0g8s-q_g}

During active indexing, a new span f0fcea06161f2245 was created. The span was traced through dispatchedShardOperationOnPrimary -> performOnPrimary -> executeBulkItemRequest:

  1> [2024-02-05T23:27:53,469][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in dispatchedShardOperationOnPrimary op: f0fcea06161f2245
  1> [2024-02-05T23:27:53,469][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in performOnPrimary f0fcea06161f2245 at doc id146
  1> [2024-02-05T23:27:53,469][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in executeBulkItemRequest:f0fcea06161f2245 ---> 146

It realizes that the mappings need to be updated:

  1> [2024-02-05T23:27:53,469][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in MAPPING_UPDATE_REQUIRED f0fcea06161f2245

Before the span can reach the mapping update response, the node is restarted:

  1> [2024-02-05T23:27:53,481][INFO ][o.o.t.InternalTestCluster] [testRestartNodeWhileIndexing] Restarting node [node_t1] 
  1> [2024-02-05T23:27:53,481][INFO ][o.o.t.InternalTestCluster] [testRestartNodeWhileIndexing] adding voting config exclusions [node_t1] prior to restart/shutdown

This span therefore never reaches the mapping update response.

However, the parent span ends after the restart:

1> [2024-02-05T23:27:53,601][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] Span waiting for lock to be ended 16086d76dc781c86  by Thread[#224,opensearch[node_t1][generic][T#2],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,601][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] Span 16086d76dc781c86 ended by Thread[#224,opensearch[node_t1][generic][T#2],5,TGRP-ClusterDisruptionIT]
  

Attaching the logs: op.log
Span Trace Instrumentation: https://github.com/sarthakaggarwal97/OpenSearch/tree/test-flaky-tracer

sarthakaggarwal97 (Contributor) commented Feb 6, 2024

@reta @Gaganjuneja

With this, I have a couple of questions as well:

  1. Is it acceptable for the parent span to be closed before its child spans are closed?

It would be great to understand why the restart (which basically should be clean stop / start sequence) keeps some components in "running" state (not aware of restart).

  2. Is there a way to confirm whether the span stays in memory without an executor being attached later on? I'm not sure about the lifecycle of spans during node restarts or start/stop.

Gaganjuneja (Contributor) commented Feb 6, 2024

@sarthakaggarwal97

Is it acceptable for the parent span to be closed before its child spans are closed?

Yes, the parent span can close before its children in async scenarios.

Is there a way to confirm whether the span stays in memory without an executor being attached later on? I'm not sure about the lifecycle of spans during node restarts or start/stop.

A span stays in memory until we call endSpan. Of course, when the node stops (the OpenSearch process stops), spans will be cleared. But the ideal state is that on a graceful shutdown we close all the open spans. This is also a validation that we know how to react to node stop/start with outstanding indexing requests: all open requests/tasks/actions should be closed gracefully and all resources released.
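For illustration, the bookkeeping that keeps an un-ended span "in memory" is essentially a map keyed by span id that is only updated when endSpan runs. A minimal sketch with simplified names; the real implementation is org.opensearch.test.telemetry.tracing.StrictCheckSpanProcessor and may differ in detail.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of start/end bookkeeping; not the actual StrictCheckSpanProcessor code.
final class SketchSpanProcessor {
    record RecordedSpan(String spanId, boolean hasEnded) {}

    private final Map<String, RecordedSpan> spanMap = new ConcurrentHashMap<>();

    void onStart(String spanId) {
        // Every started span is recorded with hasEnded=false ...
        spanMap.put(spanId, new RecordedSpan(spanId, false));
    }

    void onEnd(String spanId) {
        // ... and only flipped to hasEnded=true when endSpan is eventually called.
        spanMap.computeIfPresent(spanId, (id, span) -> new RecordedSpan(id, true));
    }

    // If the node restarts before onEnd runs, the entry keeps hasEnded=false and the
    // AllSpansAreEndedProperly validation fails at the end of the test, as reported above.
}
```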

sarthakaggarwal97 (Contributor) commented:

Yes, the parent span can close before its children in async scenarios.

Ack!

But the ideal state is that on a graceful shutdown we close all the open spans.

Correct, but here's an interesting case.

The mapping update request R1 reaches the cluster-manager node T1, and the node restarts. In this case, or when node T1 fails to rejoin the cluster, the data node will retry the request against the new cluster-manager node T2.
After the node restart, the span of R1 is cleared from the thread context, but it keeps sitting in the spanMap of StrictCheckSpanProcessor.

Unless there is a way to gracefully drain the spans on node restarts in the tests, i.e. remove them from the spanMap for the tests, the validation will keep failing. We need such a mechanism in place to fix this flaky test (a rough sketch of one possible shape is below).
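One hypothetical shape for such a drain hook, strictly test-only; none of these names exist in StrictCheckSpanProcessor today, and how span ids would be mapped to node ids is an open question.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical test-only drain hook; no such method exists in the framework today.
final class SpanDrainSketch {
    // Called right after the test cluster restarts a node, so spans stranded on that node
    // are removed from the bookkeeping map instead of failing the end-of-test validation.
    static <S> void drainSpansForRestartedNode(Map<String, S> spanMap,
                                               Map<String, String> spanIdToNodeId,
                                               String restartedNodeId) {
        spanIdToNodeId.forEach((spanId, nodeId) -> {
            if (Objects.equals(nodeId, restartedNodeId)) {
                spanMap.remove(spanId);
            }
        });
    }
}
```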

reta (Collaborator, Author) commented Feb 6, 2024

@sarthakaggarwal97 thanks for the detailed flow

Before the span can reach the mapping update response, the node is restarted

Aha, so basically what we have here is:

  • the transport action is being executed (MAPPING_UPDATE_REQUIRED)
  • the node is going through a clean shutdown procedure (before restart)
  • the thread pool should be asked to shut down
    • the transport action should be interrupted (with an onFailure callback)?

I will try to play with your reproducer to understand where we have a gap, thank you.
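To make the last bullet concrete, the shutdown path would need to fail whatever is still waiting on the mapping update, so that a span-ending listener (like the sketch in the happy-case comment above) gets its onFailure call. A rough sketch under those assumptions; the class and callback names below are illustrative, not the actual OpenSearch shutdown path.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Rough sketch only: in OpenSearch the real shutdown goes through node close and thread
// pool termination; the names below are illustrative.
final class PendingMappingUpdates {
    // Each pending operation is represented here by its failure callback; in the real code
    // this would be the ActionListener waiting for the mapping-update response.
    private final Queue<Consumer<Exception>> pendingFailures = new ConcurrentLinkedQueue<>();

    void register(Consumer<Exception> onFailure) {
        pendingFailures.add(onFailure);
    }

    // Invoked before the node restarts: every still-pending operation gets a failure callback,
    // which would let the wrapping listener end its span instead of leaving hasEnded=false.
    void failAllOnShutdown() {
        Consumer<Exception> onFailure;
        while ((onFailure = pendingFailures.poll()) != null) {
            onFailure.accept(new IllegalStateException("node is shutting down"));
        }
    }
}
```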
