
[BUG] SpanData validation failed for validator org.opensearch.test.telemetry.tracing.validators.AllSpansAreEndedProperly flakiness #11310

Open
reta opened this issue Nov 22, 2023 · 11 comments

Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), Telemetry:Tracing (PRs or issues specific to telemetry tracing framework)

reta (Collaborator) commented Nov 22, 2023

Describe the bug

It seems like we have some failure scenarios that are not covered by the tracing instrumentation framework, which randomly fail the validation checks:

java.lang.AssertionError: SpanData validation failed for validator org.opensearch.test.telemetry.tracing.validators.AllSpansAreEndedProperly/nMockSpanData{spanID='85028fb97ab9d4ee', parentSpanID='6bb0d2126212ce2e', traceID='75ff3affa0cc1c0c', spanName='dispatchedShardOperationOnPrimary', startEpochNanos=1801185791106, endEpochNanos=0, hasEnded=false, attributes={index=restart_while_indexing, refresh_policy=false, thread.name=opensearch[node_t0][write][T#1], bulk_request_items=1, shard_id=0, node_id=m8PJAwK_RzCo2kkho3PcxA}, stackTrace=[java.base/java.lang.Thread.getStackTrace(Thread.java:2450), org.opensearch.test.telemetry.tracing.StrictCheckSpanProcessor.toMockSpanData(StrictCheckSpanProcessor.java:63), org.opensearch.test.telemetry.tracing.StrictCheckSpanProcessor.onStart(StrictCheckSpanProcessor.java:34), org.opensearch.test.telemetry.tracing.MockTracingTelemetry.createSpan(MockTracingTelemetry.java:35), org.opensearch.telemetry.tracing.DefaultTracer.createSpan(DefaultTracer.java:94), org.opensearch.telemetry.tracing.DefaultTracer.startSpan(DefaultTracer.java:55), org.opensearch.telemetry.tracing.WrappedTracer.startSpan(WrappedTracer.java:43), org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:231), org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908), org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52), java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144), java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642), java.base/java.lang.Thread.run(Thread.java:1583)]}
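For context, the validator essentially asserts that every span recorded during the test has been ended. Below is a minimal sketch of that kind of check, with the span type reduced to the one flag it cares about; the real logic lives in org.opensearch.test.telemetry.tracing.validators.AllSpansAreEndedProperly and may differ in detail.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative stand-in for the recorded span data; only the field this check needs is modelled.
interface SpanData {
    String spanId();
    boolean hasEnded();
}

final class AllSpansEndedCheck {
    // Fails the test if any recorded span was started but never ended,
    // which is exactly what the failure above reports (hasEnded=false).
    static void assertAllEnded(List<SpanData> spans) {
        List<String> notEnded = spans.stream()
            .filter(span -> !span.hasEnded())
            .map(SpanData::spanId)
            .collect(Collectors.toList());
        if (!notEnded.isEmpty()) {
            throw new AssertionError("SpanData validation failed, un-ended spans: " + notEnded);
        }
    }
}
```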

To Reproduce

./gradlew ':server:internalClusterTest' --tests "org.opensearch.discovery.ClusterDisruptionIT" -Dtests.seed=22054582CD9D3248

Expected behavior
The checks must always pass

Plugins
Standard


Host/Environment (please complete the following information):

  • CI


reta added the bug and untriaged labels on Nov 22, 2023
reta (Collaborator, Author) commented Nov 22, 2023

@Gaganjuneja FYI

Gaganjuneja (Contributor) commented:

I will take a look. Thanks!

Gaganjuneja (Contributor) commented:

@mgodwan @rayshrey Need help here. This is flaky for the test case testRestartNodeWhileIndexing and the span name dispatchedShardOperationOnPrimary. How can we make sure that the span is ended when the node restarts?

sarthakaggarwal97 (Contributor) commented:

@Gaganjuneja

It does not look like the test itself is the issue. In my experiments, I observed that the moment I turn off the mapping updates in the test (by keeping the fields constant in the source), the test starts to pass consistently.
The test is hard to reproduce, and I was unable to do so with a seed.

This seems to be a very narrow corner case, where we are unable to close the span when a mapping update happens during a node restart. The span stays in memory without any execution thread associated with it.

There are two options we can follow from here:

  1. Get an API from the request tracing framework that lets us exclude rare cases such as node restarts from the span checks (preferred; a rough sketch of what this could look like follows this list).
  2. Implement a way to drain the spans from the cluster state for such cases.
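For illustration only, option 1 could take a shape like the following; none of these class or method names exist in the framework today, this is just a sketch of a test-only escape hatch keyed on the restarted node's id.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical test-only hook; neither this class nor these methods exist in the tracing framework.
final class SpanValidationExclusions {
    // Node ids whose un-ended spans should be ignored by the validators,
    // e.g. because the test restarted that node on purpose.
    private static final Set<String> RESTARTED_NODES = ConcurrentHashMap.newKeySet();

    static void markNodeRestarted(String nodeId) {
        RESTARTED_NODES.add(nodeId);
    }

    // A validator would consult this with the span's node_id attribute before failing the test.
    static boolean shouldIgnore(String spanNodeId) {
        return RESTARTED_NODES.contains(spanNodeId);
    }
}
```

The obvious risk, raised below, is that such an escape hatch could be used to silence real leaks rather than fix them.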

Gaganjuneja (Contributor) commented:

@sarthakaggarwal97

Get an API from the request tracing framework that lets us exclude rare cases such as node restarts from the span checks (preferred).

We can do that, but it's very hard to make sure it is used cautiously and not exploited just to make the test pass while the underlying memory leak remains. @reta your thoughts here?

reta (Collaborator, Author) commented Feb 5, 2024

Thanks @sarthakaggarwal97

This seems to be a very narrow corner case, where we are unable to close the span when a mapping update happens during a node restart. The span stays in memory without any execution thread associated with it.

This is concerning behaviour: to me, such edge cases point to failure flows that are not instrumented. It would be great to understand why the restart (which should basically be a clean stop/start sequence) keeps some components in a "running" state (unaware of the restart). We should not be looking for shortcuts here.

sarthakaggarwal97 (Contributor) commented Feb 6, 2024

@reta @Gaganjuneja

Let me elaborate on the issue with two span cases, one that succeeded and one that failed. In both cases a mapping update was required. I would like to get your input on how we can proceed.

Happy Case

Consider that during active indexing a new span 834f656ab29fa68d was created. The span was traced through dispatchedShardOperationOnPrimary -> performOnPrimary -> executeBulkItemRequest:

1> [2024-02-05T23:27:53,465][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in dispatchedShardOperationOnPrimary op: 834f656ab29fa68d
1> [2024-02-05T23:27:53,465][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in performOnPrimary 834f656ab29fa68d at doc id144
1> [2024-02-05T23:27:53,465][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in executeBulkItemRequest:834f656ab29fa68d ---> 144

It realizes that the mappings need to be updated:

1> [2024-02-05T23:27:53,465][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in MAPPING_UPDATE_REQUIRED 834f656ab29fa68d

The flow reaches the onResponse of the mapping update:

1> [2024-02-05T23:27:53,506][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in MAPPING_UPDATE_REQUIRED onResponse 834f656ab29fa68d
1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in waitForMappingUpdate onResponse 834f656ab29fa68d

After the mapping update, the executor is attached and executeBulkItemRequest continues:

1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Item Done Listener: 834f656ab29fa68d
  1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Executor invoked: 834f656ab29fa68d
  1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Executor invoked: 834f656ab29fa68d
  1> [2024-02-05T23:27:53,509][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in executeBulkItemRequest:834f656ab29fa68d ---> 144
  1> [2024-02-05T23:27:53,510][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] 834f656ab29fa68d Before notifyOnce
  1> [2024-02-05T23:27:53,510][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] 834f656ab29fa68d listener -->org.opensearch.telemetry.tracing.listener.TraceableActionListener@3d5c9507
  1> [2024-02-05T23:27:53,510][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Before Listener invoked 834f656ab29fa68d --- Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT] --- 
  1> [2024-02-05T23:27:53,510][INFO ][o.o.t.t.l.TraceableActionListener] [node_t1] Closing span in onResponse: 834f656ab29fa68d --- Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,510][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] Span waiting for lock to be ended 834f656ab29fa68d  by Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,510][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] Span 834f656ab29fa68d ended by Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,510][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] 834f656ab29fa68d has endedThread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,511][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Listener invoked 834f656ab29fa68d --- Thread[#246,opensearch[node_t1][write][T#1],5,TGRP-ClusterDisruptionIT] --- interrupted false--- java.base@21.0.2/java.lang.Thread.dumpThreads(Native Method)java.base@21.0.2/java.lang.Thread.getAllStackTraces(Thread.java:2521)app//org.opensearch.action.bulk.TransportShardBulkAction.lambda$performOnPrimary$5(TransportShardBulkAction.java:499)app//org.opensearch.action.bulk.TransportShardBulkAction$$Lambda/0x0000000800b317c8.run(Unknown Source)app//org.opensearch.core.action.ActionListener$5.onResponse(ActionListener.java:275)app//org.opensearch.core.action.ActionListener$7.onResponse(ActionListener.java:324)app//org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:295)app//org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:378)app//org.opensearch.action.bulk.TransportShardBulkAction$2.finishRequest(TransportShardBulkAction.java:573)app//org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:535)app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913)app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)java.base@21.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)java.base@21.0.2/java.lang.Thread.runWith(Thread.java:1596)java.base@21.0.2/java.lang.Thread.run(Thread.java:1583)
  1> [2024-02-05T23:27:53,511][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] 834f656ab29fa68d After notifyOnce
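The "Closing span in onResponse" and "Span ... ended" lines above come from an ActionListener wrapper that ends the span once the response (or failure) arrives. Below is a simplified sketch of that pattern; it is not the actual org.opensearch.telemetry.tracing.listener.TraceableActionListener code, and the interfaces are trimmed down for illustration.

```java
// Trimmed-down stand-ins for the real OpenSearch interfaces, for illustration only.
interface Span {
    void setError(Exception e);
    void endSpan();
}

interface ActionListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

// Wraps the original listener so the span is always ended when a callback fires,
// which is the behaviour visible in the happy-case logs above.
final class SpanEndingListener<T> implements ActionListener<T> {
    private final ActionListener<T> delegate;
    private final Span span;

    SpanEndingListener(ActionListener<T> delegate, Span span) {
        this.delegate = delegate;
        this.span = span;
    }

    @Override
    public void onResponse(T response) {
        try {
            delegate.onResponse(response);
        } finally {
            span.endSpan();
        }
    }

    @Override
    public void onFailure(Exception e) {
        try {
            delegate.onFailure(e);
        } finally {
            span.setError(e);
            span.endSpan();
        }
    }
}
```

In the sad case below, the node restarts before any callback fires, so neither branch runs and the span is never ended.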

Sad Case

Following the span from the exception:

MockSpanData{spanID='f0fcea06161f2245', parentSpanID='16086d76dc781c86', traceID='feef9ff717b884cc', spanName='dispatchedShardOperationOnPrimary', startEpochNanos=24261684302000, endEpochNanos=0, hasEnded=false, attributes={index=restart_while_indexing, refresh_policy=false, thread.name=opensearch[node_t1][write][T#3], bulk_request_items=1, shard_id=0, node_id=Lf0XLkkwScubTL0g8s-q_g}

During active indexing, a new span f0fcea06161f2245 was created. The span was traced through dispatchedShardOperationOnPrimary -> performOnPrimary -> executeBulkItemRequest:

  1> [2024-02-05T23:27:53,469][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in dispatchedShardOperationOnPrimary op: f0fcea06161f2245
  1> [2024-02-05T23:27:53,469][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in performOnPrimary f0fcea06161f2245 at doc id146
  1> [2024-02-05T23:27:53,469][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in executeBulkItemRequest:f0fcea06161f2245 ---> 146

It realizes that the mappings need to be updated:

  1> [2024-02-05T23:27:53,469][INFO ][o.o.a.b.TransportShardBulkAction] [node_t1] Span in MAPPING_UPDATE_REQUIRED f0fcea06161f2245

Before the span can reach the mapping update response, the node is restarted:

  1> [2024-02-05T23:27:53,481][INFO ][o.o.t.InternalTestCluster] [testRestartNodeWhileIndexing] Restarting node [node_t1] 
  1> [2024-02-05T23:27:53,481][INFO ][o.o.t.InternalTestCluster] [testRestartNodeWhileIndexing] adding voting config exclusions [node_t1] prior to restart/shutdown

This span therefore never reaches the mapping update response.

However, the parent span ends after the restart:

1> [2024-02-05T23:27:53,601][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] Span waiting for lock to be ended 16086d76dc781c86  by Thread[#224,opensearch[node_t1][generic][T#2],5,TGRP-ClusterDisruptionIT]
  1> [2024-02-05T23:27:53,601][INFO ][o.o.t.t.t.MockSpan       ] [node_t1] Span 16086d76dc781c86 ended by Thread[#224,opensearch[node_t1][generic][T#2],5,TGRP-ClusterDisruptionIT]
  

Attaching the logs: op.log
Span Trace Instrumentation: https://github.com/sarthakaggarwal97/OpenSearch/tree/test-flaky-tracer

sarthakaggarwal97 (Contributor) commented Feb 6, 2024

@reta @Gaganjuneja

With this, I have a couple of questions as well:

  1. Is it acceptable for the parent span to be closed before its child spans are closed?

It would be great to understand why the restart (which basically should be clean stop / start sequence) keeps some components in "running" state (not aware of restart).

  2. Is there a way to confirm whether the span stays in memory without an executor being attached later on? I'm not sure about the lifecycle of spans during node restarts or start/stop.

Gaganjuneja (Contributor) commented Feb 6, 2024

@sarthakaggarwal97

Is it acceptable for the parent span to be closed before its child spans are closed?

Yes, the parent span can close before its children in async scenarios.

Is there a way to confirm whether the span stays in memory without an executor being attached later on? I'm not sure about the lifecycle of spans during node restarts or start/stop.

A span stays in memory until we call endSpan. Of course, when the node stops (the OpenSearch process stops), spans will be cleared. But the ideal state is that on a graceful shutdown we close all the open spans. This is also a validation that we know how to react to node stop/start with outstanding indexing requests: all open requests/tasks/actions should be closed gracefully and all resources released.
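For illustration, the bookkeeping that keeps an un-ended span "in memory" is essentially a map keyed by span id that is only updated when endSpan runs. A minimal sketch with simplified names; the real implementation is org.opensearch.test.telemetry.tracing.StrictCheckSpanProcessor and may differ in detail.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of start/end bookkeeping; not the actual StrictCheckSpanProcessor code.
final class SketchSpanProcessor {
    record RecordedSpan(String spanId, boolean hasEnded) {}

    private final Map<String, RecordedSpan> spanMap = new ConcurrentHashMap<>();

    void onStart(String spanId) {
        // Every started span is recorded with hasEnded=false ...
        spanMap.put(spanId, new RecordedSpan(spanId, false));
    }

    void onEnd(String spanId) {
        // ... and only flipped to hasEnded=true when endSpan is eventually called.
        spanMap.computeIfPresent(spanId, (id, span) -> new RecordedSpan(id, true));
    }

    // If the node restarts before onEnd runs, the entry keeps hasEnded=false and the
    // AllSpansAreEndedProperly validation fails at the end of the test, as reported above.
}
```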

sarthakaggarwal97 (Contributor) commented:

Yes, the parent span can close before its children in async scenarios.

Ack!

But the ideal state is that on a graceful shutdown we close all the open spans.

Correct, but here's an interesting case.

The mapping update request R1 reaches the cluster-manager node T1, and the node restarts. In this case, or when node T1 fails to rejoin the cluster, the data node will retry the request against the new cluster-manager node T2.
After the node restart, the span of R1 is cleared from the thread context, but it keeps sitting in the spanMap of StrictCheckSpanProcessor.

Unless there is a way to gracefully drain the spans on node restarts in the tests, i.e. remove them from the spanMap for the tests, the validation will keep failing. We need such a mechanism in place to fix this flaky test (a rough sketch of one possible shape is below).
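One hypothetical shape for such a drain hook, strictly test-only; none of these names exist in StrictCheckSpanProcessor today, and how span ids would be mapped to node ids is an open question.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical test-only drain hook; no such method exists in the framework today.
final class SpanDrainSketch {
    // Called right after the test cluster restarts a node, so spans stranded on that node
    // are removed from the bookkeeping map instead of failing the end-of-test validation.
    static <S> void drainSpansForRestartedNode(Map<String, S> spanMap,
                                               Map<String, String> spanIdToNodeId,
                                               String restartedNodeId) {
        spanIdToNodeId.forEach((spanId, nodeId) -> {
            if (Objects.equals(nodeId, restartedNodeId)) {
                spanMap.remove(spanId);
            }
        });
    }
}
```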

reta (Collaborator, Author) commented Feb 6, 2024

@sarthakaggarwal97 thanks for the detailed flow

Before the span can reach the mapping update response, the node is restarted

Aha, so basically what we have here is:

  • the transport action is being executed (MAPPING_UPDATE_REQUIRED)
  • the node is going through a clean shutdown procedure (before restart)
  • the thread pool should be asked to shut down
    • the transport action should be interrupted (with an onFailure callback)?

I will try to play with your reproducer to understand where we have a gap, thank you.
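To make the last bullet concrete, the shutdown path would need to fail whatever is still waiting on the mapping update, so that a span-ending listener (like the sketch in the happy-case comment above) gets its onFailure call. A rough sketch under those assumptions; the class and callback names below are illustrative, not the actual OpenSearch shutdown path.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Rough sketch only: in OpenSearch the real shutdown goes through node close and thread
// pool termination; the names below are illustrative.
final class PendingMappingUpdates {
    // Each pending operation is represented here by its failure callback; in the real code
    // this would be the ActionListener waiting for the mapping-update response.
    private final Queue<Consumer<Exception>> pendingFailures = new ConcurrentLinkedQueue<>();

    void register(Consumer<Exception> onFailure) {
        pendingFailures.add(onFailure);
    }

    // Invoked before the node restarts: every still-pending operation gets a failure callback,
    // which would let the wrapping listener end its span instead of leaving hasEnded=false.
    void failAllOnShutdown() {
        Consumer<Exception> onFailure;
        while ((onFailure = pendingFailures.poll()) != null) {
            onFailure.accept(new IllegalStateException("node is shutting down"));
        }
    }
}
```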
