Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix problems in tracing e2e test #5280

Merged
merged 2 commits into from
Nov 11, 2022

Conversation

tjquinno
Copy link
Member

Resolves #5081

TL;DR:

There are flaws in the test.

  1. The test code incorrectly uses trace ID and span ID at different points as the key for the EVENTS_MAP. It should always be span ID.
  2. The tracers receive reports about 5 spans, not 4 as the EXPECTED_TRACE_EVENTS_COUNT constant says.
  3. The ancestry check does not start with the correct span.

This PR addresses all of these.


Details:

  1. EVENTS_MAP is accessed using inconsistent keys.

    The map holds spans. In some places the test code uses the trace ID as the key to the map (where the tracer adds entries, where the e2e test method removes the span it started explicitly) but in other places it uses the span ID (assertSpanChain, findAndRemoveSpan). These IDs are not interchangeable.

  2. The countdown latch is not initialized correctly.

    In my local runs, 5 spans are reported to the two tracers--3 to the client tracer and 2 to the server tracer. The test uses a countdown latch to coordinate between the span reporter code in the tracers (which counts down the latch each time a span is reported) and the main test. But the latch is initialized to 4, not 5.

  3. The ancestry check should start with the parent of the explicitly-created span (local variable start in the test), not that span itself.

Some temporary debug output I added after each put to the map shows the tracer service name, the remaining count down latch value, the trace ID, the span ID, the parent span ID, and the span name for each of the 5 spans reported:

Service test-client recorded trace # 3: 23bdf1f2e29c791b/span 7e88b005b6288f23/parent cafac57456eb17dc/name null; map size is now 1
Service test-server recorded trace # 2: 23bdf1f2e29c791b/span 81f1c94dbeaa4c13/parent c0f8e7d0db00b2e5/name content-write; map size is now 2
Service test-server recorded trace # 1: 23bdf1f2e29c791b/span c0f8e7d0db00b2e5/parent 0f8ac71b04f7d052/name http request; map size is now 3
Service test-client recorded trace # 0: 23bdf1f2e29c791b/span c0f8e7d0db00b2e5/parent 0f8ac71b04f7d052/name get; map size is now 3
Service test-client recorded trace # 0: 23bdf1f2e29c791b/span 0f8ac71b04f7d052/parent 7e88b005b6288f23/name client-call; map size is now 4

Item 1 means that the map contains at most one entry because the trace ID is the same for all reported spans. After the spans have been reported to the tracers, the main test code removes that single entry. The later assertSpanChain can almost never fail because it almost never has any spans to look at. The "almost" is related to item 2.

item 2 means that the main test thread is allowed to proceed any time after the latch count goes to 0, potentially before the last span has been reported and added to the map. I believe this is the race condition that triggers both types of intermittent failures described in the issue.

During most test runs, the fifth span is reported and added to the map before the main test thread reaches the code after the EVENTS_LATCH.await which uses the EVENTS_MAP. Then the main test thread removes the single entry from the map, there are no entries to check the descendants of, and the test passes.

But sometimes the main test code removes the one map entry, then the tracer records the fifth span and adds it to the map.

  • If that span is added before or during the main test's call to assertSpanChain then the ancestry check fails. This is the failure Joe and I saw.
  • If that fifth span is added to the map very shortly after assertSpanChain has finished, then the remaining assertion in the test which checks for an empty map fails. This is the failure Daniel saw.

As for item 3...

I changed the test code to resolve items 1 and 2. Here is more debug output from the rest of the test that checks descendants after I made those changes:

Removed span 'start' in test with id 0f8ac71b04f7d052
Search & remove of span with parent ID 0f8ac71b04f7d052 found span with ID c0f8e7d0db00b2e5
Recursing with just-removed span c0f8e7d0db00b2e5
Search & remove of span with parent ID c0f8e7d0db00b2e5 found span with ID 81f1c94dbeaa4c13
Recursing with just-removed span 81f1c94dbeaa4c13
Search & remove for span with parent ID 81f1c94dbeaa4c13 found no match
Recursing with just-removed span nope, there was no just-removed span

After the descendants check removes span 81f1c94dbeaa4c13 and recurses, the map is still not empty because of the first span reported to the tracers, the one with the null name that's the parent of the explicitly-started span. So the descendants check tries to find the parent which is not in the map, and the test fails.

Because the parent span of the one the test explicitly creates (the start variable) is reported to the tracer and is therefore added to the map, the main test should start the descendants check with the parent of start instead of with start itself.

Signed-off-by: tim.quinn@oracle.com <tim.quinn@oracle.com>
@tjquinno tjquinno self-assigned this Oct 28, 2022
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Oct 28, 2022
Copy link
Member

@tomas-langer tomas-langer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving as we need to stop having intermittent failures.
Tracing will need to be revisited as a whole once we upgrade to latest MP

@tomas-langer tomas-langer merged commit da72641 into helidon-io:main Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants