Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplementing log correlation #2428

Merged
merged 46 commits into from
Mar 15, 2022
Merged

Conversation

eyalkoren
Copy link
Contributor

@eyalkoren eyalkoren commented Jan 27, 2022

What does this PR do?

Closes #2412
Closes #2374
Closes #2369
Closes #2387

Implementing a new approach for log correlation (which is about adding active trace.id, transaction.id and error.id to log events automatically). The current mechanism makes MDC updates when trace events are being activated or deactivated. This is efficient, but there are cases where the logging library is not available within the context of event activation, or there is access to the wrong one.
Instead, the new approach is to do JIT-MDC-updates, within the logging library, based on the active tracing events, through instrumentation.

General concept

Conceptually, the most robust way is to add the tracing IDs to the MDC right before the logging event is created and remove it right after for the following reasons:

  • current state is guaranteed to be correct
  • the relevant MDC is guaranteed to be available at this context
  • even if the event is logged with delay on a different thread, the logging framework will make sure to capture the proper status when creating it

Related disadvantages are:

  • logging events are not part of the logging API, so their creation may be done in multiple ways and using unstable internal APIs
  • there may be a performance hit in comparison to adding once and removing once per tracing event activation/deactivation

By instrumenting the public logging API themselves, we can benefit from all listed advantages and avoid the first disadvantage.
As for performance- since we use the actual logging framework best practice to decorate logging events with thread-contextual data, we can rely on optimizations done by the framework, so if there is a performance hit, it should be minimal. However, below there are some framework-specific pointers in case we want to optimize.

Prior research (documenting for later reference)

The current approach is to instrument public logging APIs. If this proves to be too impactful in terms of performance, following are some library-specific options for optimization.

log4j2

Trying to capture the LoggingEvent creation is not trivial because of API instability, multiple creation options (constructor, static factory method and builder at least) and because events may be reusable.
One option is to try and instrument at least two org.apache.logging.log4j.core.impl.ContextDataFactory#createContextData() overloads, but it can get complicated as the returned StringMaps may be either mutable or immutable (see org.apache.logging.log4j.util.StringMap#freeze()).
It is difficult to find a single method that is also stable across versions through which all logging APIs go through, and not sure this would improve anything.

logback

The Logback MDC implementation tracks map updates and reads, and creates a copy of the map every time there is an update after a read (see ch.qos.logback.classic.util.LogbackMDCAdapter). This means that our approach will incur more map copy actions.
The logging event's immutable context map gets populated either when the event is logged, or when it is decided to defer its logging. ch.qos.logback.classic.spi.LoggingEvent#getMDCPropertyMap() may be a good instrumentation target as it seems stable across versions I believe it returns a mutable collection.
By doing so we can avoid updates of the MDC-underlying map.

To-do list

  • Remove apm-log-correlation-plugin
  • Merge apm-error-logging-plugin into one plugin with the shading plugin
  • Rename the shading plugin to something general (eg logging-integration)
  • Rearrange packages within the unified plugin, so that ECS-reformatting, correlation and error-capturing is separated
  • Get rid of co.elastic.apm.agent.impl.transaction.TraceContext#applicationClassLoader
  • Make nesting check global so that if two logging frameworks are used at the same time, we only update the MDC once
  • Make sure we don't do MDC updates when the the logging API is called with an insufficient level (meaning - the event won't be logged)
  • Test Error creation and correlation for log4j2's fatal and catching APIs, and log4j1's fatal API
  • Check whether Fix JUL ECS logging no longer converting plain text logs to JSON #2387 is related and resolved by this PR
  • Add trace metadata to agent log if log_format_file=JSON - will be handled through Add trace correlation to agent logs #2495
  • Make log correlation on by default
  • Remove the enable_log_correlation config

Checklist

  • I have updated CHANGELOG.asciidoc
  • I have added tests that prove my fix is effective on log4j2 (oldest and newest supported versions)
  • I have added tests that prove my fix is effective on logback (oldest and newest supported versions)
  • I have added tests that prove my fix is effective on log4j1
  • I have added tests for JBoss logging framework
  • I have added tests for slf4j
  • I added tests for log correlation within agent logs - will be handled through Add trace correlation to agent logs #2495
  • I have made corresponding changes to the documentation - error event capturing for log4j1 was, no direct support for slf4j log correlation anymore

@cla-checker-service
Copy link

cla-checker-service bot commented Jan 27, 2022

💚 CLA has been signed

@apmmachine
Copy link
Contributor

apmmachine commented Jan 27, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-03-15T15:57:39.555+0000

  • Duration: 49 min 36 sec

Test stats 🧪

Test Results
Failed 0
Passed 2745
Skipped 20
Total 2765

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • run benchmark tests : Run the benchmark tests.

  • run jdk compatibility tests : Run the JDK Compatibility tests.

  • run integration tests : Run the Agent Integration tests.

  • run end-to-end tests : Run the APM-ITs.

  • run windows tests : Build & tests on windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@eyalkoren eyalkoren force-pushed the log-correlation-remake branch from 6cb20fe to 6e2ae2b Compare January 27, 2022 16:00
@eyalkoren
Copy link
Contributor Author

run integration tests

@felixbarny
Copy link
Member

The current mechanism makes MDC updates when trace events are being activated or deactivated. This is efficient […]
there may be a performance hit in comparison to adding once and removing once per tracing event activation/deactivation

The current mechanism is not more efficient in all cases. In fact, it’s often less efficient. Generally, the more the application logs, the more efficient the current strategy with the MdcActivationListener is. The less an application logs, the more efficient it is to only add the correlation ids on demand. An extreme case would be a bad n+1 query that does 10k calls to Redis. With the current mechanism, the hook would be executed for every span activation/deactivation, even if there are no log events in the scope of these spans.

Due to that, I think we can even enable log correlation by default now and maybe also add the span.id to the MDC. That’s because with the new mechanism, the Strings for the correlation ids only get allocated on demand - if and when they’re actually needed.

Also, I think adding to ThreadContext before each log is not significantly more or less efficient than using ContextDataInjector or ContextDataProvider as they also require allocating a new Map/StringMap for each enabled log event.


However, I’d like to challenge the approach to instrument all logger methods, such as trace, debug, info, etc.
The biggest concern I have is around performance and the ability for these methods to get inlined. Specifically the methods with vararg parameters (such as org.slf4j.Logger#info(java.lang.String, java.lang.Object…)) rely on being inlined. Also, we’ll have to check for re-entrancy and whether the log level is enabled. Maybe not a huge cost but it may add up, especially for trace logs. Matching and instrumenting a lot of methods is also more expensive.

I think there are good alternative methods we can instrument that are executed after checking whether the log level is enabled. It looks like these methods are always executed and are within a few frames of the logger method calls:

  • Logback: ch.qos.logback.classic.Logger#callAppenders
  • Log4j2:
    • org.apache.logging.log4j.core.impl.LogEventFactory#createEvent
    • org.apache.logging.log4j.core.impl.LocationAwareLogEventFactory#createEvent
  • Log4j1: org.apache.log4j.Category::callAppenders
  • JUL: java.util.logging.Logger#log(java.util.logging.LogRecord)

@tobiasstadler
Copy link
Contributor

I just wanted to mention that log correlation is (still) working for me with this changes.

@eyalkoren
Copy link
Contributor Author

eyalkoren commented Jan 31, 2022

Also, I think adding to ThreadContext before each log is not significantly more or less efficient than using ContextDataInjector or ContextDataProvider as they also require allocating a new Map/StringMap for each enabled log event.

That's absolutely right, but it's different when it's the user's choice to add this overhead. Either way, I don't think this aspect of overhead would be an issue, it's more points to consider and possibly optimize.

However, I’d like to challenge the approach to instrument all logger methods, such as trace, debug, info, etc.

No need to challenge, the description of this PR explains why event creation is ideal in terms of JIT-timing. I instrumented the API in order to test the general approach and overcome the concerns described above. It's still in draft state because we are still exploring our options.
There is considerable value in simplicity, and the fact that very simple instrumentation was sufficient to demonstrate support for these three logging libraries has its value. That is not to say that simplicity outweighs performance, the opposite is true, but we should give up the simplest solution only if there is indeed an overhead concern, and I am not absolutely sure this would be the case eventually.

One big advantage of instrumenting based on the APIs, is the instrumentation of slf4j.

@eyalkoren
Copy link
Contributor Author

I just came back to this for a brief review, adding comments for discussion: I do like the simplicity and robustness (e.g. stability across versions and independency of actual implementation) of the API-based instrumentation, but I agree overhead is a major factor, so let's review the separate concerns:

The biggest concern I have is around performance and the ability for these methods to get inlined.

That's a valid concern indeed. Since our advices are not inlined anymore, we add less bytecode today. It's hard to say what the effect would be, but we can do some rough estimation. I just compared ch.qos.logback.classic.Logger#info(java.lang.String, java.lang.Object...) before and after and the increase is from 9 bytecode instructions to 33. From what I could find, the default value for -XX:FreqInlineSize for Linux 64-bit is 325, so I think we should be OK. As far as I can see in log4j2 and Logback, these API methods I instrumented are not nested, so this bytecode is added once within each code path, even when including slf4j.

Also, we’ll have to check for re-entrancy ...

Reentrance guard is already implemented in the proposed AbstractLogCorrelationHelper.

... and whether the log level is enabled.

Whether log level is enabled is nice to test, although adding to thread-local maps is quite cheap, so not sure it worth the effort. If we decide it is, then we can use the related APIs for that.

Matching and instrumenting a lot of methods is also more expensive.

True, but we apply method matching only for a limited number of classes. From what I could see, even if it's 10X comparing to single-method matching, it's still a matter of few milliseconds for the entire library.

@eyalkoren
Copy link
Contributor Author

Summarizing offline discussion:

  • We decided not to provide a generic support for slf4j for now, and instead support the underlining frameworks only. This allows us to do a more performant and framework-specific instrumentation, like @felixbarny suggested above
  • We will add support for JULI in a different PR
  • We want to add ECS-logging support and then reformatting and log correlation to JBoss logging - always used as a facade for the basic logging implementation, for MDC it either delegates to log4j2 (ThreadContext), or implements internally (when used with JULI). This means an issue in ECS-logging (maybe only testing) and an issue for reformatting and log correlation within the agent.

@eyalkoren eyalkoren force-pushed the log-correlation-remake branch from 562f72c to a806af3 Compare February 16, 2022 06:36
@eyalkoren
Copy link
Contributor Author

run integration tests

@tobiasstadler
Copy link
Contributor

Can you make a PR on this branch?

See eyalkoren#5

@tobiasstadler
Copy link
Contributor

tobiasstadler commented Mar 8, 2022

Can you explain what is not working?

JBoss LogManger is a logging implementation bases on JUL. Since there is no support for JUL at the moment, JBoss LogManager needs its one implementation for log correlation.

@eyalkoren
Copy link
Contributor Author

eyalkoren commented Mar 8, 2022

JBoss LogManger is a logging implementation bases on JUL. Since there is no support for JUL at the moment, JBoss LogManager needs its one implementation for log correlation.

Hmm, I assumed a logger is obtained through Logger.getLogger(), in which case the LogManager is covered by the instrumentation of org.jboss.logging.JBossLogManagerLogger, but this doesn't cover LogManager.getLogManager().getLogger()...
Is that only valid to older versions? If so- how old?

@SylvainJuge SylvainJuge added the await-release Mark issues that depend on next release, or PRs that are planned to be included label Mar 14, 2022
Copy link
Member

@SylvainJuge SylvainJuge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finished the review yet, I only have a few comments/questions so far.
That's definitely an impressive refactor and the end result is definitely worth the effort as everything now looks better and easier to extend and maintain.

Some tests are disabled with @Disabled then executed only through Junit4 integration tests, I wonder if we could have a better option for this in the future as 1) it may result in having lots of tests that appear to be disabled in CI but are not in practice as they are also integration tests, and 2) it creates a bit of duplication so having a generic Junit4-wrapper would be nice if such thing exists.

@@ -44,6 +44,8 @@

private static final Logger logger = LoggerFactory.getLogger(ErrorCapture.class);

private static final ThreadLocal<ErrorCapture> activeError = new ThreadLocal<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] Would it be relevant to use co.elastic.apm.agent.sdk.weakconcurrent.WeakConcurrent#buildThreadLocal instead of a plain ThreadLocal ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You reminded me that this one made me think I need to take another look into our thread local usages in general.
In this specific case, I don't think it is required, as it is set and removed atomically within a single method execution. Once a method bytecode starts to execute, I don't think any of its dependencies may change before it ends.
But it may be something we want to set general rules for and make sure to follow all across.

private final Tracer tracer = GlobalTracer.get();

@Override
protected boolean addToMdc() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] could be made private as no sub-class is overriding addToMdc or removeFromMdc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It implements a protected abstract method that needs to have protected access at least.

Comment on lines -94 to -96
public ElementMatcher.Junction<ClassLoader> getClassLoaderMatcher() {
return not(ofType(nameStartsWith("co.elastic.apm.")));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] do you have further details on why this exclusion was previously required ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was required before and now as well for log4j2, because we load it within the agent. However, while it didn't interfere with tests of other frameworks before, it does now, so I removed it from the abstract and implemented it specifically for log4j2, where we test with TestClassWithDependencyRunner, which uses co.elastic.test.ChildFirstURLClassLoader especially to escape these matchers.

@eyalkoren
Copy link
Contributor Author

Some tests are disabled with @disabled then executed only through Junit4 integration tests, I wonder if we could have a better option for this in the future as 1) it may result in having lots of tests that appear to be disabled in CI but are not in practice as they are also integration tests, and 2) it creates a bit of duplication so having a generic Junit4-wrapper would be nice if such thing exists.

Yes, we have to improve on this front, but I think it's out of the scope of this PR

Copy link
Member

@SylvainJuge SylvainJuge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this is really nice and I only have a few minor comments.

Another minor thing that might be nice to add if you have time is to make sure to properly document n javadoc which methods are instrumented in *Instrumentation classes to make it easier to navigate to the instrumented classes.

}

@Test
void testShadePathComputation() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] seems a good candidate for changing the computeReformattedLogFilePathWithConfiguredDir entry point and test to use Path instead of a raw file path in a String that requires to rewrite path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get this one. IIUC it's only something to change within the test, so I'll leave as is.
Let me know if you meant something within the actual implementation that you think should be changed.

@eyalkoren eyalkoren enabled auto-merge (squash) March 15, 2022 15:58
@eyalkoren eyalkoren merged commit c8e7936 into elastic:main Mar 15, 2022
@eyalkoren eyalkoren deleted the log-correlation-remake branch March 16, 2022 05:32
@efdknittlfrank
Copy link

Awesome, looking forward to the next release of the agent! Thank you for all your incredible work ❤️

@SylvainJuge SylvainJuge removed the await-release Mark issues that depend on next release, or PRs that are planned to be included label May 5, 2022
@alejavigo
Copy link

Maybe this change in 1.30.0 has to do with my problem here? :
#3563

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants