Reimplementing log correlation #2428

eyalkoren · 2022-01-27T15:00:37Z

What does this PR do?

Closes #2412
Closes #2374
Closes #2369
Closes #2387

Implementing a new approach for log correlation (which is about adding active trace.id, transaction.id and error.id to log events automatically). The current mechanism makes MDC updates when trace events are being activated or deactivated. This is efficient, but there are cases where the logging library is not available within the context of event activation, or there is access to the wrong one.
Instead, the new approach is to do JIT-MDC-updates, within the logging library, based on the active tracing events, through instrumentation.

General concept

Conceptually, the most robust way is to add the tracing IDs to the MDC right before the logging event is created and remove it right after for the following reasons:

current state is guaranteed to be correct
the relevant MDC is guaranteed to be available at this context
even if the event is logged with delay on a different thread, the logging framework will make sure to capture the proper status when creating it

Related disadvantages are:

logging events are not part of the logging API, so their creation may be done in multiple ways and using unstable internal APIs
there may be a performance hit in comparison to adding once and removing once per tracing event activation/deactivation

By instrumenting the public logging API themselves, we can benefit from all listed advantages and avoid the first disadvantage.
As for performance- since we use the actual logging framework best practice to decorate logging events with thread-contextual data, we can rely on optimizations done by the framework, so if there is a performance hit, it should be minimal. However, below there are some framework-specific pointers in case we want to optimize.

Prior research (documenting for later reference)

The current approach is to instrument public logging APIs. If this proves to be too impactful in terms of performance, following are some library-specific options for optimization.

log4j2

Trying to capture the LoggingEvent creation is not trivial because of API instability, multiple creation options (constructor, static factory method and builder at least) and because events may be reusable.
One option is to try and instrument at least two org.apache.logging.log4j.core.impl.ContextDataFactory#createContextData() overloads, but it can get complicated as the returned StringMaps may be either mutable or immutable (see org.apache.logging.log4j.util.StringMap#freeze()).
It is difficult to find a single method that is also stable across versions through which all logging APIs go through, and not sure this would improve anything.

logback

The Logback MDC implementation tracks map updates and reads, and creates a copy of the map every time there is an update after a read (see ch.qos.logback.classic.util.LogbackMDCAdapter). This means that our approach will incur more map copy actions.
The logging event's immutable context map gets populated either when the event is logged, or when it is decided to defer its logging. ch.qos.logback.classic.spi.LoggingEvent#getMDCPropertyMap() may be a good instrumentation target as it seems stable across versions I believe it returns a mutable collection.
By doing so we can avoid updates of the MDC-underlying map.

To-do list

Checklist

I have updated CHANGELOG.asciidoc
I have added tests that prove my fix is effective on log4j2 (oldest and newest supported versions)
I have added tests that prove my fix is effective on logback (oldest and newest supported versions)
I have added tests that prove my fix is effective on log4j1
I have added tests for JBoss logging framework
I have added tests for slf4j
~~I added tests for log correlation within agent logs~~ - will be handled through Add trace correlation to agent logs #2495
I have made corresponding changes to the documentation - error event capturing for log4j1 was, no direct support for slf4j log correlation anymore

cla-checker-service · 2022-01-27T15:00:41Z

💚 CLA has been signed

apmmachine · 2022-01-27T15:39:02Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2022-03-15T15:57:39.555+0000
Duration: 49 min 36 sec

Test stats 🧪

Test	Results
Failed	0
Passed	2745
Skipped	20
Total	2765

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
run benchmark tests : Run the benchmark tests.
run jdk compatibility tests : Run the JDK Compatibility tests.
run integration tests : Run the Agent Integration tests.
run end-to-end tests : Run the APM-ITs.
run windows tests : Build & tests on windows.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

...-plugin/src/main/java/co/elastic/apm/agent/log4j2/Log4j2TraceCorrelationInstrumentation.java

eyalkoren · 2022-01-27T16:02:17Z

run integration tests

felixbarny · 2022-01-28T07:33:44Z

The current mechanism makes MDC updates when trace events are being activated or deactivated. This is efficient […]
there may be a performance hit in comparison to adding once and removing once per tracing event activation/deactivation

The current mechanism is not more efficient in all cases. In fact, it’s often less efficient. Generally, the more the application logs, the more efficient the current strategy with the MdcActivationListener is. The less an application logs, the more efficient it is to only add the correlation ids on demand. An extreme case would be a bad n+1 query that does 10k calls to Redis. With the current mechanism, the hook would be executed for every span activation/deactivation, even if there are no log events in the scope of these spans.

Due to that, I think we can even enable log correlation by default now and maybe also add the span.id to the MDC. That’s because with the new mechanism, the Strings for the correlation ids only get allocated on demand - if and when they’re actually needed.

Also, I think adding to ThreadContext before each log is not significantly more or less efficient than using ContextDataInjector or ContextDataProvider as they also require allocating a new Map/StringMap for each enabled log event.

However, I’d like to challenge the approach to instrument all logger methods, such as trace, debug, info, etc.
The biggest concern I have is around performance and the ability for these methods to get inlined. Specifically the methods with vararg parameters (such as org.slf4j.Logger#info(java.lang.String, java.lang.Object…)) rely on being inlined. Also, we’ll have to check for re-entrancy and whether the log level is enabled. Maybe not a huge cost but it may add up, especially for trace logs. Matching and instrumenting a lot of methods is also more expensive.

I think there are good alternative methods we can instrument that are executed after checking whether the log level is enabled. It looks like these methods are always executed and are within a few frames of the logger method calls:

Logback: ch.qos.logback.classic.Logger#callAppenders
Log4j2:
- org.apache.logging.log4j.core.impl.LogEventFactory#createEvent
- org.apache.logging.log4j.core.impl.LocationAwareLogEventFactory#createEvent
Log4j1: org.apache.log4j.Category::callAppenders
JUL: java.util.logging.Logger#log(java.util.logging.LogRecord)

tobiasstadler · 2022-01-28T08:33:22Z

I just wanted to mention that log correlation is (still) working for me with this changes.

eyalkoren · 2022-01-31T15:16:23Z

Also, I think adding to ThreadContext before each log is not significantly more or less efficient than using ContextDataInjector or ContextDataProvider as they also require allocating a new Map/StringMap for each enabled log event.

That's absolutely right, but it's different when it's the user's choice to add this overhead. Either way, I don't think this aspect of overhead would be an issue, it's more points to consider and possibly optimize.

However, I’d like to challenge the approach to instrument all logger methods, such as trace, debug, info, etc.

No need to challenge, the description of this PR explains why event creation is ideal in terms of JIT-timing. I instrumented the API in order to test the general approach and overcome the concerns described above. It's still in draft state because we are still exploring our options.
There is considerable value in simplicity, and the fact that very simple instrumentation was sufficient to demonstrate support for these three logging libraries has its value. That is not to say that simplicity outweighs performance, the opposite is true, but we should give up the simplest solution only if there is indeed an overhead concern, and I am not absolutely sure this would be the case eventually.

One big advantage of instrumenting based on the APIs, is the instrumentation of slf4j.

eyalkoren · 2022-02-13T15:34:34Z

I just came back to this for a brief review, adding comments for discussion: I do like the simplicity and robustness (e.g. stability across versions and independency of actual implementation) of the API-based instrumentation, but I agree overhead is a major factor, so let's review the separate concerns:

The biggest concern I have is around performance and the ability for these methods to get inlined.

That's a valid concern indeed. Since our advices are not inlined anymore, we add less bytecode today. It's hard to say what the effect would be, but we can do some rough estimation. I just compared ch.qos.logback.classic.Logger#info(java.lang.String, java.lang.Object...) before and after and the increase is from 9 bytecode instructions to 33. From what I could find, the default value for -XX:FreqInlineSize for Linux 64-bit is 325, so I think we should be OK. As far as I can see in log4j2 and Logback, these API methods I instrumented are not nested, so this bytecode is added once within each code path, even when including slf4j.

Also, we’ll have to check for re-entrancy ...

Reentrance guard is already implemented in the proposed AbstractLogCorrelationHelper.

... and whether the log level is enabled.

Whether log level is enabled is nice to test, although adding to thread-local maps is quite cheap, so not sure it worth the effort. If we decide it is, then we can use the related APIs for that.

Matching and instrumenting a lot of methods is also more expensive.

True, but we apply method matching only for a limited number of classes. From what I could see, even if it's 10X comparing to single-method matching, it's still a matter of few milliseconds for the entire library.

eyalkoren · 2022-02-14T08:44:05Z

Summarizing offline discussion:

We decided not to provide a generic support for slf4j for now, and instead support the underlining frameworks only. This allows us to do a more performant and framework-specific instrumentation, like @felixbarny suggested above
We will add support for JULI in a different PR
We want to add ECS-logging support and then reformatting and log correlation to JBoss logging - always used as a facade for the basic logging implementation, for MDC it either delegates to log4j2 (ThreadContext), or implements internally (when used with JULI). This means an issue in ECS-logging (maybe only testing) and an issue for reformatting and log correlation within the agent.

…stions' into log-correlation-remake

eyalkoren · 2022-02-22T12:55:39Z

run integration tests

...lugin-common/src/main/java/co/elastic/apm/agent/log/shader/AbstractLogCorrelationHelper.java

tobiasstadler · 2022-03-08T15:41:03Z

Can you make a PR on this branch?

See eyalkoren#5

tobiasstadler · 2022-03-08T15:42:47Z

Can you explain what is not working?

JBoss LogManger is a logging implementation bases on JUL. Since there is no support for JUL at the moment, JBoss LogManager needs its one implementation for log correlation.

eyalkoren · 2022-03-08T16:17:42Z

JBoss LogManger is a logging implementation bases on JUL. Since there is no support for JUL at the moment, JBoss LogManager needs its one implementation for log correlation.

Hmm, I assumed a logger is obtained through Logger.getLogger(), in which case the LogManager is covered by the instrumentation of org.jboss.logging.JBossLogManagerLogger, but this doesn't cover LogManager.getLogManager().getLogger()...
Is that only valid to older versions? If so- how old?

...j2-plugin/src/main/java/co/elastic/apm/agent/log4j2/correlation/CorrelationIdMapAdapter.java

…yalkoren

…tion Log correlation for JBoss LogManager

SylvainJuge

I haven't finished the review yet, I only have a few comments/questions so far.
That's definitely an impressive refactor and the end result is definitely worth the effort as everything now looks better and easier to extend and maintain.

Some tests are disabled with @Disabled then executed only through Junit4 integration tests, I wonder if we could have a better option for this in the future as 1) it may result in having lots of tests that appear to be disabled in CI but are not in practice as they are also integration tests, and 2) it creates a bit of duplication so having a generic Junit4-wrapper would be nice if such thing exists.

apm-agent-core/src/main/java/co/elastic/apm/agent/impl/ActivationListener.java

SylvainJuge · 2022-03-14T15:13:23Z

apm-agent-core/src/main/java/co/elastic/apm/agent/impl/error/ErrorCapture.java

@@ -44,6 +44,8 @@

    private static final Logger logger = LoggerFactory.getLogger(ErrorCapture.class);

+    private static final ThreadLocal<ErrorCapture> activeError = new ThreadLocal<>();


[question] Would it be relevant to use co.elastic.apm.agent.sdk.weakconcurrent.WeakConcurrent#buildThreadLocal instead of a plain ThreadLocal ?

You reminded me that this one made me think I need to take another look into our thread local usages in general.
In this specific case, I don't think it is required, as it is set and removed atomically within a single method execution. Once a method bytecode starts to execute, I don't think any of its dependencies may change before it ends.
But it may be something we want to set general rules for and make sure to follow all across.

apm-agent-core/src/main/java/co/elastic/apm/agent/logging/LoggingConfiguration.java

SylvainJuge · 2022-03-14T15:42:21Z

...on/src/main/java/co/elastic/apm/agent/loginstr/correlation/AbstractLogCorrelationHelper.java

+        private final Tracer tracer = GlobalTracer.get();
+
+        @Override
+        protected boolean addToMdc() {


[minor] could be made private as no sub-class is overriding addToMdc or removeFromMdc.

It implements a protected abstract method that needs to have protected access at least.

SylvainJuge · 2022-03-14T15:47:31Z

...in/java/co/elastic/apm/agent/loginstr/error/AbstractLoggerErrorCapturingInstrumentation.java

-    public ElementMatcher.Junction<ClassLoader> getClassLoaderMatcher() {
-        return not(ofType(nameStartsWith("co.elastic.apm.")));
-    }


[question] do you have further details on why this exclusion was previously required ?

It was required before and now as well for log4j2, because we load it within the agent. However, while it didn't interfere with tests of other frameworks before, it does now, so I removed it from the abstract and implemented it specifically for log4j2, where we test with TestClassWithDependencyRunner, which uses co.elastic.test.ChildFirstURLClassLoader especially to escape these matchers.

...pm-logging-plugin-common/src/main/java/co/elastic/apm/agent/loginstr/reformatting/Utils.java

...ng-plugin-common/src/test/java/co/elastic/apm/agent/loginstr/LoggingInstrumentationTest.java

…log-correlation-remake

eyalkoren · 2022-03-15T12:41:05Z

Some tests are disabled with @disabled then executed only through Junit4 integration tests, I wonder if we could have a better option for this in the future as 1) it may result in having lots of tests that appear to be disabled in CI but are not in practice as they are also integration tests, and 2) it creates a bit of duplication so having a generic Junit4-wrapper would be nice if such thing exists.

Yes, we have to improve on this front, but I think it's out of the scope of this PR

SylvainJuge

LGTM, this is really nice and I only have a few minor comments.

Another minor thing that might be nice to add if you have time is to make sure to properly document n javadoc which methods are instrumented in *Instrumentation classes to make it easier to navigate to the instrumented classes.

...c/main/java/co/elastic/apm/agent/log4j2/error/Log4j2LoggerErrorCapturingInstrumentation.java

...lugin/src/test/java/co/elastic/apm/agent/log4j2/correlation/CorrelationIdMapAdapterTest.java

SylvainJuge · 2022-03-15T13:09:44Z

...ogging-plugin-common/src/test/java/co/elastic/apm/agent/loginstr/reformatting/UtilsTest.java

+    }
+
+    @Test
+    void testShadePathComputation() {


[minor] seems a good candidate for changing the computeReformattedLogFilePathWithConfiguredDir entry point and test to use Path instead of a raw file path in a String that requires to rewrite path.

I didn't get this one. IIUC it's only something to change within the test, so I'll leave as is.
Let me know if you meant something within the actual implementation that you think should be changed.

...ng-plugin-common/src/test/java/co/elastic/apm/agent/loginstr/LoggingInstrumentationTest.java

efdknittlfrank · 2022-03-17T07:49:47Z

Awesome, looking forward to the next release of the agent! Thank you for all your incredible work ❤️

alejavigo · 2024-03-14T18:50:40Z

Maybe this change in 1.30.0 has to do with my problem here? :
#3563

github-actions bot added the agent-java label Jan 27, 2022

felixbarny reviewed Jan 27, 2022

View reviewed changes

...-plugin/src/main/java/co/elastic/apm/agent/log4j2/Log4j2TraceCorrelationInstrumentation.java Outdated Show resolved Hide resolved

Reimplementing log correlation

6e2ae2b

eyalkoren force-pushed the log-correlation-remake branch from 6cb20fe to 6e2ae2b Compare January 27, 2022 16:00

felixbarny added 2 commits January 28, 2022 09:36

Add TraceIdentifierMapAdapter for optimizing Log4j2 correlation

c7fdd4e

Avoid instrumenting direct logger methods

c4174a3

felixbarny mentioned this pull request Jan 28, 2022

Log correlation remake suggestions eyalkoren/apm-agent-java#4

Merged

SylvainJuge added the size:large Large (L) tasks label Jan 31, 2022

eyalkoren mentioned this pull request Feb 6, 2022

Support for setting the application class loader for a transaction via the public api #2444

Closed

5 tasks

eyalkoren added 2 commits February 16, 2022 08:17

Merge remote-tracking branch 'upstream/main' into log-correlation-remake

0ba26ff

Updating configuration.asciidoc

a806af3

eyalkoren force-pushed the log-correlation-remake branch from 562f72c to a806af3 Compare February 16, 2022 06:36

eyalkoren added 3 commits February 21, 2022 16:24

Merge remote-tracking branch 'felixbarny/log-correlation-remake-sugge…

08870c8

…stions' into log-correlation-remake

Incorporating @felixbarny suggestions

470ee5e

Merge remote-tracking branch 'upstream/main' into log-correlation-remake

de62239

felixbarny reviewed Feb 22, 2022

View reviewed changes

...lugin-common/src/main/java/co/elastic/apm/agent/log/shader/AbstractLogCorrelationHelper.java Outdated Show resolved Hide resolved

...lugin-common/src/main/java/co/elastic/apm/agent/log/shader/AbstractLogCorrelationHelper.java Outdated Show resolved Hide resolved

eyalkoren added 4 commits February 22, 2022 15:56

Remove unnecessary return value

2c279d1

Adding logging correlation to JBoss logger with JULI

efe5c17

Replace ErrorCapture activation mechanism

2f05ba9

Updating configuration.asciidoc

71fb87b

Added log correlation for JBoss LogManager

fd540a5

Relying on activeSpan rather than currentTransaction

060a178

felixbarny reviewed Mar 9, 2022

View reviewed changes

...j2-plugin/src/main/java/co/elastic/apm/agent/log4j2/correlation/CorrelationIdMapAdapter.java Outdated Show resolved Hide resolved

SylvainJuge mentioned this pull request Mar 12, 2022

Fix classloader for cross-plugin dependencies #2509

Merged

4 tasks

eyalkoren added 3 commits March 14, 2022 08:43

Fix and test log4j2 transaction ID setting

ddf6100

Removing enable_log_correlation and adjusting docs

d0a6997

log4j1 updates on supported techs doc

a3319d0

SylvainJuge added the await-release Mark issues that depend on next release, or PRs that are planned to be included label Mar 14, 2022

tobiasstadler and others added 2 commits March 14, 2022 16:24

Combined jboss-logging and jboss-logmanager plugins as requested by e…

2eea707

…yalkoren

Merge pull request #5 from tobiasstadler/jboss-logmanager-log-correla…

f873595

…tion Log correlation for JBoss LogManager

SylvainJuge reviewed Mar 14, 2022

View reviewed changes

eyalkoren added 2 commits March 15, 2022 14:37

Applying review suggestions

e485576

Merge remote-tracking branch 'eyalkoren/log-correlation-remake' into …

e3cc8bd

…log-correlation-remake

SylvainJuge approved these changes Mar 15, 2022

View reviewed changes

eyalkoren added 5 commits March 15, 2022 15:28

Merge remote-tracking branch 'upstream/main' into log-correlation-remake

f38fa32

Removing stale filtering in PackagingTest

3387cfc

Merge remote-tracking branch 'upstream/main' into log-correlation-remake

7015de9

Applying review suggestions

9eeba04

Adding to CHANGELOG

3a22078

eyalkoren enabled auto-merge (squash) March 15, 2022 15:58

eyalkoren merged commit c8e7936 into elastic:main Mar 15, 2022

eyalkoren deleted the log-correlation-remake branch March 16, 2022 05:32

SylvainJuge removed the await-release Mark issues that depend on next release, or PRs that are planned to be included label May 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reimplementing log correlation #2428

Reimplementing log correlation #2428

eyalkoren commented Jan 27, 2022 •

edited by zube bot

Loading

cla-checker-service bot commented Jan 27, 2022 •

edited

Loading

apmmachine commented Jan 27, 2022 •

edited

Loading

Build stats

Test stats 🧪

eyalkoren commented Jan 27, 2022

felixbarny commented Jan 28, 2022

tobiasstadler commented Jan 28, 2022

eyalkoren commented Jan 31, 2022 •

edited

Loading

eyalkoren commented Feb 13, 2022

eyalkoren commented Feb 14, 2022

eyalkoren commented Feb 22, 2022

tobiasstadler commented Mar 8, 2022

tobiasstadler commented Mar 8, 2022 •

edited

Loading

eyalkoren commented Mar 8, 2022 •

edited

Loading

SylvainJuge left a comment

SylvainJuge Mar 14, 2022

eyalkoren Mar 15, 2022

SylvainJuge Mar 14, 2022

eyalkoren Mar 15, 2022

SylvainJuge Mar 14, 2022

eyalkoren Mar 15, 2022

eyalkoren commented Mar 15, 2022

SylvainJuge left a comment

SylvainJuge Mar 15, 2022

eyalkoren Mar 15, 2022

efdknittlfrank commented Mar 17, 2022

alejavigo commented Mar 14, 2024

		@@ -44,6 +44,8 @@

		private static final Logger logger = LoggerFactory.getLogger(ErrorCapture.class);

		private static final ThreadLocal<ErrorCapture> activeError = new ThreadLocal<>();

Reimplementing log correlation #2428

Reimplementing log correlation #2428

Conversation

eyalkoren commented Jan 27, 2022 • edited by zube bot Loading

What does this PR do?

General concept

Prior research (documenting for later reference)

log4j2

logback

To-do list

Checklist

cla-checker-service bot commented Jan 27, 2022 • edited Loading

apmmachine commented Jan 27, 2022 • edited Loading

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

🤖 GitHub comments

eyalkoren commented Jan 27, 2022

felixbarny commented Jan 28, 2022

tobiasstadler commented Jan 28, 2022

eyalkoren commented Jan 31, 2022 • edited Loading

eyalkoren commented Feb 13, 2022

eyalkoren commented Feb 14, 2022

eyalkoren commented Feb 22, 2022

tobiasstadler commented Mar 8, 2022

tobiasstadler commented Mar 8, 2022 • edited Loading

eyalkoren commented Mar 8, 2022 • edited Loading

SylvainJuge left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eyalkoren commented Mar 15, 2022

SylvainJuge left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

efdknittlfrank commented Mar 17, 2022

alejavigo commented Mar 14, 2024

eyalkoren commented Jan 27, 2022 •

edited by zube bot

Loading

cla-checker-service bot commented Jan 27, 2022 •

edited

Loading

apmmachine commented Jan 27, 2022 •

edited

Loading

eyalkoren commented Jan 31, 2022 •

edited

Loading

tobiasstadler commented Mar 8, 2022 •

edited

Loading

eyalkoren commented Mar 8, 2022 •

edited

Loading