
Missing AppTraces in Log Analytics Workspace #2129

Closed
stevendick-work opened this issue Feb 18, 2022 · 22 comments

@stevendick-work

Expected behavior

We have an application that is logging the sequential progression of a work package comprising 18819 items.

For example, we log the following at INFO level:

2022-02-17 07:47:24.257 INFO 1 — [-StreamThread-6] c.s.g.i.b.p.k.t.p.PackageDoneTransformer : {"RUN_ID": 1071282}: Status update (17202/18819) received [{"PACKAGE_ID": "1071282|Close|Close|IFA|20220701|300532312|", "RUN_ID": 1071282, "CREINS_ID": null, "TREATY": null, "REPUN_1": null, "REPUN_2": null, "CoverageId": null, "BusinessDate": null, "CB_TO_DATE": null, "PackageCount": 18819, "RecordCountTotal": 83, "RecordCountProcessed": 0, "RecordCountError": 0, "OutputCountAms": 0, "OutputCountStp": 0, "OutputCountTav": 0, "LastUpdated": null, "Errors": []}]

This was scraped from the pod log output in AKS.

Actual behavior

The above example never makes it to the Log Analytics Workspace.

We do see log statements from before and after.

Item 17201 is logged, but we don't see any items logged again until 17217.

What we've checked

The loss of log messages does not coincide with data cap resets on either the LAW or the App Insights agent.

Sampling is 100% on the App Insights instance.
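
(For reference, one way to double-check that no sampling is being applied is to compare stored row counts with the itemCount column, which records how many original items each stored row represents; if the sums diverge, sampling is in effect. A rough sketch, with table list and window adjusted as needed:

union requests, dependencies, traces
| where timestamp >= ago(1d)
| summarize storedRows = count(), originalItems = sum(itemCount) by itemType
)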

We don't see any App Insights network/connection errors reported in the pod log.

I have previously seen that there can be extra latency with ingestion that causes the TimeGenerated to be later than expected, but I don't see any evidence of this.
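
(Ingestion latency can be estimated in the App Insights query experience by comparing ingestion_time() with the telemetry timestamp; something along these lines, with the table and time window adjusted as needed:

traces
| where timestamp >= ago(1d)
| extend ingestionLatency = ingestion_time() - timestamp
| summarize percentiles(ingestionLatency, 50, 95, 99)
)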

Goal

What else can we check that might contribute to the missing logging?

Is this approach to data capture with the App Insights agent not recommended, given that we cannot guarantee 100% log data capture?

System information

Please provide the following information:

  • SDK Version: 3.2.3
  • OS type and version: AKS with Debian 11-based containers (Google Distroless)
  • Using spring-boot? Yes
@heyams
Contributor

heyams commented Feb 18, 2022

did you check App Insights' traces table? can you share your ikey? i can check our internal stats to see if there is any daily quota exceeded for your App Insights resource. LA workspace and App Insights have different quota limits.

@trask
Member

trask commented Feb 18, 2022

hi @stevendick-swissre, can you try with the latest (3.2.6)?

3.2.4 started logging when telemetry overflowed an internal queue and got dropped

3.2.5 fixed an issue that was discovered by this new logging (#2062)

(no change in 3.2.6 that I think would impact this, but worth going to it if already bumping)

@stevendick-work
Author

As we had logging from immediately before and after the missing entries, I am assuming no data cap was breached.

I have confirmed that the LAW did not have a data cap breach by checking the _LogOperation function.
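
(For anyone checking the same thing, a query along these lines against the workspace should surface data-cap events; the exact Operation text may vary:

_LogOperation
| where TimeGenerated > ago(7d)
| where Category == "Ingestion"
| where Operation has "Data collection"
| project TimeGenerated, Operation, Detail
)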

How can I check if a data cap on the App Insights instance is exceeded?

I don't want to share the ikey in a public forum or via email.

I have asked the developer to upgrade to v3.2.6.

@stevendick-work
Author

We updated to 3.2.6 but saw the same problem.

We did see some new output from the App Insights agent:

2022-02-21 12:17:21.606Z WARN c.m.a.a.i.t.BatchSpanProcessor - In the last 5 minutes, the following operation has failed 295785 times (out of 571670): Queuing span:

  • Max general export queue capacity of 2048 has been hit, dropping a telemetry record (max general export queue capacity can be increased in the applicationinsights.json configuration file, e.g. { "preview": { "generalExportQueueCapacity": 4096 } } (295785 times)
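
(For reference, the setting mentioned in the warning would presumably sit in applicationinsights.json alongside the rest of the agent configuration, roughly like this; the connection string below is just a placeholder:

{
  "connectionString": "InstrumentationKey=00000000-0000-0000-0000-000000000000",
  "preview": {
    "generalExportQueueCapacity": 4096
  }
}
)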

Does this mean that the agent failed to publish to App Insights? Does that suggest a network issue or a problem on the App Insights side?

@trask
Member

trask commented Feb 21, 2022

this means that the single export thread in the JVM can't keep up with the volume of telemetry data being produced. up to 2048 telemetry records are buffered (by default), but once that limit is reached, if it still can't keep up it starts dropping telemetry records.

from the warning message, it looks like you are sending 571670 telemetry records over a 5 minute window, which is ~2000 telemetry records per second.

and it looks like roughly half of those records are being dropped.

can you check the distribution of records that are being ingested? it may help to know the distribution across requests, dependencies, traces, customEvents, and anything else, at least to confirm that this volume is expected.

customMetrics should have their own (larger) queue now in 3.2.6, but your warning message points to the "general" queue, which is the queue for all other telemetry records.

another factor is how long the ingestion service takes to respond, since the single export thread is throttled on waiting for the response. if you can email me your instrumentation key, I can check our internal data to see whether this is likely an issue, though I suspect that at a sustained rate of ~2000 telemetry records per second we may just need to bump the number of export threads.

@stevendick-work
Author

Given we're using the agent, is the option to configure the number of export threads exposed?

How do we check the distribution of records being ingested?

@trask
Member

trask commented Feb 23, 2022

here's a query to check the distribution of records; if you can run it over a timeframe that represents one of the heavy 5-minute windows, that would be ideal:

union *
| where timestamp >= ago(1d)
| summarize count(), ingestedGB=sum(_BilledSize) / 1.E9 by itemType
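
to zoom in on one of the heavy windows (e.g. the 5 minutes leading up to the 12:17 warning above), something like this should work, with the timestamps adjusted to your window:

union *
| where timestamp between (datetime(2022-02-21T12:12:00Z) .. datetime(2022-02-21T12:17:21Z))
| summarize count(), ingestedGB=sum(_BilledSize) / 1.E9 by itemType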

if you can email me your instrumentation key (trask.stalnaker@microsoft.com), I can check what your ingestion service response times are, which will give us another clue about the best solution here

@stevendick-work
Author

app-insights-ingested.csv
This was generated 2022-02-23 09:00 CET. We have testing cycles running at the moment, so I will have a look to see if volumes are different over the previous days.

@stevendick-work
Author

Here's a chart from the last 7 days:

[chart image: record counts over the last 7 days]

@trask
Member

trask commented Feb 24, 2022

thanks @stevendick-swissre

can you try with this SNAPSHOT build? https://github.com/microsoft/ApplicationInsights-Java/suites/5427505526/artifacts/171896362

it still only uses a single export thread, but that thread should no longer wait for the response from the ingestion service before proceeding to export the next batch

@stevendick-work
Author

stevendick-work commented Feb 24, 2022 via email

@trask
Member

trask commented Mar 13, 2022

hey @stevendick-swissre, I think #2181 is a really good improvement that we can include in an upcoming release.

In a simple local logging test, I can get ~30x(!) more throughput without dropping any telemetry (still using the single thread, just not blocking on responses from the ingestion service).

would you be able to test this out if we release 3.2.9-BETA with this one change?

@stevendick-work
Author

I think it's likely we can test this, but I'm waiting on feedback from the developer. The potential issue is that the impacted component may not need to load data at volume again, given where we are in the testing cycle.

@asproll

asproll commented Mar 21, 2022

Hi @trask, I can test this if you can point me to the 3.2.9-BETA release.

@stevendick-work
Author

@asproll is the developer on our project who discovered the issue.

@trask
Member

trask commented Mar 22, 2022

great, we will plan to release 3.2.9-BETA tomorrow

@trask
Member

trask commented Mar 25, 2022

@asproll @stevendick-swissre 3.2.9 is released: https://github.com/microsoft/ApplicationInsights-Java/releases

@asproll

asproll commented Mar 25, 2022

Hi @trask , thanks a lot. We will test this as soon as we can and let you know.

@asproll

asproll commented Mar 25, 2022

@trask, I was able to successfully test the release. It solved our problem with the missing log statements. Now all the expected traces were logged. Thanks a lot.

One new thing I noticed: we keep getting the following message in the log, and I'm not sure what it means or whether it indicates a real problem:

WARN c.m.a.a.i.t.BatchSpanProcessor - In the last 5 minutes, the following operation has failed 1 times (out of 13): Add async export: * Max number of concurrent exports 1 has been hit, may see some export throttling due to this (1 times)

@trask
Member

trask commented Mar 25, 2022

@asproll awesome, thanks for reporting that warning, you can safely ignore it, we will fix it in #2208

@ghost

ghost commented Apr 1, 2022

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 7 days. It will be closed if no further activity occurs within 7 days of this comment.

@stevendick-work
Author

Closing.
