Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix errors identified from telemetry #5279

Merged
merged 3 commits into from
Mar 8, 2024

Conversation

andrewlock
Copy link
Member

Summary of changes

  • Tries to fix (or at least, isolate) a FieldAccessException in MongoDb integration
  • Fixes an ArgumentNullException in the AWS SQS integration

Reason for change

Both of these errors were flagged by telemetry

Implementation details

Seeing error logs related to the MongoDb integration:

System.FieldAccessException
   at REDACTED
   at Datadog.Trace.ClrProfiler.AutoInstrumentation.MongoDb.MongoDbIntegration.CreateScope[TConnection](Object wireProtocol, TConnection connection)
   at REDACTED
   at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.ExecuteAsync(IConnection connection, CancellationToken cancellationToken)

My theory is that it's duck-chaining related, but tbh it's a weird one, so done two things:

  • Remove the duck chaining so that we can explicitly call TryDuckCast instead of letting the duckchaining do a standard DuckCast<>
  • Extract the potential sources to local functions to get clearer call stacks.

My prime suspect is actually the BSON parsing, but that's all in a try-catch, so I don't see how it can be 🤷‍♂️

The AWS one is much simpler - some properties were unexpectedly null. Luckily they're List<string> so we can just new them up ourselves

Test coverage

Integration tests should test it still works, but it won't test that these fix the issues, but we're not exactly sure what the issue was so...

@andrewlock andrewlock added type:bug area:tracer The core tracer library (Datadog.Trace, does not include OpenTracing, native code, or integrations) area:automatic-instrumentation Automatic instrumentation managed C# code (Datadog.Trace.ClrProfiler.Managed) identified-by:telemetry labels Mar 6, 2024
@andrewlock andrewlock requested a review from a team as a code owner March 6, 2024 15:48
Copy link
Member

@tonyredondo tonyredondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a question.

}
catch (Exception ex)
{
Log.Warning(ex, "Unable to access IWireProtocol.Command properties.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok to return true after the exception?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is how it was working before - the return false is only if we don't want to create a span at all 👍

@datadog-ddstaging
Copy link

datadog-ddstaging bot commented Mar 6, 2024

Datadog Report

Branch report: andrew/fix-more-telemetry-errors
Commit report: 30e1628
Test service: dd-trace-dotnet

✅ 0 Failed, 335377 Passed, 1602 Skipped, 57m 10.9s Wall Time
❄️ 1 New Flaky

New Flaky Tests (1)

  • HttpClient_SubmitsTraces - Datadog.Trace.ClrProfiler.IntegrationTests.HttpMessageHandlerTests - Last Failure

    Expand for error
     Expected url "http://localhost:39475/8a800dd5-b025-4b12-ab6e-274491acf3c1/" to end with "?key1=value1&<redacted>".
    

@andrewlock
Copy link
Member Author

andrewlock commented Mar 6, 2024

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:

  • Welch test with statistical test for significance of 5%
  • Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5279) - mean (72ms)  : 65, 79
     .   : milestone, 72,
    master - mean (74ms)  : 66, 83
     .   : milestone, 74,

    section CallTarget+Inlining+NGEN
    This PR (5279) - mean (986ms)  : 965, 1007
     .   : milestone, 986,
    master - mean (996ms)  : 972, 1020
     .   : milestone, 996,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5279) - mean (111ms)  : 108, 114
     .   : milestone, 111,
    master - mean (111ms)  : 108, 115
     .   : milestone, 111,

    section CallTarget+Inlining+NGEN
    This PR (5279) - mean (711ms)  : 693, 729
     .   : milestone, 711,
    master - mean (717ms)  : 686, 749
     .   : milestone, 717,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5279) - mean (97ms)  : 92, 101
     .   : milestone, 97,
    master - mean (94ms)  : 91, 97
     .   : milestone, 94,

    section CallTarget+Inlining+NGEN
    This PR (5279) - mean (667ms)  : 643, 691
     .   : milestone, 667,
    master - mean (676ms)  : 641, 712
     .   : milestone, 676,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5279) - mean (188ms)  : 184, 192
     .   : milestone, 188,
    master - mean (188ms)  : 186, 190
     .   : milestone, 188,

    section CallTarget+Inlining+NGEN
    This PR (5279) - mean (1,060ms)  : 1037, 1082
     .   : milestone, 1060,
    master - mean (1,064ms)  : 1037, 1091
     .   : milestone, 1064,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5279) - mean (269ms)  : 263, 274
     .   : milestone, 269,
    master - mean (271ms)  : 268, 275
     .   : milestone, 271,

    section CallTarget+Inlining+NGEN
    This PR (5279) - mean (862ms)  : 835, 889
     .   : milestone, 862,
    master - mean (871ms)  : 843, 899
     .   : milestone, 871,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (5279) - mean (260ms)  : 256, 265
     .   : milestone, 260,
    master - mean (261ms)  : 257, 265
     .   : milestone, 261,

    section CallTarget+Inlining+NGEN
    This PR (5279) - mean (852ms)  : 822, 882
     .   : milestone, 852,
    master - mean (857ms)  : 833, 881
     .   : milestone, 857,

Loading

@andrewlock
Copy link
Member Author

Throughput/Crank Report:zap:

Throughput results for AspNetCoreSimpleController comparing the following branches/commits:

Cases where throughput results for the PR are worse than latest master (5% drop or greater), results are shown in red.

Note that these results are based on a single point-in-time result for each branch. For full results, see one of the many, many dashboards!

gantt
    title Throughput Linux x64 (Total requests) 
    dateFormat  X
    axisFormat %s
    section Baseline
    This PR (5279) (10.878M)   : 0, 10878354
    master (11.007M)   : 0, 11006674
    benchmarks/2.9.0 (10.795M)   : 0, 10795198

    section Automatic
    This PR (5279) (7.509M)   : 0, 7509325
    master (7.556M)   : 0, 7555579
    benchmarks/2.9.0 (8.048M)   : 0, 8047571

    section Trace stats
    This PR (5279) (7.806M)   : 0, 7806212
    master (7.855M)   : 0, 7855147

    section Manual
    This PR (5279) (9.436M)   : 0, 9435974
    master (9.556M)   : 0, 9555551

    section Manual + Automatic
    This PR (5279) (7.133M)   : 0, 7132744
    master (7.141M)   : 0, 7140945

    section Version Conflict
    This PR (5279) (6.567M)   : 0, 6566655
    master (6.452M)   : 0, 6451971

Loading
gantt
    title Throughput Linux arm64 (Total requests) 
    dateFormat  X
    axisFormat %s
    section Baseline
    This PR (5279) (9.706M)   : 0, 9705591
    master (9.496M)   : 0, 9495623
    benchmarks/2.9.0 (9.570M)   : 0, 9570288

    section Automatic
    This PR (5279) (6.601M)   : 0, 6600950
    master (6.550M)   : 0, 6550088

    section Trace stats
    This PR (5279) (6.983M)   : 0, 6982557
    master (6.930M)   : 0, 6929825

    section Manual
    This PR (5279) (8.161M)   : 0, 8161209
    master (8.294M)   : 0, 8293659

    section Manual + Automatic
    This PR (5279) (6.128M)   : 0, 6127596
    master (6.274M)   : 0, 6273873

    section Version Conflict
    This PR (5279) (5.731M)   : 0, 5730561
    master (5.748M)   : 0, 5748183

Loading
gantt
    title Throughput Windows x64 (Total requests) 
    dateFormat  X
    axisFormat %s
    section Baseline
    This PR (5279) (10.470M)   : 0, 10469962
    master (10.120M)   : 0, 10120352
    benchmarks/2.9.0 (10.670M)   : 0, 10669848

    section Automatic
    This PR (5279) (7.242M)   : 0, 7242386
    master (7.343M)   : 0, 7343410
    benchmarks/2.9.0 (7.700M)   : 0, 7699769

    section Trace stats
    This PR (5279) (7.600M)   : 0, 7600091
    master (7.630M)   : 0, 7629966

    section Manual
    This PR (5279) (9.287M)   : 0, 9287468
    master (8.884M)   : 0, 8883719

    section Manual + Automatic
    This PR (5279) (7.164M)   : 0, 7163924
    master (6.989M)   : 0, 6989119

    section Version Conflict
    This PR (5279) (6.501M)   : 0, 6501257
    master (6.322M)   : 0, 6322239

Loading
gantt
    title Throughput Linux x64 (ASM) (Total requests) 
    dateFormat  X
    axisFormat %s
    section Baseline
    master (7.454M)   : 0, 7454237
    benchmarks/2.9.0 (7.800M)   : 0, 7799916

    section No attack
    master (1.841M)   : 0, 1841297
    benchmarks/2.9.0 (3.202M)   : 0, 3202059

    section Attack
    master (1.454M)   : 0, 1454025
    benchmarks/2.9.0 (2.532M)   : 0, 2531516

    section Blocking
    master (3.195M)   : 0, 3194705

    section IAST default
    master (6.463M)   : 0, 6462928

    section IAST full
    master (5.595M)   : 0, 5594918

    section Base vuln
    master (0.946M)   : 0, 945501

    section IAST vuln
    master (0.849M)   : 0, 849226

Loading

Copy link
Contributor

@vandonr vandonr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the SQS part

We're seeing exceptions like this:
```
System.FieldAccessException
   at REDACTED
   at Datadog.Trace.ClrProfiler.AutoInstrumentation.MongoDb.MongoDbIntegration.CreateScope[TConnection](Object wireProtocol, TConnection connection)
   at REDACTED
   at MongoDB.Driver.Core.WireProtocol.CommandWireProtocol`1.ExecuteAsync(IConnection connection, CancellationToken cancellationToken)
```

and the only explanation I can think of is a duck-chaining issue, so stopped doing duck chaining and being explicit instead
@andrewlock andrewlock force-pushed the andrew/fix-more-telemetry-errors branch from 60672a2 to 30e1628 Compare March 8, 2024 08:06
@andrewlock andrewlock merged commit ca1bb6e into master Mar 8, 2024
55 of 58 checks passed
@andrewlock andrewlock deleted the andrew/fix-more-telemetry-errors branch March 8, 2024 17:43
@github-actions github-actions bot added this to the vNext milestone Mar 8, 2024
link04 added a commit that referenced this pull request Mar 12, 2024
commit 832de4b
Author: Flavien Darche <11708575+e-n-0@users.noreply.github.com>
Date:   Tue Mar 12 20:24:21 2024 +0000

    [ASM][IAST] Configure maximum IAST Ranges (#5292)

    * Add configuration key

    * Use a RangeList in some case to not exceed the max number

    * Revert some code + implem correct merge

    * Fix + Add unit and integration tests

    * Usual macos fix for snapshot

    * Fix snapshots hashs

    * Update snapshots (remove other tests as they can't apply different env var values in same run)

    * Apply comment

    * Re-integrate integration tests with multiple processes (new fixture)

    * Add test case for setting MaxRangeCount to zero

commit 83f6ab1
Author: Tony Redondo <tony.redondo@datadoghq.com>
Date:   Tue Mar 12 21:20:39 2024 +0100

    [CI Visibility] - Enable snapshot testing of current testing framework implementations (#5226)

commit 233695a
Author: Daniel Romano <108014683+daniel-romano-DD@users.noreply.github.com>
Date:   Tue Mar 12 17:06:06 2024 +0100

    [IAST] Vulnerability and Evidence truncation (#5302)

    * Initial implementation

    * Updated test bundle

    * Fix test compilation error

    * Fix snapshot (from rebase)

    * Fix typo in config value. Updated tests

    * Fix typo

    * Refactor converters initialization

commit ea31cf5
Author: Anna <anna.yafi@datadoghq.com>
Date:   Tue Mar 12 16:39:09 2024 +0100

    Deactivate benchmark for legacy encoder (#5299)

commit d0d713a
Author: NachoEchevarria <53266532+NachoEchevarria@users.noreply.github.com>
Date:   Tue Mar 12 09:25:27 2024 +0100

    Set big regex timeouts for tests (#5297)

commit d5388d6
Author: Lucas Pimentel <lucas.pimentel@datadoghq.com>
Date:   Mon Mar 11 15:20:58 2024 -0400

    [Tracing] Support configuring `DD_TRACE_ENABLED` remotely (#5181)

    * add support for remote TraceEnabled setting

    * fix unrelated typo

    * add ApmTracingEnabled capability 19

    * add missing RCM capability 18

    * add mapping

    * add unit test

    * add comments to unit test

    * rename property to match RCM constant

    * include config in integration tests

    * fix test json

    * rewrite tests to use raw values instead of strings

commit 2b95f46
Author: Flavien Darche <11708575+e-n-0@users.noreply.github.com>
Date:   Mon Mar 11 17:47:55 2024 +0100

    [ASM][IAST] Support manual JSON deserialisation (Newtonsoft.Json) (#5238)

    * Add Newtonsoft.Json (non -working yet)

    * Refactor the tainting proces + add tests

    * Add the JToken Parse aspect + test

    * Rename Aspects class + Duck orignal method call

    * Add integration test

    * Fix nullability

    * Fix compilation issue for netfx

    * Change JSON formatting in ParseTests

    * Fix a test json format

    * Refactor NewtonsoftJsonAspects to static constructor

commit 0d511f9
Author: Igor Kravchenko <21974069+kr-igor@users.noreply.github.com>
Date:   Mon Mar 11 09:35:23 2024 -0500

    [DSM] - Fixes for IbmMq instrumentation (#5271)

    * Use byte properties instead of strings

    * Fixed nullability files

    * Added some debug info

    * Fixed lint issues

    * Added a bit more logs

    * Using slow byte->sbyte conversion

    * Added noop headers adapter

    * Fixed nullability files

    * Added more logs

    * Cleaned up debug logs

    * Removed symlink

    * Update tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/IbmMq/IbmMqHeadersAdapter.cs

    Removed debug code

    Co-authored-by: Andrew Lock <andrew.lock@datadoghq.com>

    * Update tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/IbmMq/IbmMqHeadersAdapter.cs

    Using Unsafe.As instead of BlockCopy

    Co-authored-by: Andrew Lock <andrew.lock@datadoghq.com>

    * Update tracer/src/Datadog.Trace/ClrProfiler/AutoInstrumentation/IbmMq/IbmMqHeadersAdapter.cs

    Use Unsafe.As instead of BlockCopy

    Co-authored-by: Andrew Lock <andrew.lock@datadoghq.com>

    * Addressed some of the comments

    * Removed context propagation options

    ---------

    Co-authored-by: Andrew Lock <andrew.lock@datadoghq.com>

commit 5684a72
Author: Zach Montoya <zach.montoya@datadoghq.com>
Date:   Fri Mar 8 20:56:30 2024 -0800

    [Tracing] Update instrumentation point for DD_TRACE_DELAY_WCF_INSTRUMENTATION_ENABLED=true (#5206)

    Updates the instrumentation point for `DD_TRACE_DELAY_WCF_INSTRUMENTATION_ENABLED=true` so that now a server span is created immediately before IDispatchMessageInspector implementations are run, so application code can access the root span from inside a IDispatchMessageInspector.AfterReceiveRequest callback.

    This PR also does some cleanup to remove unused Wcf files and it makes the entire Wcf instrumentation use nullable reference types.

commit ca1bb6e
Author: Andrew Lock <andrew.lock@datadoghq.com>
Date:   Fri Mar 8 17:43:57 2024 +0000

    Fix errors identified from telemetry (#5279)

    * Try to avoid MongoDb exception

    We're seeing exceptions like this:
    ```
    System.FieldAccessException
       at REDACTED
       at Datadog.Trace.ClrProfiler.AutoInstrumentation.MongoDb.MongoDbIntegration.CreateScope[TConnection](Object wireProtocol, TConnection connection)
       at REDACTED
       at MongoDB.Driver.Core.WireProtocol.CommandWireProtocol`1.ExecuteAsync(IConnection connection, CancellationToken cancellationToken)
    ```

    and the only explanation I can think of is a duck-chaining issue, so stopped doing duck chaining and being explicit instead

    * Add local functions to try to isolate problems

    * Fix ArgumentNullException in AWS SQS integration
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:automatic-instrumentation Automatic instrumentation managed C# code (Datadog.Trace.ClrProfiler.Managed) area:integrations area:tracer The core tracer library (Datadog.Trace, does not include OpenTracing, native code, or integrations) identified-by:telemetry type:bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants