Enable sampling based on upstream sampling decision #1200

Merged
merged 34 commits into from
Sep 5, 2019

Conversation

lmolkova
Member

This change allows respecting the upstream sampling decision from the W3C trace-context while sampling.
This is applicable in the following scenarios:

  • in queue batching scenarios
  • correlation with tracing tools that use a different sampling algorithm (OpenTelemetry/Census)

This change will not affect cases where the upstream sampling decision is not propagated (i.e. the upstream is not W3C-enabled).

The first consumer of this change is Azure Functions: they will enable EventHubs instrumentation and start generating 'batching' operations with links and a sampled-in decision.


EventHub, ServiceBus, and other messaging services and their SDKs support batching: messages can be batched together to optimize connection usage and network bandwidth when they are sent.

It's also a common pattern to process messages in batches.

Users can still do per-message processing and may add custom instrumentation to help with tracing.

However, we want to provide customers with reasonable tracing features without requiring them to change their processing pattern or customize AI telemetry.

So, for the batching pattern:

  • each message carries a unique context (or none), described as a W3C traceparent, which contains a traceId (the AppInsights operationId), a parentId, and a sampling flag.
  • this context describes the transaction in which the message was created.
  • when batch processing starts, we want to start a new AI operation (with a new ID) and link all related message contexts to this operation.
  • the UX will know how to discover linked operations and show them. It will also know how to find the batch-processing operation from each message's transaction.
  • assuming the above is done, the one thing still missing is consistent sampling: we want to sample the batch-processing operation consistently with the transactions in which the messages were created:
    • if any of them is sampled in, we want the batched operation to be sampled in too (but still under the configured rate)
    • if none is sampled in, we will not give additional weight to such an item.
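The consistent-sampling rule in the last bullet can be sketched as follows (a minimal, language-agnostic sketch in Python; the actual SDK is C#, and `proactive_decision` is a hypothetical name, not the SDK API):

```python
def proactive_decision(link_sampled_flags):
    """Proactive sampling decision for a batch-processing operation,
    given the traceparent sampled flags of its linked messages.

    If any linked upstream transaction was sampled in, mark the batch
    operation as sampled in (the configured sampling rate still applies
    downstream); otherwise leave the decision to the regular sampler.
    """
    return "SampledIn" if any(link_sampled_flags) else "None"

# One linked message was sampled in upstream: the batch is sampled in.
print(proactive_decision([False, True, False]))  # SampledIn
# No linked message was sampled in: no extra weight is given.
print(proactive_decision([False, False]))        # None
```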

A similar problem: when OpenTelemetry/Census initiates a transaction, it propagates a W3C trace-context that contains a sampling flag. Today the AI SDK ignores it, which means that even though the context is propagated and respected, AI makes a brand-new sampling decision; if both services sample in at a 1% rate, transactions sampled in by both end up at a 0.01% rate.
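The rate arithmetic can be checked directly (a hypothetical two-service example; the numbers match the 1% rates above):

```python
# Two services sampling independently at 1% each: the decisions multiply.
upstream_rate = 0.01  # OpenTelemetry/Census samples in 1% of transactions
ai_rate = 0.01        # AI SDK makes its own 1% decision, ignoring upstream
end_to_end = upstream_rate * ai_rate
print(f"{end_to_end:.2%}")  # 0.01% of transactions are sampled in by both

# If AI instead respects the upstream sampled flag, every transaction the
# upstream sampled in is kept, and the end-to-end rate stays at 1%.
```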

@@ -31,27 +31,31 @@ Microsoft.ApplicationInsights.DataContracts.ExceptionTelemetry.ItemTypeFlag.get
Microsoft.ApplicationInsights.DataContracts.PageViewPerformanceTelemetry.ItemTypeFlag.get -> Microsoft.ApplicationInsights.DataContracts.SamplingTelemetryItemTypes
Microsoft.ApplicationInsights.DataContracts.PageViewTelemetry.ItemTypeFlag.get -> Microsoft.ApplicationInsights.DataContracts.SamplingTelemetryItemTypes
Microsoft.ApplicationInsights.DataContracts.RequestTelemetry.ItemTypeFlag.get -> Microsoft.ApplicationInsights.DataContracts.SamplingTelemetryItemTypes
Microsoft.ApplicationInsights.DataContracts.ISupportAdvancedSampling.IsSampledOutAtHead.get -> bool
Member Author


Changed bool IsSampledOutAtHead (introduced in beta-1) to the more elaborate and extensible SamplingDecision ProactiveSamplingDecision.

@@ -1,4 +1,4 @@
#pragma warning disable CA1716, 612, 618 // Namespace naming, obsolete TelemetryConfigration.Active
Member Author

@lmolkova lmolkova Aug 27, 2019


These tests became broken on my machine in VS 2019; I fixed them by replacing TelemetryConfiguration.Active usage with a config instance created per test and carefully disposed.

@@ -0,0 +1,94 @@
namespace Microsoft.ApplicationInsights
Member Author


Sampling relies on timestamps and apparently used UtcNow; fixed to stabilize the tests.

Contributor

@cijothomas cijothomas left a comment


changelog.md also

if (telemetryItem is ISupportAdvancedSampling supportSamplingTelemetry &&
    supportSamplingTelemetry.ProactiveSamplingDecision == SamplingDecision.None)
{
    supportSamplingTelemetry.ProactiveSamplingDecision = SamplingDecision.SampledIn;
}
Member


If an item was proactively sampled out by our own algorithm (because the experimental "proactiveSampling" flag is turned on), this initializer would not run at all (all initializers are skipped for sampled-out items), and an item we would like to sample in due to Activity.Recorded won't receive the SampledIn status.

We discussed this offline; it can be addressed either by modifying the sampledOut value in TelemetryClient.Initialize based on Activity.Recorded, or by modifying the original value of sampledOut in HostingDiagnosticsListener.OnBeginRequest in the ASP.NET Core repo.

Non-blocking, as it does not affect the first set of scenarios this PR is trying to address. May become blocking later in the world of all-and-only W3C correlation :)

Member Author


I've played with overriding sampling decision in TelemetryClient.Initialize a bit and decided that it's not the right place for it.

  1. There is a (backward-compatibility) requirement that Ms.ApplicationInsights.dll should work even if someone (erroneously) uses it without DiagnosticSource.dll, i.e. all calls to Activity.Current should be guarded with a check that Activity is available.

  2. This is only needed to override a decision made by the auto-collector. But the auto-collector could respect Activity.Recorded itself and save the base SDK from overriding the sampling decision.

So HostingDiagnosticListener should check for Activity.Recorded.
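The proposed fix can be sketched as follows (Python pseudocode for the C# logic; the function and parameter names are illustrative, not the actual SDK API):

```python
def request_sampling_decision(activity_available, activity_recorded,
                              proactively_sampled_out):
    """Sketch of the check HostingDiagnosticListener could make when a
    request starts.

    activity_available guards the case where DiagnosticSource.dll (and
    therefore Activity) is missing; activity_recorded is the upstream
    sampled flag; proactively_sampled_out is the head sampling outcome.
    """
    if activity_available and activity_recorded:
        # Upstream sampled this transaction in: don't sample it out at head.
        return "SampledIn"
    return "SampledOut" if proactively_sampled_out else "None"
```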
