Enable sampling based on upstream sampling decision #1200

Merged
merged 34 commits into from
Sep 5, 2019

Conversation

lmolkova
Member

This change allows respecting the upstream sampling decision from the W3C trace-context while sampling.
This is applicable in the following scenarios:

  • in queue batching scenarios
  • correlation with tracing tools that use a different sampling algorithm (OpenTelemetry/Census)

This change will not affect cases where the upstream sampling decision is not propagated (i.e. the upstream is not W3C-enabled).

The first consumer of this change is Azure Functions: they will enable EventHubs instrumentation and start generating 'batching' operations with links and a sampled-in decision.


EventHub, ServiceBus, and other messaging services and their SDKs support batching: messages can be batched together to optimize connection usage and network bandwidth when they are sent.

It's also a common pattern to process messages in batches.

Users can still do per-message processing and may add custom instrumentation to help with tracing.

However, we want to provide customers with reasonable tracing features without requiring them to change their processing pattern or customize AI telemetry.

So, for the batching pattern:

  • each message carries a unique context (or none), described as a W3C traceparent, which contains a traceId (the AppInsights operationId), a parentId, and a sampling flag.
  • this context describes the transaction in which the message was created.
  • when batch processing starts, we want to start a new AI operation (with a new ID) and link all related message contexts to this operation.
  • the UX will know how to discover linked operations and show them. It will also know how to find the batch-processing operation from each message's transaction.
  • assuming the above is done, the one thing still missing is consistent sampling: we want to sample the batch-processing operation consistently with the transactions in which the messages were created:
    • if any of them is sampled in, we want the batched operation to be sampled in too (but still under the configured rate)
    • if none is sampled in, we will not give additional weight to such an item.
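The consistent-sampling rule in the last bullet can be sketched as follows (a minimal, language-agnostic sketch in Python; the actual SDK is C#, and `proactive_decision` is a hypothetical name, not the SDK API):

```python
def proactive_decision(link_sampled_flags):
    """Proactive sampling decision for a batch-processing operation,
    given the traceparent sampled flags of its linked messages.

    If any linked upstream transaction was sampled in, mark the batch
    operation as sampled in (the configured sampling rate still applies
    downstream); otherwise leave the decision to the regular sampler.
    """
    return "SampledIn" if any(link_sampled_flags) else "None"

# One linked message was sampled in upstream: the batch is sampled in.
print(proactive_decision([False, True, False]))  # SampledIn
# No linked message was sampled in: no extra weight is given.
print(proactive_decision([False, False]))        # None
```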

A similar problem: when OpenTelemetry/Census initiates a transaction, it propagates a W3C trace-context that contains a sampling flag. Today the AI SDK ignores it, which means that even though the context is propagated and respected, AI makes a brand-new sampling decision; if both services sample in at a 1% rate, transactions sampled in by both end up at a 0.01% rate.
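The rate arithmetic can be checked directly (a hypothetical two-service example; the numbers match the 1% rates above):

```python
# Two services sampling independently at 1% each: the decisions multiply.
upstream_rate = 0.01  # OpenTelemetry/Census samples in 1% of transactions
ai_rate = 0.01        # AI SDK makes its own 1% decision, ignoring upstream
end_to_end = upstream_rate * ai_rate
print(f"{end_to_end:.2%}")  # 0.01% of transactions are sampled in by both

# If AI instead respects the upstream sampled flag, every transaction the
# upstream sampled in is kept, and the end-to-end rate stays at 1%.
```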

@@ -31,27 +31,31 @@ Microsoft.ApplicationInsights.DataContracts.ExceptionTelemetry.ItemTypeFlag.get
Microsoft.ApplicationInsights.DataContracts.PageViewPerformanceTelemetry.ItemTypeFlag.get -> Microsoft.ApplicationInsights.DataContracts.SamplingTelemetryItemTypes
Microsoft.ApplicationInsights.DataContracts.PageViewTelemetry.ItemTypeFlag.get -> Microsoft.ApplicationInsights.DataContracts.SamplingTelemetryItemTypes
Microsoft.ApplicationInsights.DataContracts.RequestTelemetry.ItemTypeFlag.get -> Microsoft.ApplicationInsights.DataContracts.SamplingTelemetryItemTypes
Microsoft.ApplicationInsights.DataContracts.ISupportAdvancedSampling.IsSampledOutAtHead.get -> bool
Member Author


Changed bool IsSampledOutAtHead (introduced in beta-1) to the more elaborate and extensible SamplingDecision ProactiveSamplingDecision.

@@ -1,4 +1,4 @@
#pragma warning disable CA1716, 612, 618 // Namespace naming, obsolete TelemetryConfigration.Active
Member Author

@lmolkova lmolkova Aug 27, 2019


These tests became broken on my machine in VS 2019; I fixed them by replacing TelemetryConfiguration.Active usage with a config instance created per test and carefully disposed.

@@ -0,0 +1,94 @@
namespace Microsoft.ApplicationInsights
Member Author


Sampling relies on timestamps and apparently used UtcNow; fixed to stabilize the tests.

Contributor

@cijothomas cijothomas left a comment


changelog.md also

if (telemetryItem is ISupportAdvancedSampling supportSamplingTelemetry &&
    supportSamplingTelemetry.ProactiveSamplingDecision == SamplingDecision.None)
{
    supportSamplingTelemetry.ProactiveSamplingDecision = SamplingDecision.SampledIn;
}
Member


If an item was proactively sampled out by our own algorithm (because the experimental "proactiveSampling" flag is turned on), this initializer would not run at all (all initializers are skipped for sampled-out items), and an item we would like to sample in due to Activity.Recorded won't receive the SampledIn status.

We discussed this offline; it can be addressed either by modifying the sampledOut value in TelemetryClient.Initialize based on Activity.Recorded, or by modifying the original value of sampledOut in HostingDiagnosticsListener.OnBeginRequest in the ASP.NET Core repo.

Non-blocking, as it does not affect the first set of scenarios this PR is trying to address. May become blocking later in the world of all-and-only W3C correlation :)

Member Author


I've played with overriding sampling decision in TelemetryClient.Initialize a bit and decided that it's not the right place for it.

  1. There is a (backward-compatibility) requirement that Ms.ApplicationInsights.dll should work even if someone (erroneously) uses it without DiagnosticSource.dll, i.e. all calls to Activity.Current should be guarded with a check that Activity is available.

  2. This is only needed to override a decision made by the auto-collector. But the auto-collector could respect Activity.Recorded itself and save the base SDK from overriding the sampling decision.

So HostingDiagnosticListener should check for Activity.Recorded.
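The proposed fix can be sketched as follows (Python pseudocode for the C# logic; the function and parameter names are illustrative, not the actual SDK API):

```python
def request_sampling_decision(activity_available, activity_recorded,
                              proactively_sampled_out):
    """Sketch of the check HostingDiagnosticListener could make when a
    request starts.

    activity_available guards the case where DiagnosticSource.dll (and
    therefore Activity) is missing; activity_recorded is the upstream
    sampled flag; proactively_sampled_out is the head sampling outcome.
    """
    if activity_available and activity_recorded:
        # Upstream sampled this transaction in: don't sample it out at head.
        return "SampledIn"
    return "SampledOut" if proactively_sampled_out else "None"
```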
