From 39ab45fef5a8bd125595a628492f922e17284a59 Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Fri, 16 Aug 2019 14:32:21 -0700 Subject: [PATCH 01/12] Propose sampling API changes --- text/0006-sampling.md | 156 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 156 insertions(+) create mode 100644 text/0006-sampling.md diff --git a/text/0006-sampling.md b/text/0006-sampling.md new file mode 100644 index 000000000..a9b0459a2 --- /dev/null +++ b/text/0006-sampling.md @@ -0,0 +1,156 @@ +# Sampling API + +*Status: proposed* + +## TL;DR +This section tries to summarize all the changes proposed in this RFC: + 1. Move the `Sampler` interface from the API to SDK package. Apply some minor changes to the + `Sampler` API. + 1. Add a new `SamplerHint` concept to the API package. + 1. Add capability to record `Attributes` that can be used for sampling decision during the `Span` + creation time. + 1. Add capability to start building a `Span` with a delayed `build` method. This is useful for + cases where some `Attributes` that are useful for sampling are not available when start building + the `Span`. As an example in Java the current `Span.Builder` will use as a start time for the + `Span` the moment when the builder is created and not the moment when the `build()` method is + called. + +## Motivation + +Different users of OpenTelemetry, ranging from library developers, packaged infrastructure binary +developers, application developers, operators, and telemetry system owners, have separate use cases +for OpenTelemetry that have gotten muddled in the design of the original Sampling API. Thus, we need +to clarify what APIs each should be able to depend upon, and how they will configure sampling and +OpenTelemetry according to their needs. + +![Personas](https://i.imgur.com/w1H0CfH.png) + +## Explanation + +We outline five different use cases (who may be overlapping sets of people), and how they should +interact with OpenTelemetry: + +### Library developer +Examples: gRPC, Express, Django developers. + + * They must only depend upon the OTel API and not upon the SDK. + * They are shipping source code that will be linked into others' applications. + * They have no explicit runtime control over the application. + * They know some signal about what traces may be interesting (e.g. unusual control plane requests) + or uninteresting (e.g. health-checks), but have to write fully generically. + +**Solution:** + + * On the start Span operation, the OpenTelemetry API will allow marking a span with one of three + choices for the SamplingHint, with "don't care" as the default: [`don't care`, `suggest keeping`, + `suggest discarding`] + +### Infrastructure package/binary developer +Examples: HBase, Envoy developers. + + * They are shipping self-contained binaries that may accept YAML or similar run-time configuration, + but are not expected to support extensibility/plugins beyond the default OTel SDK, OTel SDKTracer, + and OTel wire format exporter. + * They may have their own recommendations for sampling rates, but don't run the binaries in + production, only provide packaged binaries. So their sampling rate configs, and sampling strategies + need to a finite "built in" set from OpenTelemetry's SDK. + * They need to deal with upstream sampling decisions made by services calling them. + +**Solution:** + * Allow different sampling strategies by default in OTel SDK, all configurable easily via YAML or + future flags, etc.: + * Trust parent sampling decision (trusting & propagating parent SpanContext SampleBit) + * Always keep + * Never keep + * Keep with 1/N probability + +### Application developer +These are the folks we've been thinking the most about for OTel in general. + + * They have full control over the OTel implementation or SDK configuration. When using the SDK they + can configure custom exporters, custom code/samplers, etc. + * They can choose to implement runtime configuration via a variety of means (e.g. baking in feature + flags, reading YAML files, etc.), or even configure the library in code. + * They make heavy usage of OTel for instrumenting application-specific behavior, beyond what may be + provided by the libraries they use such as gRPC, Django, etc. + +**Solution:** + * Allow application developers to link in custom samplers or write their own when using the + official SDK. + * These might include dynamic per-field sampling to achieve a target rate + (e.g. https://github.com/honeycombio/dynsampler-go) + * Sampling decisions are made within the start Span operation, after attributes relevant to the + span have been added to the Span start operation but before a concrete Span object exists (so that + either a NoOpSpan can be made, or an actual Span instance can be produced depending upon the + sampler's decision). + * Span.IsRecording() needs to be present to allow costly span attribute/log computation to be + skipped if the span is a NoOp span. + +### Application operator +Often the same people as the application developers, but not necessarily + + * They care about adjusting sampling rates and strategies to meet operational needs, debugging, + and cost. + +**Solution:** + * Use config files or feature flags written by the application developers to control the + application sampling logic. + * Use the config files to configure libraries and infrastructure package behavior. + +### Telemetry infrastructure owner +They are the people who provide an implementation for the OTel API by using the SDK with custom +`Exporter`s, `Sampler`s, hooks, etc. or by writing a custom implementation, as well as running the +infrastructure for collecting exported traces. + + * They care about a variety of things, including efficiency, cost effectiveness, and being able to + gather spans in a way that makes sense for them. + +**Solution:** + * Infrastructure owners receive information attached to the span, after sampling hooks have already + been run. + +## Internal details +The interface for the Sampler class takes in: + * `TraceID` + * `SpanID` + * Parent `SpanContext` if any + * `Links` + * Initial set of `Attributes` for the `Span` being constructed + +It produces as an output: +* A boolean indicating whether to sample or drop the span. +* The new set of initial span Attributes (or passes along the SpanAttributes unmodified) +* (under discussion in separate RFC) the SamplingRate float. + +## Trade-offs + * We considered, instead of using the `SpanBuilder`, setting the sampler on the Span constructor, and + requiring any `Attributes` to be populated prior to the start of the span's default start time. + * We considered, instead of using the `SpanBuilder`, setting the `Sampler` and the `Attributes` + used for the sampler before running an explicit MakeSamplingDecision() on the span. Attempts to + create a child of the span would fail if MakeSamplingDecision() had not yet been run. + * We considered allowing the sampling decision to be arbitrarily delayed. + +## Prior art and alternatives +Prior art for Zipkin, and other Dapper based systems: all client-side sampling decisions are made at +head. Thus, we need to retain compatibility with this. + +## Open questions +This RFC does not necessarily resolve the question of how to propagate sampling rate values between +different spans and processes. A separate RFC will be opened to cover this case. + +## Future possibilities +In the future, we propose that library developers may be able to defer the decision on whether to +recommend the trace be sampled or not sampled until mid-way through execution; + +## Related Issues + * [opentelemetry-specification/189](https://github.com/open-telemetry/opentelemetry-specification/issues/189) + * [opentelemetry-specification/187](https://github.com/open-telemetry/opentelemetry-specification/issues/187) + * [opentelemetry-specification/164](https://github.com/open-telemetry/opentelemetry-specification/issues/164) + * [opentelemetry-specification/125](https://github.com/open-telemetry/opentelemetry-specification/issues/125) + * [opentelemetry-specification/87](https://github.com/open-telemetry/opentelemetry-specification/issues/87) + * [opentelemetry-specification/66](https://github.com/open-telemetry/opentelemetry-specification/issues/66) + * [opentelemetry-specification/65](https://github.com/open-telemetry/opentelemetry-specification/issues/65) + * [opentelemetry-specification/53](https://github.com/open-telemetry/opentelemetry-specification/issues/53) + * [opentelemetry-specification/33](https://github.com/open-telemetry/opentelemetry-specification/issues/33) + * [opentelemetry-specification/32](https://github.com/open-telemetry/opentelemetry-specification/issues/32) + * [opentelemetry-specification/31](https://github.com/open-telemetry/opentelemetry-specification/issues/31) From 533bcd3bc54cdc14ae9bae776b8dc32879173bc1 Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Sat, 17 Aug 2019 11:04:32 -0700 Subject: [PATCH 02/12] Add details about SamplingHint, Sampler interface and default samplers. --- text/0006-sampling.md | 137 ++++++++++++++++++++++++++++++++++-------- 1 file changed, 111 insertions(+), 26 deletions(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index a9b0459a2..98650fac6 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -14,6 +14,8 @@ This section tries to summarize all the changes proposed in this RFC: the `Span`. As an example in Java the current `Span.Builder` will use as a start time for the `Span` the moment when the builder is created and not the moment when the `build()` method is called. + 1. Remove `addLink` APIs from the `Span` interface, and allow recording links only during the span + construction time. ## Motivation @@ -33,7 +35,7 @@ interact with OpenTelemetry: ### Library developer Examples: gRPC, Express, Django developers. - * They must only depend upon the OTel API and not upon the SDK. + * They must only depend upon the OpenTelemetry API and not upon the SDK. * They are shipping source code that will be linked into others' applications. * They have no explicit runtime control over the application. * They know some signal about what traces may be interesting (e.g. unusual control plane requests) @@ -42,37 +44,32 @@ Examples: gRPC, Express, Django developers. **Solution:** * On the start Span operation, the OpenTelemetry API will allow marking a span with one of three - choices for the SamplingHint, with "don't care" as the default: [`don't care`, `suggest keeping`, - `suggest discarding`] + choices for the [SamplingHint](#samplinghint). ### Infrastructure package/binary developer Examples: HBase, Envoy developers. * They are shipping self-contained binaries that may accept YAML or similar run-time configuration, - but are not expected to support extensibility/plugins beyond the default OTel SDK, OTel SDKTracer, - and OTel wire format exporter. + but are not expected to support extensibility/plugins beyond the default OpenTelemetry SDK, + OpenTelemetry SDKTracer, and OpenTelemetry wire format exporter. * They may have their own recommendations for sampling rates, but don't run the binaries in production, only provide packaged binaries. So their sampling rate configs, and sampling strategies need to a finite "built in" set from OpenTelemetry's SDK. * They need to deal with upstream sampling decisions made by services calling them. **Solution:** - * Allow different sampling strategies by default in OTel SDK, all configurable easily via YAML or - future flags, etc.: - * Trust parent sampling decision (trusting & propagating parent SpanContext SampleBit) - * Always keep - * Never keep - * Keep with 1/N probability + * Allow different sampling strategies by default in OpenTelemetry SDK, all configurable easily via + YAML or future flags. See [default samplers](#default-samplers). ### Application developer -These are the folks we've been thinking the most about for OTel in general. +These are the folks we've been thinking the most about for OpenTelemetry in general. - * They have full control over the OTel implementation or SDK configuration. When using the SDK they - can configure custom exporters, custom code/samplers, etc. + * They have full control over the OpenTelemetry implementation or SDK configuration. When using the + SDK they can configure custom exporters, custom code/samplers, etc. * They can choose to implement runtime configuration via a variety of means (e.g. baking in feature flags, reading YAML files, etc.), or even configure the library in code. - * They make heavy usage of OTel for instrumenting application-specific behavior, beyond what may be - provided by the libraries they use such as gRPC, Django, etc. + * They make heavy usage of OpenTelemetry for instrumenting application-specific behavior, beyond + what may be provided by the libraries they use such as gRPC, Django, etc. **Solution:** * Allow application developers to link in custom samplers or write their own when using the @@ -98,9 +95,9 @@ Often the same people as the application developers, but not necessarily * Use the config files to configure libraries and infrastructure package behavior. ### Telemetry infrastructure owner -They are the people who provide an implementation for the OTel API by using the SDK with custom -`Exporter`s, `Sampler`s, hooks, etc. or by writing a custom implementation, as well as running the -infrastructure for collecting exported traces. +They are the people who provide an implementation for the OpenTelemetry API by using the SDK with +custom `Exporter`s, `Sampler`s, hooks, etc. or by writing a custom implementation, as well as +running the infrastructure for collecting exported traces. * They care about a variety of things, including efficiency, cost effectiveness, and being able to gather spans in a way that makes sense for them. @@ -110,21 +107,106 @@ infrastructure for collecting exported traces. been run. ## Internal details -The interface for the Sampler class takes in: + +### Sampling flags +OpenTelemetry API has two flags/properties: + * `RecordEvents` + * This property is exposed in the `Span` interface. + * If `true` the current `Span` records tracing events (attributes, events, status, etc.), + otherwise all tracing events are dropped. + * Users can use this property to determine if expensive trace events can be avoided. + * `SampledFlag` + * This flag is propagated via the `TraceOptions` (may be renamed to `TraceFlags` in a different + PR) to the child Spans. For more details see the w3c definition [here][trace-flags]. + * In Dapper based systems this is equivalent to `Span` being `sampled` and exported. + +The flag combination `SampledFlag == false` and `RecordEvents == true` means that the current `Span` +does record tracing events, but most likely the child `Span` will not. This combination is +necessary because: + * Allow users to control recording for individual Spans. + * OpenCensus has this to support z-pages, so we need to keep backwards compatibility. + +The flag combination `SampledFlag == true` and `RecordEvents == false` can cause gaps in the +distributed trace, and because of this OpenTelemetry API should NOT allow this combination. + +It is safe to assume that users of the API should only access the `RecordEvents` property when +instrumenting code and never access `SampledFlag` unless used in context propagators. + +### SamplingHint +This is a new concept added in the OpenTelemetry API that allows to suggest sampling hints to the +implementation of the API: + * `UNSPECIFIED` + * This is the default option. + * No suggestion is made. + * `NOT_RECORD` + * Suggest to not `RecordEvents = false` and not propagate `SampledFlag = false`. + * `RECORD` + * Suggest `RecordEvents = true` and `SampledFlag = false`. + * `RECORD_AND_PROPAGATE` + * Suggest to `RecordEvents = true` and propagate `SampledFlag = true`. + +### Sampler interface +The interface for the Sampler class that is available only in the OpenTelemetry SDK: * `TraceID` * `SpanID` * Parent `SpanContext` if any + * `SamplerHint` * `Links` * Initial set of `Attributes` for the `Span` being constructed -It produces as an output: -* A boolean indicating whether to sample or drop the span. -* The new set of initial span Attributes (or passes along the SpanAttributes unmodified) -* (under discussion in separate RFC) the SamplingRate float. +It produces as an output called `SamplingResult`: + * A `SamplingDecision` enum [`NOT_RECORD`, `RECORD`, `RECORD_AND_PROPAGATE`]. + * A set of span Attributes that will also be added to the `Span`. + * These attributes will be added after the initial set of `Attributes`. + * (under discussion in separate RFC) the SamplingRate float. + +### Default Samplers +These are the default samplers implemented in the OpenTelemetry SDK: + * ALWAYS_ON + * Ignores all values in SamplingHint. + * ALWAYS_OFF + * Ignores all values in SamplingHint. + * ALWAYS_PARENT + * Ignores all values in SamplingHint. + * Trust parent sampling decision (trusting and propagating parent `SampledFlag`). + * For root Spans (no parent available) returns `NOT_RECORD`. + * Probability + * Allows users to configure to ignore or not the SamplingHint for every value different than + `UNSPECIFIED`. + * Default is to NOT ignore `NOT_RECORD` and `RECORD_AND_PROPAGATE` but ignores `RECORD`. + * Allows users to configure to ignore the parent `SampledFlag`. + * Allows users to configure if probability applies only for "root spans", "root spans and remote + parent", or "all spans". + * Default is to apply only for "root spans and remote parent". + * Remote parent property should be added to the SpanContext see specs [PR/216][specs-pr-216] + * Sample with 1/N probability + +**Root Span Decision:** + +|Decision|ALWAYS_ON|ALWAYS_OFF|ALWAYS_PARENT|Probability| +|---|---|---|---|---| +|RecordEvents|`True`|`False`|`False`|`SamplingHint==RECORD OR SampledFlag()`| +|SampledFlag|`True`|`False`|`False`|`SamplingHint==RECORD_AND_PROPAGATE OR Probability`| + +**Child Span Decision:** + +|Decision|ALWAYS_ON|ALWAYS_OFF|ALWAYS_PARENT|Probability| +|---|---|---|---|---| +|RecordEvents|`True`|`False`|`ParentSampledFlag`|`SamplingHint==RECORD OR SampledFlag()`| +|SampledFlag|`True`|`False`|`ParentSampledFlag`|`ParentSampledFlag OR SamplingHint==RECORD_AND_PROPAGATE OR Probability`| + +### Links +This RFC proposes that Links will be recorded only during the start `Span` operation, because: +* Link's `SampledFlag` can be used in the sampling decision. +* OpenTracing supports adding references only during the `Span` creation. +* OpenCensus supports adding links at any moment, but this was mostly used to record child Links +which are not supported in OpenTelemetry. +* Allowing links to be recorded after the sampling decision is made will cause samplers to not +work correctly and unexpected behaviors for sampling. ## Trade-offs - * We considered, instead of using the `SpanBuilder`, setting the sampler on the Span constructor, and - requiring any `Attributes` to be populated prior to the start of the span's default start time. + * We considered, instead of using the `SpanBuilder`, setting the sampler on the Span constructor, + and requiring any `Attributes` to be populated prior to the start of the span's default start time. * We considered, instead of using the `SpanBuilder`, setting the `Sampler` and the `Attributes` used for the sampler before running an explicit MakeSamplingDecision() on the span. Attempts to create a child of the span would fail if MakeSamplingDecision() had not yet been run. @@ -154,3 +236,6 @@ recommend the trace be sampled or not sampled until mid-way through execution; * [opentelemetry-specification/33](https://github.com/open-telemetry/opentelemetry-specification/issues/33) * [opentelemetry-specification/32](https://github.com/open-telemetry/opentelemetry-specification/issues/32) * [opentelemetry-specification/31](https://github.com/open-telemetry/opentelemetry-specification/issues/31) + +[trace-flags]: https://github.com/w3c/trace-context/blob/master/spec/20-http_header_format.md#trace-flags +[specs-pr-216]: https://github.com/open-telemetry/opentelemetry-specification/pull/216 \ No newline at end of file From 0802449b7024d79a0407796748c33822c4f16ea8 Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Sat, 17 Aug 2019 13:28:19 -0700 Subject: [PATCH 03/12] Add details about start span operation. --- text/0006-sampling.md | 91 +++++++++++++++++++++++++++++++++++-------- 1 file changed, 75 insertions(+), 16 deletions(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index 98650fac6..a6c9f0341 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -107,6 +107,19 @@ running the infrastructure for collecting exported traces. been run. ## Internal details +In Dapper based systems (or systems without a deferred sampling decision) all exported spans are +stored to the backend, thus some of these systems usually don't scale to a high volume of traces, +or the cost to store all the Spans may be too high. In order to support this use-case and to +ensure the quality of the data we send, OpenTelemetry needs to natively support sampling with some +requirements: + * Send as many complete traces as possible. Sending just a subset of the spans from a trace is + less useful because in this case the interaction between the spans may miss. + * Allow application operator to configure the sampling frequency. + +For new modern systems that need to collect all the Spans and later may or may not do a deferred +sampling decision, the OpenTelemetry needs to natively support a way to configure the library to +collect and export all the Spans. This is possible even though OpenTelemetry supports sampling by +setting a default config to always collect all the spans. ### Sampling flags OpenTelemetry API has two flags/properties: @@ -183,17 +196,21 @@ These are the default samplers implemented in the OpenTelemetry SDK: **Root Span Decision:** -|Decision|ALWAYS_ON|ALWAYS_OFF|ALWAYS_PARENT|Probability| -|---|---|---|---|---| -|RecordEvents|`True`|`False`|`False`|`SamplingHint==RECORD OR SampledFlag()`| -|SampledFlag|`True`|`False`|`False`|`SamplingHint==RECORD_AND_PROPAGATE OR Probability`| +|Sampler|RecordEvents|SampledFlag| +|---|---|---| +|ALWAYS_ON|`True`|`True`| +|ALWAYS_OFF|`False`|`False`| +|ALWAYS_PARENT|`False`|`False`| +|Probability|`SamplingHint==RECORD OR SampledFlag`|`SamplingHint==RECORD_AND_PROPAGATE OR Probability`| **Child Span Decision:** -|Decision|ALWAYS_ON|ALWAYS_OFF|ALWAYS_PARENT|Probability| -|---|---|---|---|---| -|RecordEvents|`True`|`False`|`ParentSampledFlag`|`SamplingHint==RECORD OR SampledFlag()`| -|SampledFlag|`True`|`False`|`ParentSampledFlag`|`ParentSampledFlag OR SamplingHint==RECORD_AND_PROPAGATE OR Probability`| +|Sampler|RecordEvents|SampledFlag| +|---|---|---| +|ALWAYS_ON|`True`|`True`| +|ALWAYS_OFF|`False`|`False`| +|ALWAYS_PARENT|`ParentSampledFlag`|`ParentSampledFlag`| +|Probability|`SamplingHint==RECORD OR SampledFlag`|`ParentSampledFlag OR SamplingHint==RECORD_AND_PROPAGATE OR Probability`| ### Links This RFC proposes that Links will be recorded only during the start `Span` operation, because: @@ -204,13 +221,54 @@ which are not supported in OpenTelemetry. * Allowing links to be recorded after the sampling decision is made will cause samplers to not work correctly and unexpected behaviors for sampling. -## Trade-offs - * We considered, instead of using the `SpanBuilder`, setting the sampler on the Span constructor, - and requiring any `Attributes` to be populated prior to the start of the span's default start time. - * We considered, instead of using the `SpanBuilder`, setting the `Sampler` and the `Attributes` - used for the sampler before running an explicit MakeSamplingDecision() on the span. Attempts to - create a child of the span would fail if MakeSamplingDecision() had not yet been run. - * We considered allowing the sampling decision to be arbitrarily delayed. +### When does sampling happen? +The sampling decision will happen before a real `Span` object is returned to the user, because: + * If child spans are created they need to know the 'SampledFlag'. + * If `SpanContext` is propagated on the wire the 'SampledFlag' needs to be set. + * If user records any tracing event the `Span` object needs to know if the data are kept or not. + It may be possible to always collect all the events until the sampling decision is made but this is + an important optimization. + +There are two important use-cases to be considered: + * All information that may be used for sampling decisions are available at the moment when the + logical `Span` operation should start. This is the most common case. + * Some information that may be used for sampling decision are NOT available at the moment when the + logical `Span` operation should start (e.g. `http.route` may be determine later). + +The current [span creation logic][span-creation] it facilitate very well the first use-case, but +the second use-case requires users to record the logical `start_time` and collect all the +information necessarily to start the `Span` in custom objects, then when all the information are +available call the span creation API. + +The RFC proposes that: + * For languages where a `Builder` pattern is used to construct a `Span`, to allow users to create a + `Builder` where the start time of the Span is considered when the `Builder` is created. + * For languages where no intermediate object is used to construct a `Span`, to allow users maybe + via a `StartSpanOption` object to start a `Span`. The `StartSpanOption` allows users to set all + the start `Span` properties. + +**Alternatives considerations:** + * We considered, instead of requiring that sampling decision happens before the `Span` is + created to add an explicit `MakeSamplingDecision(SamplingHint)` on the `Span`. Attempts to create + a child `Span`, or to access the `SpanContext` would fail if `MakeSamplingDecision()` had not yet + been run. + * Pros: + * Simplifies the case when all the attributes that may be used for sampling are not available + when the logical `Span` operation should start. + * Cons: + * The most common case would have required an extra API call. + * Error prone, users may forget to call the extra API. + * Unexpected and hard to find errors if user tries to create a child `Span` before calling + MakeSamplingDecision(). + * We considered allowing the sampling decision to be arbitrarily delayed, but guaranteed before + any child `Span` is created, or `SpanContext` is accessed, or before `Span.end()` finished. + * Pros: + * Similar and smaller API that supports both use-cases defined ahead. + * Cons: + * If `SamplingHint` needs to also be delayed recorded then an extra API on Span is required + to set this. + * Does not allow optimization to not record tracing events, all tracing events MUST be + recorded before the sampling decision is made. ## Prior art and alternatives Prior art for Zipkin, and other Dapper based systems: all client-side sampling decisions are made at @@ -238,4 +296,5 @@ recommend the trace be sampled or not sampled until mid-way through execution; * [opentelemetry-specification/31](https://github.com/open-telemetry/opentelemetry-specification/issues/31) [trace-flags]: https://github.com/w3c/trace-context/blob/master/spec/20-http_header_format.md#trace-flags -[specs-pr-216]: https://github.com/open-telemetry/opentelemetry-specification/pull/216 \ No newline at end of file +[specs-pr-216]: https://github.com/open-telemetry/opentelemetry-specification/pull/216 +[span-creation]: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#span-creation \ No newline at end of file From f860e25413a49989eaea1379e2c62d8331278a1c Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Mon, 19 Aug 2019 17:03:41 -0700 Subject: [PATCH 04/12] Update spelling text/0006-sampling.md Co-Authored-By: Yang Song --- text/0006-sampling.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index a6c9f0341..03c94e2c0 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -117,7 +117,7 @@ requirements: * Allow application operator to configure the sampling frequency. For new modern systems that need to collect all the Spans and later may or may not do a deferred -sampling decision, the OpenTelemetry needs to natively support a way to configure the library to +sampling decision, OpenTelemetry needs to natively support a way to configure the library to collect and export all the Spans. This is possible even though OpenTelemetry supports sampling by setting a default config to always collect all the spans. @@ -297,4 +297,4 @@ recommend the trace be sampled or not sampled until mid-way through execution; [trace-flags]: https://github.com/w3c/trace-context/blob/master/spec/20-http_header_format.md#trace-flags [specs-pr-216]: https://github.com/open-telemetry/opentelemetry-specification/pull/216 -[span-creation]: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#span-creation \ No newline at end of file +[span-creation]: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#span-creation From 9cbc0f3184ca53df161bc8206432145db7d9b84a Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Mon, 19 Aug 2019 17:03:55 -0700 Subject: [PATCH 05/12] Update spelling text/0006-sampling.md Co-Authored-By: Yang Song --- text/0006-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index 03c94e2c0..d39b29cce 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -235,7 +235,7 @@ There are two important use-cases to be considered: * Some information that may be used for sampling decision are NOT available at the moment when the logical `Span` operation should start (e.g. `http.route` may be determine later). -The current [span creation logic][span-creation] it facilitate very well the first use-case, but +The current [span creation logic][span-creation] facilitates very well the first use-case, but the second use-case requires users to record the logical `start_time` and collect all the information necessarily to start the `Span` in custom objects, then when all the information are available call the span creation API. From d25d4335a0255424bf630d06bb077fcb279059dd Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Mon, 19 Aug 2019 17:24:53 -0700 Subject: [PATCH 06/12] Add span name to the Sampler interface. Clarify how the sampling flags are exposed. --- text/0006-sampling.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index d39b29cce..3fbd751e1 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -36,6 +36,7 @@ interact with OpenTelemetry: Examples: gRPC, Express, Django developers. * They must only depend upon the OpenTelemetry API and not upon the SDK. + * For testing only they may depend on the SDK with InMemoryExporter. * They are shipping source code that will be linked into others' applications. * They have no explicit runtime control over the application. * They know some signal about what traces may be interesting (e.g. unusual control plane requests) @@ -124,13 +125,13 @@ setting a default config to always collect all the spans. ### Sampling flags OpenTelemetry API has two flags/properties: * `RecordEvents` - * This property is exposed in the `Span` interface. + * This property is exposed in the `Span` interface (e.g. `Span.isRecordingEvents()`). * If `true` the current `Span` records tracing events (attributes, events, status, etc.), otherwise all tracing events are dropped. * Users can use this property to determine if expensive trace events can be avoided. - * `SampledFlag` - * This flag is propagated via the `TraceOptions` (may be renamed to `TraceFlags` in a different - PR) to the child Spans. For more details see the w3c definition [here][trace-flags]. + * `SampledFlag` + * This flag is propagated via the `TraceOptions` to the child Spans (e.g. + `TraceOptions.isSampled()`). For more details see the w3c definition [here][trace-flags]. * In Dapper based systems this is equivalent to `Span` being `sampled` and exported. The flag combination `SampledFlag == false` and `RecordEvents == true` means that the current `Span` @@ -165,6 +166,7 @@ The interface for the Sampler class that is available only in the OpenTelemetry * Parent `SpanContext` if any * `SamplerHint` * `Links` + * Span name * Initial set of `Attributes` for the `Span` being constructed It produces as an output called `SamplingResult`: @@ -235,9 +237,9 @@ There are two important use-cases to be considered: * Some information that may be used for sampling decision are NOT available at the moment when the logical `Span` operation should start (e.g. `http.route` may be determine later). -The current [span creation logic][span-creation] facilitates very well the first use-case, but -the second use-case requires users to record the logical `start_time` and collect all the -information necessarily to start the `Span` in custom objects, then when all the information are +The current [span creation logic][span-creation] facilitates very well the first use-case, but +the second use-case requires users to record the logical `start_time` and collect all the +information necessarily to start the `Span` in custom objects, then when all the properties are available call the span creation API. The RFC proposes that: From 34fb2c95fc76dfb43624cd0321f74c88df552e00 Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Tue, 20 Aug 2019 11:19:58 -0700 Subject: [PATCH 07/12] Use ascii for personas --- text/0006-sampling.md | 33 +++++++++++++++++++++++++++++++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index 3fbd751e1..80af4c6fe 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -25,8 +25,37 @@ for OpenTelemetry that have gotten muddled in the design of the original Samplin to clarify what APIs each should be able to depend upon, and how they will configure sampling and OpenTelemetry according to their needs. -![Personas](https://i.imgur.com/w1H0CfH.png) - +``` + + +----------+ +-----------+ + grpc | Library | | | + Django | Devs +---------->| OTel API | + Express | | +------>| | + +----------+ | +--->+-----------+ +---------+ + | | ^ | OTel | + | | | +->| Proxy +---+ + | | | | | | | + +----------+ | | +-----+-----+------------+ | +---------+ | + | | | | | | OTel Wire | | | + Hbase | Infra | | | | | Export |+-+ v + Envoy | Binary +---+ | | OTel | | | +----v-----+ + | Devs | | | SDK +------------+ | | | + +----------+---------->| | | +---------->| Backend | + +------>| | Custom | +---------->| | + | | | | Export | | +----------+ + +----------+ | | | | |+-+ ^ + | +---+ | +-----------+------------+ | + | App +------+ ^ ^ | + | Devs + | | +------------+-+ + | | | | | | + +----------+ +---+----+ +----------+ Telemetry | + | SRE | | Owner | + | | | | + +--------+ +--------------+ + Lightstep + Honeycomb + +``` ## Explanation We outline five different use cases (who may be overlapping sets of people), and how they should From 80bd99afbcc23e4139b6777751dc26a3cb2fc240 Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Tue, 20 Aug 2019 11:41:37 -0700 Subject: [PATCH 08/12] Add spankind to the sampler input. --- text/0006-sampling.md | 1 + 1 file changed, 1 insertion(+) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index 80af4c6fe..b5173743e 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -196,6 +196,7 @@ The interface for the Sampler class that is available only in the OpenTelemetry * `SamplerHint` * `Links` * Span name + * `SpanKind` * Initial set of `Attributes` for the `Span` being constructed It produces as an output called `SamplingResult`: From 6fbbc4111294524240f4fc1d88fde0c33a5e51d9 Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Thu, 22 Aug 2019 14:29:56 -0700 Subject: [PATCH 09/12] Remove delayed span creation proposal from this RFC --- text/0006-sampling.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index b5173743e..75c527aeb 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -9,11 +9,6 @@ This section tries to summarize all the changes proposed in this RFC: 1. Add a new `SamplerHint` concept to the API package. 1. Add capability to record `Attributes` that can be used for sampling decision during the `Span` creation time. - 1. Add capability to start building a `Span` with a delayed `build` method. This is useful for - cases where some `Attributes` that are useful for sampling are not available when start building - the `Span`. As an example in Java the current `Span.Builder` will use as a start time for the - `Span` the moment when the builder is created and not the moment when the `build()` method is - called. 1. Remove `addLink` APIs from the `Span` interface, and allow recording links only during the span construction time. @@ -272,14 +267,21 @@ the second use-case requires users to record the logical `start_time` and collec information necessarily to start the `Span` in custom objects, then when all the properties are available call the span creation API. -The RFC proposes that: - * For languages where a `Builder` pattern is used to construct a `Span`, to allow users to create a - `Builder` where the start time of the Span is considered when the `Builder` is created. - * For languages where no intermediate object is used to construct a `Span`, to allow users maybe - via a `StartSpanOption` object to start a `Span`. The `StartSpanOption` allows users to set all - the start `Span` properties. +The RFC proposes that we keep the current [span creation logic][span-creation] as it is and we will +address the delayed sampling in a different RFC when that becomes a high priority. **Alternatives considerations:** + * We considered, to offer a delayed span construction mechanism: + * For languages where a `Builder` pattern is used to construct a `Span`, to allow users to + create a `Builder` where the start time of the Span is considered when the `Builder` is created. + * For languages where no intermediate object is used to construct a `Span`, to allow users maybe + via a `StartSpanOption` object to start a `Span`. The `StartSpanOption` allows users to set all + the start `Span` properties. + * Pros: + * Would resolve the second use-case posted above. + * Cons: + * We could not identify too many real case examples for the second use-case and decided to + postpone the decision to avoid premature decisions. * We considered, instead of requiring that sampling decision happens before the `Span` is created to add an explicit `MakeSamplingDecision(SamplingHint)` on the `Span`. Attempts to create a child `Span`, or to access the `SpanContext` would fail if `MakeSamplingDecision()` had not yet From eeab6a2d1342abb58ad5e7c0aadaaa4b1ab32310 Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Thu, 22 Aug 2019 14:39:37 -0700 Subject: [PATCH 10/12] Remove unspecified type from sampler hint. --- text/0006-sampling.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index 75c527aeb..4e446560b 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -173,9 +173,6 @@ instrumenting code and never access `SampledFlag` unless used in context propaga ### SamplingHint This is a new concept added in the OpenTelemetry API that allows to suggest sampling hints to the implementation of the API: - * `UNSPECIFIED` - * This is the default option. - * No suggestion is made. * `NOT_RECORD` * Suggest to not `RecordEvents = false` and not propagate `SampledFlag = false`. * `RECORD` @@ -183,6 +180,10 @@ implementation of the API: * `RECORD_AND_PROPAGATE` * Suggest to `RecordEvents = true` and propagate `SampledFlag = true`. +The default option for the span creation is to not have any suggestion (or suggestion is not +specified). This can be implemented by using `null` as the default option or any language specific +mechanism to achieve the same result. + ### Sampler interface The interface for the Sampler class that is available only in the OpenTelemetry SDK: * `TraceID` From 13158f39c23f1d7753beacebd679119664a72728 Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Fri, 23 Aug 2019 07:38:11 -0700 Subject: [PATCH 11/12] Update text/0006-sampling.md Co-Authored-By: Tristan Sloughter --- text/0006-sampling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index 4e446560b..16f15f565 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -84,7 +84,7 @@ Examples: HBase, Envoy developers. **Solution:** * Allow different sampling strategies by default in OpenTelemetry SDK, all configurable easily via - YAML or future flags. See [default samplers](#default-samplers). + YAML or feature flags. See [default samplers](#default-samplers). ### Application developer These are the folks we've been thinking the most about for OpenTelemetry in general. From b127beae884165de94b322a77474a4e17d08dbda Mon Sep 17 00:00:00 2001 From: Bogdan Drutu Date: Mon, 26 Aug 2019 12:09:32 -0700 Subject: [PATCH 12/12] Add clarification that Sampler is always called. --- text/0006-sampling.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/text/0006-sampling.md b/text/0006-sampling.md index 16f15f565..743d41a9b 100644 --- a/text/0006-sampling.md +++ b/text/0006-sampling.md @@ -200,7 +200,7 @@ It produces as an output called `SamplingResult`: * A set of span Attributes that will also be added to the `Span`. * These attributes will be added after the initial set of `Attributes`. * (under discussion in separate RFC) the SamplingRate float. - + ### Default Samplers These are the default samplers implemented in the OpenTelemetry SDK: * ALWAYS_ON @@ -271,6 +271,8 @@ available call the span creation API. The RFC proposes that we keep the current [span creation logic][span-creation] as it is and we will address the delayed sampling in a different RFC when that becomes a high priority. +The SDK must call the `Sampler` every time a `Span` is created during the start span operation. + **Alternatives considerations:** * We considered, to offer a delayed span construction mechanism: * For languages where a `Builder` pattern is used to construct a `Span`, to allow users to