Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: Delegate envoy tracing decision to tracer to determine if should override sampling #2787

Merged
merged 6 commits into from
Mar 27, 2018

Conversation

objectiser
Copy link
Contributor

@objectiser objectiser commented Mar 13, 2018

The sampling decision is currently used to determine whether a span should be created/started. The decision is propagated to downstream services via the x-trace-id header.

The problem is that, if the x-trace-id is not propagated along with other trace context (e.g. the service does not propagate it due to using a zipkin compatible tracer in the service that does not know about this header), then the decision is lost in the downstream service and it will make its own sampling decision. This will lead to inconsistent fragmented traces.

The other problem is that any B3-sampled header received with a request will be ignored by the local sampling decision.

This PR deals with the situation by delegating the envoy proxy's tracing (sampling) decision down to the tracers themselves. There are essentially three tracers, which handle the decision in different ways:

  1. LightStep - this tracer only creates/starts a span if the tracing decision indicates it should - otherwise no span is created - as now. It is the responsibility of any proxied services to propagate the x-trace-id to ensure that decision is passed to downstream services.

  2. Zipkin - this tracer propagates its own sampling decision via the X-B3-Sampled header. If that header is not present, such as when creating the root span or it hasn't been propagated with the trace context, then the proxy's tracing (sampling) decision will be used to set the X-B3-Sampled header which will subsequently be passed to downstream services.

  3. Other OpenTracing compatible tracers - the only way to influence the sampling of a trace instance in the OpenTracing standard is through the use of the sampling.priority tag - so in this situation if the proxy sampling decision is false, the sampling.priority tag will be set to 0 when creating the span. It is then the tracer's responsibility to use or ignore that hint.

Risk Level: Medium

Testing: unit tests

Docs Changes:
none

Release Notes:
This change delegates the sampling decision to the tracer, which may have additional context that needs to be taken into account when determining the actual sampling state. If no such state already exists for a trace instance, then the sampling decision supplied by the Envoy proxy will be used.

@objectiser objectiser force-pushed the envoysampling branch 3 times, most recently from 24d43fe to a53fb24 Compare March 13, 2018 15:49
@mattklein123
Copy link
Member

Per new release note policy please add a release note with any applicable feature docs to https://github.com/envoyproxy/data-plane-api/blob/master/docs/root/intro/version_history.rst. Thank you!

@mattklein123
Copy link
Member

@objectiser is this still WIP or should this get reviewed?

@objectiser objectiser changed the title WIP tracing: Delegate envoy tracing decision to tracer to determine if should override sampling tracing: Delegate envoy tracing decision to tracer to determine if should override sampling Mar 17, 2018
@objectiser
Copy link
Contributor Author

@mattklein123 It should be ok to review now thanks.

@objectiser
Copy link
Contributor Author

@rnburn Would you be able to have a look at this PR, as it impacts the LightStep and OpenTracing drivers?

@@ -17,18 +18,20 @@ class SpanContext {
/**
* Default constructor. Creates an empty context.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if empty, please default to false if you must default. I'd hate to see trace id 0 accidentally sent to a zipkin endpoint due to a code error. False already says "don't send this"

bool sampled() const { return sampled_; }

/**
* @return the sampled flag as a 1-character string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style choice, but the docs here make me feel like using a literal somehow and then just proving below with a unit test. I don't know if there is an ability to use a literal '1' and '0' in cpp, so if that's not possible ignore me.

@@ -302,7 +302,8 @@ class Span : public ZipkinBase {
* Default constructor. Creates an empty span.
*/
Span()
: trace_id_(0), name_(), id_(0), debug_(false), monotonic_start_time_(0), tracer_(nullptr) {}
: trace_id_(0), name_(), id_(0), debug_(false), sampled_(true), monotonic_start_time_(0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again default to sampled false on dummy data

* Note that a function is needed because the string used to build the regex
* cannot be initialized statically.
*/
static const std::regex& notSampledRegex() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most don't use regex, rather '1' or 'true' literal comparison, anything else false (if present). regex tends to be a lot more expensive and warranted if there's a complex decision. If there's secondary reasons for regex, ignoreme

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this has already been changed in the main PR


// Only context header set is B3 sampled to indicate trace should not be sampled (using legacy
// 'false' value)
const std::string notSampled = "false";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invert this condition. more important to test 'true' than 'false'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added another test for 'true' case

EXPECT_TRUE(zipkin_span->span().sampled());
}

TEST_F(ZipkinDriverTest, NoB3ContextSampledFalse) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should a test class include one for absence of XB3Sampled from inbound headers? (I don't see it but might be blind)

Copy link
Contributor Author

@objectiser objectiser Mar 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following test does - the trace and span id are set, but the sampling decision is taken from envoy. I will add a nullptr check to make it clearer.

objectiser added a commit to objectiser/envoy that referenced this pull request Mar 19, 2018
@objectiser
Copy link
Contributor Author

objectiser commented Mar 19, 2018

@mattklein123 Rebased on the latest master - not sure if this one could make it into 1.6.0 aswell? If possible, then I'll create the release note.

@mattklein123
Copy link
Member

@objectiser yeah if we think this is ready I can go through later today. Please do the release note PR. Thanks!

@rnburn
Copy link
Contributor

rnburn commented Mar 19, 2018

Hey @objectiser, I'm going to add support for sampling.priority to lighstep's tracer. I think things would be a little cleaner if we only used that to specify sampling so that we don't need a special case for the LightStep tracer.

I'll also add constants for the standard tags to OT-C ++ so that they don't need to be hard-coded (opentracing/opentracing-cpp#38).

@objectiser
Copy link
Contributor Author

@rnburn That sounds good in the longer term, but if Matt wants to get this change into the 1.6.0 release, then it would be better to merge the current PR and then simplify it once LightStep supports the tag - as it would be a localised change?

@mattklein123
Copy link
Member

@objectiser @rnburn I think there is no urgency to get this into 1.6.0. If there is a cleaner way of doing this it would be great if the two if you can work together to make it happen. Thank you!

@objectiser
Copy link
Contributor Author

@mattklein123 Although not urgent, I think the current way of working will lead to problems when users start using sampling with services that are also instrumented with zipkin compatible tracer. For example, if the inbound proxy decides not to sample, currently it will not create a span and therefore there will be no B3 sampled header passed to the service - and therefore the service's tracer may decide to start a trace.

@rnburn's suggestion essentially means that these lines will be removed (along with the useTagForSamplingDecision method). This could be achieved in a separate PR once the new version of the LightStep tracer is ready.

@rnburn When would the updated LightStep tracer and OT C++ be available?

@rnburn
Copy link
Contributor

rnburn commented Mar 20, 2018

@objectiser - I can get the updates in today.

But I can also put the changes them in a later PR, if you'd prefer to go ahead with this as is.

@objectiser
Copy link
Contributor Author

@rnburn If you could get the changes in today that would be great.

@mandarjog
Copy link
Contributor

@objectiser does this mean http_conn_manager no longer decides what to trace? Will this affect the sampling percentage settings that were implemented there or are all tracers expected to implement them?

@objectiser
Copy link
Contributor Author

objectiser commented Mar 20, 2018

@mandarjog no, if the sampling decision for a particular trace instance has not yet been made, then it will use the decision determined by the http_conn_manager.

This PR is about ensuring that the sampling decision propagated to downstream service's via the tracer specific context (e.g. x-b3-sampled header) is not inadvertantly overridden by the downstream envoy proxy just because the service did not also forward the x-trace-id header - which may happen if a service uses a zipkin compatible tracer that handles the trace context propagation, but does not know about the x-tracer-id header.

The other issue as outlined previously is if the envoy proxy sampling decision is false, but the service being proxied also uses a zipkin compatible tracer, then the service's tracer may make its own sampling decision as no x-b3-sampled header was passed in.

@mattklein123
Copy link
Member

@objectiser @rnburn sorry, what's the TL/DR on this? Are we going to wait for a parallel @rnburn change to land? If we can have the good solution happen within a week or so should we just wait for that?

@objectiser
Copy link
Contributor Author

@mattklein123 Sounds like @rnburn could have his changes today/tomorrow - however they won't make a big difference to this PR, so would suggest reviewing this PR as is and once ready we either:

(i) merge and do the LightStep related changes in a separate PR
(ii) if the LightStep / OT C++ changes are ready in time, then incorporate those

…ould override sampling

Signed-off-by: Gary Brown <gary@brownuk.com>
@objectiser
Copy link
Contributor Author

objectiser commented Mar 26, 2018

Hi @mattklein123 would it be possible to review this PR soon and then when the lightstep changes are available we'll do a separate PR?

@mattklein123
Copy link
Member

@objectiser yes I will review today.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, some small comments. @rnburn can you review also? Thank you.

@@ -13,6 +13,19 @@ namespace Tracing {

enum class OperationName { Ingress, Egress };

enum class Reason {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that you just moved this code, but can you do us the favor of documenting each of these enum values now that it's in public interface code and other tracers might need to implement?

ClientForced,
};

struct Decision {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, can we add some docs?


struct Decision {
Reason reason;
bool is_tracing;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your code, but can we rename to "traced" potentially? (Or "perform_trace" or something like that)

@@ -88,7 +101,8 @@ class Driver {
* Start driver specific span.
*/
virtual SpanPtr startSpan(const Config& config, Http::HeaderMap& request_headers,
const std::string& operation_name, SystemTime start_time) PURE;
const std::string& operation_name, SystemTime start_time,
const Tracing::Decision& tracing_decision) PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would pass Tracing::Decision by value as it's a small object.

active_span_ = connection_manager_.tracer_.startSpan(*this, *request_headers_, request_info_,
tracing_decision);

if (!active_span_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember now, but when does this return nullptr? I thought we always have a tracer and this will always return a span (even if the null span).

Copy link
Contributor Author

@objectiser objectiser Mar 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, current behaviour at start of traceRequest() is to return if not being sampled, before the span is created. But this check will be removed when lightstep supports the sampling tag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. Got it. Thanks.

// If tracing decision is no, and sampling decision is not communicated via tags, then
// return a null span to indicate that tracing is not being performed.
if (!tracing_decision.is_tracing && !useTagForSamplingDecision()) {
return nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is related to my other comment. Should we return NullSpan here? I'm not sure if there are other points in the code that might be assuming span is never nullptr? Can you check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just replicating the current nullptr span if is_tracing is false. But this code will be removed once lightstep supports the sampling tag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry for the delay. I put in lightstep/lightstep-tracer-cpp#94, but there were some concerns around how to handle the propagation of spans with no sampling for lightstep.

I'll follow up with a separate PR once it gets sorted out.

@@ -164,8 +170,13 @@ SpanPtr OpenTracingDriver::startSpan(const Config&, Http::HeaderMap& request_hea
tracerStats().span_context_extraction_error_.inc();
}
}
active_span = tracer.StartSpan(operation_name, {opentracing::ChildOf(parent_span_ctx.get()),
opentracing::StartTimestamp(start_time)});
opentracing::StartSpanOptions options;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rnburn Can you review this part?

…Decision by value, changed is_tracing to traced

Signed-off-by: Gary Brown <gary@brownuk.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@objectiser LGTM, thanks. I would like to wait for a @rnburn review before merging. Hopefully he can take a look sometime soon.

opentracing::StartTimestamp(start_time)});
opentracing::StartSpanOptions options;
opentracing::ChildOf(parent_span_ctx.get()).Apply(options);
opentracing::StartTimestamp(start_time).Apply(options);
Copy link
Contributor

@rnburn rnburn Mar 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will work; though it would be more direct to write

options.references.emplace_back(opentracing::SpanReferenceType::ChildOfRef, parent_span_context.get());
options.start_system_timestamp = start_time;

Those objects are really just to make specifying options with the StartSpan method convenient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had trouble making this change - compilation errors. Possibly this could be done as part of your PR?

Copy link
Contributor

@rnburn rnburn Mar 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a typo (corrected now)

options.references.emplace_back(opentracing::SpanReferenceType::ChildOf, parent_span_context.get());

should have been

options.references.emplace_back(opentracing::SpanReferenceType::ChildOfRef, parent_span_context.get());

I can verify the code tomorrow.

opentracing::ChildOf(parent_span_ctx.get()).Apply(options);
opentracing::StartTimestamp(start_time).Apply(options);
if (!tracing_decision.traced) {
opentracing::SetTag(OpenTracingTags::get().SAMPLING_PRIORITY, 0).Apply(options);
Copy link
Contributor

@rnburn rnburn Mar 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a constant for sampling priority to match what's in the other APIs so you could write

options.tags.emplace_back(opentracing::ext::sampling_priority, 0);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use the new constant.

Signed-off-by: Gary Brown <gary@brownuk.com>
Signed-off-by: Gary Brown <gary@brownuk.com>
Signed-off-by: Gary Brown <gary@brownuk.com>
@objectiser
Copy link
Contributor Author

@rnburn Thanks - that worked. However seems to be a build problem on mac with the newer version of opentracing?

@rnburn
Copy link
Contributor

rnburn commented Mar 27, 2018

@objectiser - I put in a small change to the bazel build in opentracing/opentracing-cpp#69. It should work on OSX now.

Signed-off-by: Gary Brown <gary@brownuk.com>
@mattklein123
Copy link
Member

@rnburn @objectiser I'm not totally following the back and forth. Is this ready? @rnburn do you approve?

@rnburn
Copy link
Contributor

rnburn commented Mar 27, 2018

Yes, it looks good to me. 👍

@objectiser
Copy link
Contributor Author

@rnburn Thanks for the quick fix.

@mattklein123 Hopefully good to go now.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @objectiser and @rnburn!

@mattklein123 mattklein123 merged commit 85fe233 into envoyproxy:master Mar 27, 2018
@objectiser objectiser deleted the envoysampling branch March 27, 2018 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants