-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tracing: Delegate envoy tracing decision to tracer to determine if should override sampling #2787
Conversation
24d43fe
to
a53fb24
Compare
Per new release note policy please add a release note with any applicable feature docs to https://github.com/envoyproxy/data-plane-api/blob/master/docs/root/intro/version_history.rst. Thank you! |
@objectiser is this still WIP or should this get reviewed? |
@mattklein123 It should be ok to review now thanks. |
@rnburn Would you be able to have a look at this PR, as it impacts the LightStep and OpenTracing drivers? |
@@ -17,18 +18,20 @@ class SpanContext { | |||
/** | |||
* Default constructor. Creates an empty context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if empty, please default to false if you must default. I'd hate to see trace id 0 accidentally sent to a zipkin endpoint due to a code error. False already says "don't send this"
bool sampled() const { return sampled_; } | ||
|
||
/** | ||
* @return the sampled flag as a 1-character string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style choice, but the docs here make me feel like using a literal somehow and then just proving below with a unit test. I don't know if there is an ability to use a literal '1' and '0' in cpp, so if that's not possible ignore me.
@@ -302,7 +302,8 @@ class Span : public ZipkinBase { | |||
* Default constructor. Creates an empty span. | |||
*/ | |||
Span() | |||
: trace_id_(0), name_(), id_(0), debug_(false), monotonic_start_time_(0), tracer_(nullptr) {} | |||
: trace_id_(0), name_(), id_(0), debug_(false), sampled_(true), monotonic_start_time_(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again default to sampled false on dummy data
* Note that a function is needed because the string used to build the regex | ||
* cannot be initialized statically. | ||
*/ | ||
static const std::regex& notSampledRegex() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most don't use regex, rather '1' or 'true' literal comparison, anything else false (if present). regex tends to be a lot more expensive and warranted if there's a complex decision. If there's secondary reasons for regex, ignoreme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this has already been changed in the main PR
|
||
// Only context header set is B3 sampled to indicate trace should not be sampled (using legacy | ||
// 'false' value) | ||
const std::string notSampled = "false"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invert this condition. more important to test 'true' than 'false'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added another test for 'true' case
EXPECT_TRUE(zipkin_span->span().sampled()); | ||
} | ||
|
||
TEST_F(ZipkinDriverTest, NoB3ContextSampledFalse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should a test class include one for absence of XB3Sampled from inbound headers? (I don't see it but might be blind)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following test does - the trace and span id are set, but the sampling decision is taken from envoy. I will add a nullptr check to make it clearer.
…#2787 Signed-off-by: Gary Brown <gary@brownuk.com>
a53fb24
to
8d1454e
Compare
@mattklein123 Rebased on the latest master - not sure if this one could make it into 1.6.0 aswell? If possible, then I'll create the release note. |
@objectiser yeah if we think this is ready I can go through later today. Please do the release note PR. Thanks! |
Hey @objectiser, I'm going to add support for sampling.priority to lighstep's tracer. I think things would be a little cleaner if we only used that to specify sampling so that we don't need a special case for the LightStep tracer. I'll also add constants for the standard tags to OT-C ++ so that they don't need to be hard-coded (opentracing/opentracing-cpp#38). |
@rnburn That sounds good in the longer term, but if Matt wants to get this change into the 1.6.0 release, then it would be better to merge the current PR and then simplify it once LightStep supports the tag - as it would be a localised change? |
@objectiser @rnburn I think there is no urgency to get this into 1.6.0. If there is a cleaner way of doing this it would be great if the two if you can work together to make it happen. Thank you! |
@mattklein123 Although not urgent, I think the current way of working will lead to problems when users start using sampling with services that are also instrumented with zipkin compatible tracer. For example, if the inbound proxy decides not to sample, currently it will not create a span and therefore there will be no B3 sampled header passed to the service - and therefore the service's tracer may decide to start a trace. @rnburn's suggestion essentially means that these lines will be removed (along with the @rnburn When would the updated LightStep tracer and OT C++ be available? |
@objectiser - I can get the updates in today. But I can also put the changes them in a later PR, if you'd prefer to go ahead with this as is. |
@rnburn If you could get the changes in today that would be great. |
@objectiser does this mean http_conn_manager no longer decides what to trace? Will this affect the sampling percentage settings that were implemented there or are all tracers expected to implement them? |
@mandarjog no, if the sampling decision for a particular trace instance has not yet been made, then it will use the decision determined by the This PR is about ensuring that the sampling decision propagated to downstream service's via the tracer specific context (e.g. The other issue as outlined previously is if the envoy proxy sampling decision is false, but the service being proxied also uses a zipkin compatible tracer, then the service's tracer may make its own sampling decision as no |
@objectiser @rnburn sorry, what's the TL/DR on this? Are we going to wait for a parallel @rnburn change to land? If we can have the good solution happen within a week or so should we just wait for that? |
@mattklein123 Sounds like @rnburn could have his changes today/tomorrow - however they won't make a big difference to this PR, so would suggest reviewing this PR as is and once ready we either: (i) merge and do the LightStep related changes in a separate PR |
…ould override sampling Signed-off-by: Gary Brown <gary@brownuk.com>
8d1454e
to
4160ed8
Compare
Hi @mattklein123 would it be possible to review this PR soon and then when the lightstep changes are available we'll do a separate PR? |
@objectiser yes I will review today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, some small comments. @rnburn can you review also? Thank you.
include/envoy/tracing/http_tracer.h
Outdated
@@ -13,6 +13,19 @@ namespace Tracing { | |||
|
|||
enum class OperationName { Ingress, Egress }; | |||
|
|||
enum class Reason { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that you just moved this code, but can you do us the favor of documenting each of these enum values now that it's in public interface code and other tracers might need to implement?
include/envoy/tracing/http_tracer.h
Outdated
ClientForced, | ||
}; | ||
|
||
struct Decision { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, can we add some docs?
include/envoy/tracing/http_tracer.h
Outdated
|
||
struct Decision { | ||
Reason reason; | ||
bool is_tracing; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not your code, but can we rename to "traced" potentially? (Or "perform_trace" or something like that)
include/envoy/tracing/http_tracer.h
Outdated
@@ -88,7 +101,8 @@ class Driver { | |||
* Start driver specific span. | |||
*/ | |||
virtual SpanPtr startSpan(const Config& config, Http::HeaderMap& request_headers, | |||
const std::string& operation_name, SystemTime start_time) PURE; | |||
const std::string& operation_name, SystemTime start_time, | |||
const Tracing::Decision& tracing_decision) PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would pass Tracing::Decision
by value as it's a small object.
active_span_ = connection_manager_.tracer_.startSpan(*this, *request_headers_, request_info_, | ||
tracing_decision); | ||
|
||
if (!active_span_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't remember now, but when does this return nullptr? I thought we always have a tracer and this will always return a span (even if the null span).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, current behaviour at start of traceRequest()
is to return if not being sampled, before the span is created. But this check will be removed when lightstep supports the sampling tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok. Got it. Thanks.
// If tracing decision is no, and sampling decision is not communicated via tags, then | ||
// return a null span to indicate that tracing is not being performed. | ||
if (!tracing_decision.is_tracing && !useTagForSamplingDecision()) { | ||
return nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is related to my other comment. Should we return NullSpan here? I'm not sure if there are other points in the code that might be assuming span is never nullptr? Can you check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just replicating the current nullptr span if is_tracing
is false. But this code will be removed once lightstep supports the sampling tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry for the delay. I put in lightstep/lightstep-tracer-cpp#94, but there were some concerns around how to handle the propagation of spans with no sampling for lightstep.
I'll follow up with a separate PR once it gets sorted out.
@@ -164,8 +170,13 @@ SpanPtr OpenTracingDriver::startSpan(const Config&, Http::HeaderMap& request_hea | |||
tracerStats().span_context_extraction_error_.inc(); | |||
} | |||
} | |||
active_span = tracer.StartSpan(operation_name, {opentracing::ChildOf(parent_span_ctx.get()), | |||
opentracing::StartTimestamp(start_time)}); | |||
opentracing::StartSpanOptions options; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rnburn Can you review this part?
…Decision by value, changed is_tracing to traced Signed-off-by: Gary Brown <gary@brownuk.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@objectiser LGTM, thanks. I would like to wait for a @rnburn review before merging. Hopefully he can take a look sometime soon.
opentracing::StartTimestamp(start_time)}); | ||
opentracing::StartSpanOptions options; | ||
opentracing::ChildOf(parent_span_ctx.get()).Apply(options); | ||
opentracing::StartTimestamp(start_time).Apply(options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will work; though it would be more direct to write
options.references.emplace_back(opentracing::SpanReferenceType::ChildOfRef, parent_span_context.get());
options.start_system_timestamp = start_time;
Those objects are really just to make specifying options with the StartSpan method convenient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had trouble making this change - compilation errors. Possibly this could be done as part of your PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a typo (corrected now)
options.references.emplace_back(opentracing::SpanReferenceType::ChildOf, parent_span_context.get());
should have been
options.references.emplace_back(opentracing::SpanReferenceType::ChildOfRef, parent_span_context.get());
I can verify the code tomorrow.
opentracing::ChildOf(parent_span_ctx.get()).Apply(options); | ||
opentracing::StartTimestamp(start_time).Apply(options); | ||
if (!tracing_decision.traced) { | ||
opentracing::SetTag(OpenTracingTags::get().SAMPLING_PRIORITY, 0).Apply(options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a constant for sampling priority to match what's in the other APIs so you could write
options.tags.emplace_back(opentracing::ext::sampling_priority, 0);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use the new constant.
Signed-off-by: Gary Brown <gary@brownuk.com>
Signed-off-by: Gary Brown <gary@brownuk.com>
Signed-off-by: Gary Brown <gary@brownuk.com>
@rnburn Thanks - that worked. However seems to be a build problem on mac with the newer version of opentracing? |
@objectiser - I put in a small change to the bazel build in opentracing/opentracing-cpp#69. It should work on OSX now. |
Signed-off-by: Gary Brown <gary@brownuk.com>
@rnburn @objectiser I'm not totally following the back and forth. Is this ready? @rnburn do you approve? |
Yes, it looks good to me. 👍 |
@rnburn Thanks for the quick fix. @mattklein123 Hopefully good to go now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @objectiser and @rnburn!
The sampling decision is currently used to determine whether a span should be created/started. The decision is propagated to downstream services via the
x-trace-id
header.The problem is that, if the
x-trace-id
is not propagated along with other trace context (e.g. the service does not propagate it due to using a zipkin compatible tracer in the service that does not know about this header), then the decision is lost in the downstream service and it will make its own sampling decision. This will lead to inconsistent fragmented traces.The other problem is that any B3-sampled header received with a request will be ignored by the local sampling decision.
This PR deals with the situation by delegating the envoy proxy's tracing (sampling) decision down to the tracers themselves. There are essentially three tracers, which handle the decision in different ways:
LightStep - this tracer only creates/starts a span if the tracing decision indicates it should - otherwise no span is created - as now. It is the responsibility of any proxied services to propagate the
x-trace-id
to ensure that decision is passed to downstream services.Zipkin - this tracer propagates its own sampling decision via the
X-B3-Sampled
header. If that header is not present, such as when creating the root span or it hasn't been propagated with the trace context, then the proxy's tracing (sampling) decision will be used to set theX-B3-Sampled
header which will subsequently be passed to downstream services.Other OpenTracing compatible tracers - the only way to influence the sampling of a trace instance in the OpenTracing standard is through the use of the
sampling.priority
tag - so in this situation if the proxy sampling decision is false, thesampling.priority
tag will be set to 0 when creating the span. It is then the tracer's responsibility to use or ignore that hint.Risk Level: Medium
Testing: unit tests
Docs Changes:
none
Release Notes:
This change delegates the sampling decision to the tracer, which may have additional context that needs to be taken into account when determining the actual sampling state. If no such state already exists for a trace instance, then the sampling decision supplied by the Envoy proxy will be used.