-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for sending traces to OpenTelemetry #1230
Conversation
embeddedconfig/global.yaml.tmpl
Outdated
@@ -427,3 +431,9 @@ replica: | |||
staticpeeraddrs: [] | |||
metadatabaseurls: *AllReplicaBaseUrls | |||
replicaserviceendpoint: *GlobalReplicaRust | |||
|
|||
otel: | |||
samplerate: 0.0001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sample rate matches Borda's current sample rate of 0.01%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's easier to specify the sample rate as an integer, where the ratio is 1/samplerate
(this also allows us to re-weight without losing precision both ways)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not sure I follow. What's the problem with using decimal rates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, by modeling this after Honeycomb's DeterministicSampler, I ended up switching to Integer rates anyway :)
otel/otel.go
Outdated
|
||
func init() { | ||
// Fully sample all borda.FullyReportedOps | ||
fullySampledOps = make(map[string]bool, len(borda.FullyReportedOps)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we might want to make these configurable if they're not already? I worry this could otherwise hammer honeycomb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also the only place in the otel
code where we reference borda
. Probably worth splitting it out. I can't find where this is defined at a glance, can you link it?
One note - this could muck with our re-weighting of sampled traces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer an issue
resource := | ||
resource.NewWithAttributes( | ||
semconv.SchemaURL, | ||
semconv.ServiceNameKey.String("flashlight"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need attribute.Key("SampleRate").Int64(1/cfg.SampleRate)
(with some type fudging) so that honeycomb re-weights counts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also appears to be wrong in github.com/getlantern/telemetry. (cc @myleshorton)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm handling this in our sampler on a span by span basis.
embeddedconfig/global.yaml.tmpl
Outdated
samplerate: 0.0001 | ||
endpoint: api.honeycomb.io:443 | ||
headers: | ||
x-honeycomb-team: GKSJeT1vfWEWgGzPoxiKLE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to not do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I guess that would be metrics.iantem.io
next.
ops.EnableOpenTelemetry("flashlight") | ||
|
||
stopperMx.Lock() | ||
if stopper != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it necessary for this code to be thread safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the stopper can be called through the public function Stop()
, which can theoretically run concurrently with new calls to Configure()
. Configure()
is called every time our global config changes, so it happens multiple times during the running of the program. Stop()
is really only called when the app shuts down, but in theory that could happen concurrently with an ongoing Configure()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha - I understand the model now. I am a bit concerned that otel has global state (otel.SetTracerProvider
) and that we don't understand how thread-safe that is (it appears that it is, but that's also not in the contract). Do we already have locked global state in flashlight somewhere? If so we could just put a TracerProvider
in there, and it might simplify this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also full rely on otel's thread-safety and do something like
oldTp := otel.GetTracerProvider()
otel.SetTracerProvider(newTp)
if oldTp != nil {
oldTp.Shutdown()
}
Shutdown
appears to call exporter.Shutdown
. (At least the documentation says that it does.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, but sadly GetTracerProvider()
returns an interface that doesn't expose Shutdown()
. I can cast to the concrete type, but if someone set a different implementation of TracerProvider
then we get some weirdness. I think I'll leave as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would they do that? That's super annoying. What about
interface Shutdowner {
Shutdown()
}
if shutdowner, ok := oldTp.(Shutdowner); ok {
shutdowner.Shutdown()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, we SetTracerProvider
a few lines above, so it doesn't seem a stretch to try to cast to that type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave this here: open-telemetry/opentelemetry-go#23
go stopper() | ||
} | ||
stopper = func() { | ||
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hanging for a minute when a user is trying to close the app (?) seems bad. And the otel exporter should be much faster than that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's faster than that under good network conditions, but in a censored environment, potentially going over domain fronting (not yet implemented), it can take a moment. The alternative is to lose potentially valuable telemetry data.
0a9f002
to
e75c54a
Compare
headers: | ||
x-honeycomb-team: GKSJeT1vfWEWgGzPoxiKLE | ||
samplerate: 0.0001 | ||
opsamplerates: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for treating these ops differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It varies by op, but let's take something like client_started
as an example. We may want to use that as a signifier that the client ran at all and use that to derive our user numbers. This even happens only once per startup, so if we're sampling it at a low rate, we'll completely miss a bunch of clients. If we're interesting in looking for clients of a specific app version in a small country, this means that we could completely miss the presence of these if they just didn't make the cut for getting sampled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More generally, for low frequency ops, sampling at low rates risks missing them completely, so it's desirable to sample some things at higher rates.
otel/otel.go
Outdated
psc := trace.SpanContextFromContext(p.ParentContext) | ||
|
||
sampleRateMx.RLock() | ||
defer sampleRateMx.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not seeing what this protects against. If it's just opSampleRates
(isn't that defined at startup?) then we could make a copy of the map in the opAwareSampler
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think I was missing that we reload the global config sometimes. I would release the lock right after we read the sample right.
This is maybe a stupid question, but could we just restart flashlight when it changes instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flashlight is constantly handling client connections, so restarting the whole process is really disruptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, Configure
gets called when we reload the config, which creates a new opAwareSampler
, right? So we can actually copy the op sample rates into the struct and not worry about locking here. I think it's worth it since this will be called concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like using the copy, I've implemented that.
58ef0c3
to
5e28c90
Compare
The CI failures don't seem to be specific to this PR, so I'm going to merge. |
This PR adds support for sending ops data to an OpenTelemetry endpoint as traces. The sample rate can be configured through the global config, and all ops that are fully sampled by Borda are also fully sampled by OpenTelemetry.