Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: [do not merge] OpenTelemetry Datadog Exporter #1316

Closed

Conversation

ericmustin
Copy link

@ericmustin ericmustin commented Jul 15, 2020

Edit

DataDog has released an OpenTelemetry exporter which can be found at https://github.com/DataDog/dd-opentelemetry-exporter-js

Which problem is this PR solving?

  • This PR adds a plugin for exporting traces to the 3rd party vendor Datadog. It is meant for review purposes only and should not be merged since, afaik current opentelemetry exporter vendor policy dictates that 3rd party vendor specific exporters should be hosted under the Vendor's Github Org, and they should be linked to via the OpenTelemetry-Registry. That being said per SIG mtg discussion I was told this would be the appropriate way to be able to get as many eyes for a review as possible.

Any and all feedback is very much appreciated!

Short description of the changes

  • This PR adds a few components which I'll attempt to summarise below

    • DatadogSpanProcessor: An implementation of the SpanProcessor. A current constraint when exporting traces to the Datadog Trace-Agent is that the traces must be 'complete' ie, all started spans in a trace should be sent in one array when they are all finished. The DatadogSpanProcessor is functionally very similar to the BatchSpanProcessor in that it batches together spans for export, the main implementation difference that it only exports complete traces, and includes both a maxQueueSize and maxTraceSize

    • DatadogExporter: An exporter implementation that exposes export and shutdown methods. The exporter in practice is very similar to the JaegerExporter in that it mainly serves to transform OpenTelemetry Spans into Datadog Spans (via the helper methods in ./transform.ts ), and then flush those to the Datadog trace-agent. It does so by leveraging the datadog tracer dd-trace-js's internal AgentExporter , which converts the Spans into the appropriate msgpack format and sends them via http to the trace-agent. Additionally, the DatadogExporter allows users to set Datadog specific tags such as env, service_name, and version. both by Env var and via in-code configuration

    • DatadogPropagator: A propagator implementation that injects/extracts datadog specific distributed tracing propagation headers. This is similar to the B3Propagator in that it must handle 128bit => 64bit trace-id conversion.

    • DatadogProbabilitySampler: A sampler implementation that records all spans but only samples span according to the sampling rate. This is similar to the ProbabilitySampler with the main difference being that all spans are records (but not sampled). The underlying reasons for this is that Datadog's Trace-Agent generates some trace-related metrics such as hits and errors, and in order to properly upscale sampled traces into accurate hit counts, it needs to be aware of all traces/spans. Since the ReadableSpan doesn't expose the sampling rate of the trace, the only way to determine this information is to export all spans and traces (with traces that should be dropped due to sampling having their datadog specific sampling rate metric tag adjusted accordingly but still exported to the trace agent)

Example Usage

import { NodeTracerProvider } from '@opentelemetry/node';
import { DatadogSpanProcessor, DatadogExporter, DatadogPropagator, DatadogProbabilitySampler } from '@opentelemetry/exporter-datadog';
const provider = new NodeTracerProvider();
const exporterOptions = {
  service_name: 'my-service', // optional
  agent_url: 'http://localhost:8126', // optional
  tags: 'example_key:example_value,example_key_two:value_two', // optional
  env: 'production', // optional
  version: '1.0' // optional
}
const exporter = new DatadogExporter(exporterOptions);
//  Now, register the exporter.
provider.addSpanProcessor(new DatadogSpanProcessor(exporter));
// Next, add the Datadog Propagator for distributed tracing
provider.register({
  propagator: new DatadogPropagator(),
  // while datadog suggests the default ALWAYS_ON sampling, for probability sampling,
  // to ensure the appropriate generation of tracing metrics by the datadog-agent,
  // use the `DatadogProbabilitySampler`
  sampler: new DatadogProbabilitySampler(0.75)  
})

Example Output

example output

ericmustin added 19 commits July 6, 2020 17:40
…e sampling rate is accounted for on auto_reject
…ess Maps and account for buffer max trace and queue size
…s and update underlying behavior to match expectations
merge latest changes from master
Copy link
Member

@dyladan dyladan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not merge

@dyladan
Copy link
Member

dyladan commented Jul 15, 2020

Based on your explanation of your sampler, you may be interested in this open-telemetry/oteps#107

@codecov
Copy link

codecov bot commented Jul 15, 2020

Codecov Report

Merging #1316 into master will decrease coverage by 0.03%.
The diff coverage is 92.77%.

@@            Coverage Diff             @@
##           master    #1316      +/-   ##
==========================================
- Coverage   93.16%   93.12%   -0.04%     
==========================================
  Files         139      146       +7     
  Lines        3921     4336     +415     
  Branches      804      922     +118     
==========================================
+ Hits         3653     4038     +385     
- Misses        268      298      +30     
Impacted Files Coverage Δ
...metry-exporter-datadog/src/datadogSpanProcessor.ts 87.38% <87.38%> (ø)
...es/opentelemetry-exporter-datadog/src/transform.ts 90.90% <90.90%> (ø)
...ages/opentelemetry-exporter-datadog/src/datadog.ts 94.11% <94.11%> (ø)
...-exporter-datadog/src/datadogProbabilitySampler.ts 94.73% <94.73%> (ø)
...elemetry-exporter-datadog/src/datadogPropagator.ts 98.18% <98.18%> (ø)
...ges/opentelemetry-exporter-datadog/src/defaults.ts 100.00% <100.00%> (ø)
...ckages/opentelemetry-exporter-datadog/src/types.ts 100.00% <100.00%> (ø)
... and 4 more

Copy link
Member

@vmarchaud vmarchaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. The only things that stand out is the specific behavior to wait for all spans to be ended to send them; the user should be aware that the exporter will drop spans that should have been traced (specially because the default NoopLogger, it might be hard for user to understand where spans are dropped)

this._traces_spans_started.delete(traceId);
this._traces_spans_finished.delete(traceId);
this._check_traces_queue.delete(traceId);
this._exporter.export(spans, () => {});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to log an error here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this will make an export fail silently which is no good

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup makes sense, i've added an error log here. (are there preferences on log level? error feels appropriate here since most debug logs wont get registered, but not sure if there's otel conventions are noiseyness)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exporter for the otel collector logs an error for reference, so i think thats the way to go

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty, will keep as error log.

Copy link
Member

@dyladan dyladan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok to me in general. I would worry about the possibility for denial of tracing by an attacker who intentionally makes many requests which don't end, but this is probably minor and there are other ways to combat that.

this._exporter.shutdown();
}

// does nothing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// does nothing.
// does something.

:)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, updated to be a bit more specific

this._traces_spans_started.delete(traceId);
this._traces_spans_finished.delete(traceId);
this._check_traces_queue.delete(traceId);
this._exporter.export(spans, () => {});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this will make an export fail silently which is no good


return (
this._traces_spans_started.get(traceId) -
this._traces_spans_finished.get(traceId) <=
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird indenting here. Does the linter force this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, there's a few like this, the linter complains otherwise, this is the output of npm run lint:fix i believe.

}
}

getTraceContext(span: ReadableSpan): string | undefined[] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
getTraceContext(span: ReadableSpan): string | undefined[] {
getTraceContext(span: ReadableSpan): (string | undefined)[] {

OR

Suggested change
getTraceContext(span: ReadableSpan): string | undefined[] {
getTraceContext(span: ReadableSpan): Array<string | undefined> {

OR

Suggested change
getTraceContext(span: ReadableSpan): string | undefined[] {
getTraceContext(span: ReadableSpan): [string, string, string | undefined] {

Copy link
Author

@ericmustin ericmustin Jul 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, updated. apologies here I don't use typescript terribly often as you can probably tell, so if there's any ts choices here that you think are questionable i'd certainly be open to updating.

);

try {
this._exporter.export(formattedDatadogSpans);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sync?

Copy link
Author

@ericmustin ericmustin Jul 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is, but it's only appending the trace to the AgentExporter buffer (which gets flushed on an interval), see: https://github.com/DataDog/dd-trace-js/blob/89e9caa74f643fa31e3aba80f608a0f8857f8b74/packages/dd-trace/src/exporters/agent/index.js#L18

Copy link
Member

@obecny obecny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass

packages/opentelemetry-exporter-datadog/src/transform.ts Outdated Show resolved Hide resolved
packages/opentelemetry-exporter-datadog/src/transform.ts Outdated Show resolved Hide resolved
packages/opentelemetry-exporter-datadog/src/transform.ts Outdated Show resolved Hide resolved
packages/opentelemetry-exporter-datadog/src/datadog.ts Outdated Show resolved Hide resolved
packages/opentelemetry-exporter-datadog/src/datadog.ts Outdated Show resolved Hide resolved
packages/opentelemetry-exporter-datadog/src/datadog.ts Outdated Show resolved Hide resolved
@pauldraper
Copy link
Contributor

pauldraper commented Jul 16, 2020

The least clear thing is how to populate Datadog's name and resource, because OTel doesn't have both concepts. open-telemetry/opentelemetry-specification#531

When I created my version of the exporter, I split the span names. Like http.request:/accounts/{int}/users/{int}/permissions became name http.request and resource /accounts/{int}/users/{int}/permissions. Or pg.query:SELECT became name pg.query and resource SELECT.

I like this approach too though.

screenshot_668

@ericmustin
Copy link
Author

Seems ok to me in general. I would worry about the possibility for denial of tracing by an attacker who intentionally makes many requests which don't end, but this is probably minor and there are other ways to combat that.

@dyladan in the otel ruby trace exporter i had some logic for dropping a "random" trace when the queue hits max size, which prevents the sort of denial attack you're referring to and allows the traces that do get exported to represent a better approximation of actual application behavior. Do you think that's worth implementing here, it obviously has some shortfalls but could be better than no traces being exported at all in the case of extremely high load+incomplete traces

@dyladan
Copy link
Member

dyladan commented Jul 16, 2020

Seems ok to me in general. I would worry about the possibility for denial of tracing by an attacker who intentionally makes many requests which don't end, but this is probably minor and there are other ways to combat that.

@dyladan in the otel ruby trace exporter i had some logic for dropping a "random" trace when the queue hits max size, which prevents the sort of denial attack you're referring to and allows the traces that do get exported to represent a better approximation of actual application behavior. Do you think that's worth implementing here, it obviously has some shortfalls but could be better than no traces being exported at all in the case of extremely high load+incomplete traces

It's up to you if it's worth the extra work for the minimal benefit. I would probably not worry about it unless it becomes a problem. The better solution would be for the agent to not require full traces so you can export spans at any time :)

@jon-whit
Copy link

@ericmustin dude, you're a community hero! Thanks for putting this up. I've been wanting this now at my organization for a while. Looking forward to getting this merged in within the coming weeks :)

Copy link
Member

@obecny obecny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thx for all changes, there are just few minor comments yet from my side, but I'm approving, nice job :)

return ddSpanBase;
}

function addErrors(ddSpanBase: typeof Span, span: ReadableSpan): void {
if (span.status && span.status.code && span.status.code > 0) {
// TODO: set error.msg error.type error.stack based on error events
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: would you mind create an issue in repo so this is not forgotten ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup can do, i'll add an open issue linking to the spec pr open-telemetry/opentelemetry-specification#697 (which got approved it looks like 🎉 )

if (span.status && span.status.code && span.status.code > 0) {
// TODO: set error.msg error.type error.stack based on error events
// Opentelemetry-js has not yet implemented https://github.com/open-telemetry/opentelemetry-specification/pull/697
// the type and stacktrace are not officially recorded. Until this implemented,
// we can infer a type by using the status code and also the non spec `<library>.error_name` attribute
const possibleType = inferErrorType(span);
ddSpanBase.setTag('error', 1);
ddSpanBase.setTag('error.msg', span.status.message);
ddSpanBase.setTag(DatadogDefaults.ERROR_TAG, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this 1 mean, can this have a different values? maybe add some explanation or soem enum what it really means - I see this in few other places so this might make sense to put here some const default

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error span tag can be either 1 or 0 if set. 0 is ok, 1 is error. Generally it only gets set in the case of an error. I've moved this into a more descriptive enum with a code comment to clarify.

packages/opentelemetry-exporter-datadog/src/transform.ts Outdated Show resolved Hide resolved
packages/opentelemetry-exporter-datadog/src/transform.ts Outdated Show resolved Hide resolved
@ericmustin
Copy link
Author

@pauldraper yup I hear you, appreciate the feedback, and yes i agree with what you've brought up. For now the approach across ddog exporters in other languages (python, and ruby which is soon to ship, and here) is:

  • For resource, use some attribute/tag heuristics to determine if it's a web/http related span, and if so we set the datadogSpan resource to be the http method + route ex: GET api/info POST admin/update etc etc. This gets sanitized/normalized at the datadog-agent level (where the exporter is sending traces to).
  • Otherwise it defaults to the otel span name, which is sometimes not especially granular in the way users might prefer for helping get deeper visibility into a particular service.
  • for span operation name, we're mainly using the span.kind along with the instrumentationLibrary.name when it matches a supported plugin. so a span name's oughta look something like express.server http.client etc etc

I think we're definitely looking to be iterative here and so if there's feedback (when this starts to get used by folks) around a need for better resource/service/operation name visibility (in the datadog senses of the terms) for specific types of spans, we want to address that.

more thoughts about db spans and some general config thoughts

  • DB spans come to mind here as a place where it would be really nice to have the datadog resource set to the full obfuscated sql query., but i think it's a potential footgun. I'm still relatively new to the Otel space, but imo DB span granularity/normalization is an area i'd prefer to see addressed in a more formalised way at the Spec level, instead of having to apply a lot of vendor specific heuristics at the exporter level. The fact is that the approach to setting some sort of low cardinality span name that represents the database command varies pretty wildly across languages. JS takes an approach (pg, mysql) that is similar to what we recently added to a ruby instrumentation approach and imo is i think a nice middle ground between 0 granularity and ...shoving a sql parsing into the tracer... .But python instrumentation OTOH doesn't attempt to infer the sql command. That being said, to borrow a phrase, you goto war with the army you have, and so if the users are finding the mysql.query:SELECT pg.query:DELETE resource naming convention unhelpful and would prefer just SELECT DELETE as the resource, I think i'd be open to just adding an additional heuristic into the exporter.

  • Another idea we've seen success with on the datadog side of things is just allowing some sort of arbitrary post_processing hook where uses can define a function that takes the span as an input and allows them to modify/mutate it to fit their preferred use case. It introduces some crazy performance issues though when used improperly, it makes for a really time consuming config/onboarding experience for many users, and i'm not sure if there's a precedent around that approach in otel in general.

… queue, updated snake to camel case where appropriate
@ericmustin
Copy link
Author

@dyladan I went ahead and added some logic for dropping a random trace in the event of very high load + many incomplete traces, lmk what you think, it's relatively straightforward. When appropriate I can close this PR and start to publish this under a datadog repo+ make a pr with that repo to get it listed in the registry. And it goes without saying but appreciate the really thoughtful and timely feedback from you, @obecny and @vmarchaud on this.

@pauldraper
Copy link
Contributor

pauldraper commented Jul 18, 2020

DB spans come to mind here as a place where it would be really nice to have the datadog resource set to the full obfuscated sql query

DB queries aren't necessarily low cardinality, though many perf-based tools do rely on that fact.

Another issue is that DB queries can be very long (many hundreds or thousands of characters).

Another idea we've seen success with on the datadog side of things is just allowing some sort of arbitrary post_processing hook

I think this is a good approach, and in practice quite useful.

@dyladan
Copy link
Member

dyladan commented Aug 4, 2020

Is this now OK to close?

@jon-whit
Copy link

What's the status on this? I don't see any follow up work from it.

@ericmustin
Copy link
Author

@jon-whit hey there, we've released a beta exporter if you want to give it a shot. https://github.com/DataDog/dd-opentelemetry-exporter-js

@dyladan
Copy link
Member

dyladan commented Aug 28, 2020

@ericmustin you may want to add it to the registry if you haven't already

@ericmustin
Copy link
Author

@dyladan ah yup, will do, i have to add a bunch of ruby stuff in there as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants