
add experimental tracing using a native event collector #4250

Open

wants to merge 6 commits into master from collector-beta

Conversation

rochdev
Member

@rochdev rochdev commented Apr 20, 2024

What does this PR do?

Add experimental tracing using a native event collector that stores and processes traces outside of the application.

Motivation

This was previously just a proof of concept. The goal of this PR is to finally have something on the main branch with proper CI tests so that we stay compatible with the use of events.

Additional Notes


github-actions bot commented Apr 20, 2024

Overall package size

Self size: 6.73 MB
Deduped: 61.99 MB
No deduping: 62.27 MB

Dependency sizes

name | version | self size | total size
@datadog/native-appsec | 8.0.1 | 15.59 MB | 15.6 MB
@datadog/native-iast-taint-tracking | 2.1.0 | 14.91 MB | 14.92 MB
@datadog/pprof | 5.3.0 | 9.85 MB | 10.22 MB
protobufjs | 7.2.5 | 2.77 MB | 6.56 MB
@datadog/native-iast-rewriter | 2.3.1 | 2.15 MB | 2.24 MB
@opentelemetry/core | 1.14.0 | 872.87 kB | 1.47 MB
@datadog/native-metrics | 2.0.0 | 898.77 kB | 1.3 MB
@opentelemetry/api | 1.8.0 | 1.21 MB | 1.21 MB
import-in-the-middle | 1.7.4 | 70.19 kB | 739.86 kB
msgpack-lite | 0.1.26 | 201.16 kB | 281.59 kB
opentracing | 0.14.7 | 194.81 kB | 194.81 kB
semver | 7.5.4 | 93.4 kB | 123.8 kB
pprof-format | 2.1.0 | 111.69 kB | 111.69 kB
@datadog/sketches-js | 2.1.0 | 109.9 kB | 109.9 kB
lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB
lru-cache | 7.14.0 | 74.95 kB | 74.95 kB
ignore | 5.2.4 | 51.22 kB | 51.22 kB
int64-buffer | 0.1.10 | 49.18 kB | 49.18 kB
shell-quote | 1.8.1 | 44.96 kB | 44.96 kB
istanbul-lib-coverage | 3.2.0 | 29.34 kB | 29.34 kB
tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB
limiter | 1.1.5 | 23.17 kB | 23.17 kB
dc-polyfill | 0.1.4 | 23.1 kB | 23.1 kB
retry | 0.13.1 | 18.85 kB | 18.85 kB
jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB
crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB
path-to-regexp | 0.1.7 | 6.78 kB | 6.78 kB
koalas | 1.0.2 | 6.47 kB | 6.47 kB
module-details-from-path | 1.0.3 | 4.47 kB | 4.47 kB

🤖 This report was automatically generated by heaviest-objects-in-the-universe

@pr-commenter

pr-commenter bot commented Apr 20, 2024

Benchmarks

Benchmark execution time: 2024-05-10 20:16:51

Comparing candidate commit 0ce5417 in PR branch collector-beta with baseline commit 52da415 in branch master.

Found 1 performance improvement and 0 performance regressions! Performance is the same for 259 metrics; 6 metrics are unstable.

scenario:plugin-graphql-with-depth-on-max-18

  • 🟩 max_rss_usage [-156.073MB; -89.875MB] or [-16.340%; -9.410%]


codecov bot commented Apr 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.19%. Comparing base (5278b1c) to head (12d68ac).
Report is 4 commits behind head on master.

Current head 12d68ac differs from pull request most recent head c6e9cf4

Please upload reports for the commit c6e9cf4 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4250       +/-   ##
===========================================
- Coverage   80.42%   45.19%   -35.24%     
===========================================
  Files           3       89       +86     
  Lines         373     2518     +2145     
  Branches       33       33               
===========================================
+ Hits          300     1138      +838     
- Misses         73     1380     +1307     

☔ View full report in Codecov by Sentry.

@rochdev rochdev marked this pull request as ready for review April 23, 2024 09:48
@rochdev rochdev requested review from a team as code owners April 23, 2024 09:48
@rochdev rochdev requested a review from wconti27 April 23, 2024 09:48
Member

@simon-id simon-id left a comment


Didn't do a super in-depth review (especially for encoder.js); a few nitpicks and questions, but LGTM.

@simon-id
Member

simon-id commented May 3, 2024

comments closed, LGTM 👍

.filter(val => val)
.join(' ')

span.setTag(RESOURCE_NAME, resource)
Contributor


Why is the RESOURCE_NAME tag no longer added?

We're kinda-sorta using it here for establishing endpoint names to associate with profiles, although we can also compute the value from HTTP_METHOD and HTTP_ROUTE tags when RESOURCE_NAME isn't present.
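
For reference, a minimal sketch of the fallback described above, assuming the conventional tag strings behind RESOURCE_NAME, HTTP_METHOD and HTTP_ROUTE; the helper name is hypothetical and not part of the profiler code:

```js
// Prefer the resource name tag; otherwise build an endpoint name from the
// HTTP method and route tags, as described above. Illustrative only.
function endpointName (tags) {
  if (tags['resource.name']) return tags['resource.name']
  if (tags['http.method'] && tags['http.route']) {
    return `${tags['http.method']} ${tags['http.route']}`
  }
  return undefined
}

// endpointName({ 'http.method': 'GET', 'http.route': '/users/:id' }) -> 'GET /users/:id'
```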

Member Author


It's still set, but later in the pipeline. At this point the resource name can't be trusted anyway because we allow changing it up until the span is finished, which customers often do.

Member Author


Would it be possible to simply always use HTTP_METHOD and HTTP_ROUTE for profiling, since the real resource name would already be ignored anyway even with the current code? (Although it could be argued these can also be changed by a span hook, but at that point the profiler would need to wait for the span to actually finish before looking at the resource name.)

@rochdev rochdev marked this pull request as draft May 7, 2024 22:44
@rochdev rochdev force-pushed the collector-beta branch 2 times, most recently from 39e4e9f to f44dbab Compare May 8, 2024 01:48
@rochdev rochdev marked this pull request as ready for review May 8, 2024 01:51
Comment on lines +34 to +36
const ticks = finishTime
? now() + finishTime - trace.ticks - trace.startTime
: now() - trace.ticks
Contributor


Can we do this on the collector side? Otherwise we need to keep the timing state in the library. I feel like it should be safe to just send a Date.now() in the start event and monotonic timestamps in each of the start and end, then compute the delta in the collector.
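
A rough sketch of this suggestion; the field names are hypothetical rather than the PR's actual event format, and the point is only that the library emits raw readings while the collector derives durations and absolute times:

```js
const { performance } = require('perf_hooks')

function startEvent (spanId) {
  return {
    spanId,
    wallClockStart: Date.now(),        // absolute anchor, sent once per span
    monotonicStart: performance.now()  // drift-free reading used for durations
  }
}

function finishEvent (spanId) {
  return { spanId, monotonicFinish: performance.now() }
}

// Collector side: reconstruct duration and absolute end time from the deltas.
function resolve (start, finish) {
  const durationMs = finish.monotonicFinish - start.monotonicStart
  return { startMs: start.wallClockStart, endMs: start.wallClockStart + durationMs }
}
```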

Member Author


That's basically what this is doing, except that it does it at the trace level instead of the span level. This is for efficiency, but also because it was otherwise causing intermittent slight drifts. It's also worth noting that the computation only happens when a finish time is provided; otherwise it only sends the monotonic timestamp.

Contributor


In both cases though you're holding the trace.ticks value. I'm suggesting we should try to avoid holding that in the library if at all possible. And I'm not sure why there would be drift coming from the same process on the same machine between monotonic timestamps that both came from that same process. 🤔

Member Author


This is also an optimization so that, for applications under high load that generate traces quickly, the events only need to send <=32 bits for the ticks instead of a constant 64 bits.

It also means an easier-to-reason-about floor for the ticks, which would always be 0 for the segment. However, since the number is arbitrary anyway, that doesn't actually matter too much.

Basically, it seemed to me that adding a single subtraction operation didn't add enough complexity to justify going with the alternative that saves a subtraction but ends up sending a lot more data. With the current structure of events, the timestamp is basically otherwise the largest piece of data of an event. I even considered going with microseconds since nanoseconds are rarely (read: never) useful for tracing data.
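
As a rough illustration of the wire-size argument, assuming msgpack framing (msgpack-lite is already a dependency; the PR's own encoder may differ): small offsets relative to the segment start fit in narrow integer formats, while absolute timestamps always need the widest form.

```js
const msgpack = require('msgpack-lite')

console.log(msgpack.encode(100).length)       // 1 byte  (positive fixint)
console.log(msgpack.encode(1234).length)      // 3 bytes (uint16)
console.log(msgpack.encode(2 ** 33).length)   // 9 bytes (uint64 or float64)
```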

Contributor


But you're trading an extra potential maximum of 32 bits of data over the wire for more than that held in process memory for the duration of that span. That adds up quickly when there are potentially hundreds of thousands of concurrent spans. Process memory is a lot more scarce than bandwidth.

More concerning to me though is that we need a place to store any data like this which is held in-process for the span duration. I'm trying to get us away from storing the big bag of data that is SpanContext, but things like this further entangle it.

Member Author


It's actually the opposite. It trades the memory that would otherwise be held for every span (because the data is buffered) for a single integer that is held only once for the whole segment. So this actually saves memory in-process, and also has the benefit of reducing bandwidth. You could argue it increases CPU usage a tiny bit by adding a subtraction, but that's minimal enough to justify the trade-off.

Contributor


I'm not clear in what way this is reducing memory. We're still holding data in the SpanContext and therefore holding the SpanContext, are we not? So don't we still have the same problem with holding this big bag of data that we've always had? 😕

Member Author


I still don't understand where the problem is. Correct me if I'm wrong, but these seem to be the 2 approaches we are discussing:

Option 1:

  • Send timestamp and ticks for segment.
  • Send raw ticks for the span.

Option 2:

  • Send timestamp for segment.
  • Send computed ticks for span.

For option 1, you always end up having to send 64-bit values. For option 2, you often need fewer than 64 bits since the values start from 0 instead of from the ticks when the segment started. So option 2 uses less memory, at the cost of slightly more CPU for a subtraction per span, but that cost is extremely low and would be offset anyway by needing to write less data to the buffer.

I don't understand how putting less data in the buffer is supposed to result in more memory? Can you clarify your calculation?
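
A side-by-side sketch of the two options above may help; the field names are illustrative, not the PR's actual event format:

```js
const { performance } = require('perf_hooks')

// Option 1: every event carries a raw monotonic reading; nothing extra is held
// in the library and the collector does all of the arithmetic.
const option1 = {
  segmentStart: { wallClock: Date.now(), ticks: performance.now() },
  spanStart: { ticks: performance.now() },   // always a full-width value
  spanFinish: { ticks: performance.now() }   // always a full-width value
}

// Option 2 (the approach in this PR, per the discussion above): the segment
// start ticks are kept once per segment, and each span event carries only the
// usually-small offset from it.
const segmentTicks = performance.now()
const option2 = {
  segmentStart: { wallClock: Date.now() },
  spanStart: { ticks: performance.now() - segmentTicks },   // small offset
  spanFinish: { ticks: performance.now() - segmentTicks }   // small offset
}

console.log(option1, option2)
```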

Contributor


Computed ticks require holding additional data on the SpanContext between all the captured points in the segment, while just using raw ticks everywhere does not. It also makes it harder to take a more stream-oriented approach with less segment buffering, since the timestamp the computed ticks are relative to changes, which adds complexity when translating them back to absolute timestamps for backend processing and frontend visualization.

There's not a whole lot of benefit to being event-based if we're going to just continue buffering all the data in-process anyway. We're just moving held memory around rather than actually getting it out of the process which is the whole point of wanting to go for an event model.


let segmentId = 0

class DatadogCollectorSpan extends DatadogSpan {
Contributor


This sub-classing is rather difficult to follow. Can we not just put all this logic in-place in DatadogSpan and use channel.hasSubscribers to skip it? There's a lot of circular dependency between this and the parent class calling methods on this that makes it challenging to understand the flow. I feel like it'd be a lot easier to understand if it was done in the parent class instead, or even with plain functions called from the parent class. 🤔
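
A minimal sketch of the hasSubscribers approach suggested here, using Node's core diagnostics_channel (dd-trace uses dc-polyfill, which mirrors this API); the channel name, payload shape, and the pared-down class are hypothetical, not the actual implementation:

```js
const { channel } = require('diagnostics_channel')

const spanStartCh = channel('datadog:collector:span:start')

class DatadogSpan {
  constructor (name) {
    this.name = name
    // Only build and publish the event when the collector is actually enabled.
    if (spanStartCh.hasSubscribers) {
      spanStartCh.publish({ name, ticks: Date.now() })
    }
  }
}

new DatadogSpan('web.request')   // effectively a no-op unless someone subscribed
```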

Member Author


The idea was to isolate the logic to a separate class instead of mixing it with the existing logic. I could instead just put all the code in the same place but it seemed like it would be even more confusing. I felt like keeping them separate also makes it more obvious what is shared versus what isn't.

Contributor


It could be separate but as functions that are just called from the other. I feel like the flow would be a lot clearer that way. As a sub-class you have to do a lot of spelunking into the parent class to understand what is being called when. 😐

Member Author


The parent calling a protected method on the child is how most of the tracer works since it comes from an OpenTracing background and OT does that extensively, so I don't think it should be surprising to anyone reading the code.

With that said, I can change it if you really want, but I'm not sure of the best approach then. I don't necessarily want to do it in the same file, because one of the ideas here was to not load this code at all when it's not needed. Not doing this is currently a huge problem in the library especially for serverless, and I didn't want to exacerbate the issue. What approach would you suggest with that in mind?

Contributor


Not going to block on this one, just wanted to point out it's really hard to follow so it would be ideal if we could make it more coherent somehow.

One possibility for only loading it when needed is multiple layers of diagnostics channel. You could have the existing class publish events, with subscribers that do the transformations these functions are doing and then publish the result to those collector channels. Not sure that would be any clearer, but it would enable optional loading.

We could also consider some sort of dependency injection type thing where we register export strategies on the tracer and the span class could just loop over registered strategies and call each. Then the flow goes in a more singular direction rather than this confusing circular dependency.
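
A hypothetical sketch of that "register export strategies" idea: the tracer holds a list of strategies and the span just loops over them, so the flow only goes in one direction. None of these names exist in the codebase.

```js
class Tracer {
  constructor () {
    this._strategies = []
  }

  registerStrategy (strategy) {
    this._strategies.push(strategy)
  }
}

class Span {
  constructor (tracer, name) {
    this._tracer = tracer
    this.name = name
  }

  finish () {
    for (const strategy of this._tracer._strategies) {
      strategy.onSpanFinish(this)   // e.g. the event-collector exporter
    }
  }
}

// Usage: the collector strategy is only registered (and loaded) when enabled.
const tracer = new Tracer()
tracer.registerStrategy({ onSpanFinish (span) { console.log('finished', span.name) } })
new Span(tracer, 'web.request').finish()
```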

Member Author


These sound like great ideas to explore, but in my opinion are out of the scope of this PR. Internals are already loaded in various ways that can end up being hard to follow, and standardizing on a way to do this would be more beneficial than introducing yet another way to load things for this PR specifically.

Contributor


I'm more concerned that piling on more complexity will make it even harder to untangle in the future whenever we get to cleaning that up. The present existence of tech debt doesn't justify making more. 😅

packages/dd-trace/src/collector/span.js — resolved review comment (not shown)

this._timer = clearTimeout(this._timer)

collector.send_events(data)
Collaborator


❓ is collector.send_events sync? Do we need a done callback then? Can't we just return? Or is this done to preserve the flush signature from other exporters?
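
A sketch of what preserving that flush signature would look like if the call is synchronous; collector.send_events() is assumed from the diff above, not a documented API, and the function shape is illustrative:

```js
function flush (collector, data, done = () => {}) {
  collector.send_events(data)   // assumed to be a synchronous native call
  done()                        // preserves the (done) contract of the other exporters
}
```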
