-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't stop the profiler if encoding a profile fails #4779
Conversation
Overall package sizeSelf size: 7.47 MB Dependency sizes| name | version | self size | total size | |------|---------|-----------|------------| | @datadog/native-appsec | 8.1.1 | 18.67 MB | 18.68 MB | | @datadog/native-iast-taint-tracking | 3.1.0 | 12.27 MB | 12.28 MB | | @datadog/pprof | 5.3.0 | 9.85 MB | 10.22 MB | | protobufjs | 7.2.5 | 2.77 MB | 5.16 MB | | @datadog/native-iast-rewriter | 2.5.0 | 2.51 MB | 2.59 MB | | @opentelemetry/core | 1.14.0 | 872.87 kB | 1.47 MB | | @datadog/native-metrics | 2.0.0 | 898.77 kB | 1.3 MB | | @opentelemetry/api | 1.8.0 | 1.21 MB | 1.21 MB | | jsonpath-plus | 10.0.0 | 659.84 kB | 1.12 MB | | import-in-the-middle | 1.11.2 | 112.74 kB | 826.22 kB | | msgpack-lite | 0.1.26 | 201.16 kB | 281.59 kB | | opentracing | 0.14.7 | 194.81 kB | 194.81 kB | | pprof-format | 2.1.0 | 111.69 kB | 111.69 kB | | @datadog/sketches-js | 2.1.0 | 109.9 kB | 109.9 kB | | semver | 7.6.3 | 95.82 kB | 95.82 kB | | lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB | | lru-cache | 7.14.0 | 74.95 kB | 74.95 kB | | ignore | 5.3.1 | 51.46 kB | 51.46 kB | | int64-buffer | 0.1.10 | 49.18 kB | 49.18 kB | | shell-quote | 1.8.1 | 44.96 kB | 44.96 kB | | istanbul-lib-coverage | 3.2.0 | 29.34 kB | 29.34 kB | | rfdc | 1.3.1 | 25.21 kB | 25.21 kB | | tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB | | limiter | 1.1.5 | 23.17 kB | 23.17 kB | | dc-polyfill | 0.1.4 | 23.1 kB | 23.1 kB | | retry | 0.13.1 | 18.85 kB | 18.85 kB | | jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB | | crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB | | koalas | 1.0.2 | 6.47 kB | 6.47 kB | | path-to-regexp | 0.1.10 | 6.38 kB | 6.38 kB | | module-details-from-path | 1.0.3 | 4.47 kB | 4.47 kB |🤖 This report was automatically generated by heaviest-objects-in-the-universe |
this.eventSerializer = new EventSerializer() | ||
return profile | ||
return () => thatEventSerializer.createProfile(startDate, endDate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: why defer profiler creation here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It operates on the detached instance of EventSerializer, so any errors encountered should be nonfatal for the profiler operation. Since we no longer treat errors during encoding as fatal, moving it into encoding phase upholds the non-fatality of any errors that might arise. Yes, all it does is create a Profile
object, and its constructor has not much in way of being able to throw errors, but on principle it's better it's there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM !
BenchmarksBenchmark execution time: 2024-10-15 14:29:15 Comparing candidate commit ab5e38f in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 260 metrics, 6 unstable metrics. |
cbe194f
to
ab5e38f
Compare
Had to add another commit, the profiler must stop with error on first collect when there are no profilers configured otherwise the |
What does this PR do?
_submit
method, as well as a needless test for it.Motivation
Originally I was investigating a customer support case of profiles not being submitted. I revisited various code paths that could result in the profiler being stopped. At first I suspected we have a bug that would cause the profiler to stop if it encountered an error sending the HTTP request. This turned out not to be the case, fortunately, and I added a unit test to confirm it.
At the same time, I realized that profile encoding errors on the other hand are treated as fatal and they shouldn't be.
To understand this, it's useful to know that the profiler has three distinct operations when its async operation fires every minute: Profiles are first gathered (in some internal data structure specific to profile type), then they are encoded (internal data structures serialized to gzipped pprof binary data), finally they are sent to the agent over HTTP. Of these operations, errors during gathering and encoding phases would case the profiler to stop entirely. (Even if only gathering of one type of profiles, e.g. CPU errored out, other sub-profiles, e.g. heap and event would be stopped too.)
It makes sense to stop profiling if gathering operation errors out – it is possible that something stateful in the sub-profilers can't recover from those. It also makes sense to not stop profiling when there's an error during send: these are in vast majority HTTP 502 and 503 errors, both transient.
Stopping profilers on encoding errors is too harsh, though. Encoding is a stateless operation that transforms immutable profiling data from some internal representation to gzipped pprof file. If one of those errors out (we don't currently have evidence of such errors), we should still try to submit other profile types (e.g. if events profile encoding fails, but CPU and heap profiles succeeded, we should still send them.) This is the main aspect of this change.
Additional Notes
JIRA ticket: PROF-10717