Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't stop the profiler if encoding a profile fails #4779

Merged
merged 4 commits into from
Oct 16, 2024

Conversation

szegedi
Copy link
Contributor

@szegedi szegedi commented Oct 14, 2024

What does this PR do?

  • Profiler no longer treats errors during profile encoding to be fatal and no longer stops profiling if they occur. It will submit all the profiles that were successfully encoded.
  • As a minor improvement, profiler cycle restart is moved to an earlier stage in the gather-encode-send cycle (between gather and encode, it used to be between encode and submit), reducing the cadence drift. (Not that it's a big problem.)
  • As another minor improvement, timeline events profiler defers creation of the Profile object until encoding, as it's a stateless operation that doesn't affect the next cycle of the profiler and can thus be in the now no longer fatal encoding path.
  • Adds a test confirming encoding errors no longer stop profiling.
  • Adds a test confirming sending errors do not stop profiling.
  • Removes a check of an always-holding invariant from internal private _submit method, as well as a needless test for it.

Motivation

Originally I was investigating a customer support case of profiles not being submitted. I revisited various code paths that could result in the profiler being stopped. At first I suspected we have a bug that would cause the profiler to stop if it encountered an error sending the HTTP request. This turned out not to be the case, fortunately, and I added a unit test to confirm it.

At the same time, I realized that profile encoding errors on the other hand are treated as fatal and they shouldn't be.

To understand this, it's useful to know that the profiler has three distinct operations when its async operation fires every minute: Profiles are first gathered (in some internal data structure specific to profile type), then they are encoded (internal data structures serialized to gzipped pprof binary data), finally they are sent to the agent over HTTP. Of these operations, errors during gathering and encoding phases would case the profiler to stop entirely. (Even if only gathering of one type of profiles, e.g. CPU errored out, other sub-profiles, e.g. heap and event would be stopped too.)

It makes sense to stop profiling if gathering operation errors out – it is possible that something stateful in the sub-profilers can't recover from those. It also makes sense to not stop profiling when there's an error during send: these are in vast majority HTTP 502 and 503 errors, both transient.

Stopping profilers on encoding errors is too harsh, though. Encoding is a stateless operation that transforms immutable profiling data from some internal representation to gzipped pprof file. If one of those errors out (we don't currently have evidence of such errors), we should still try to submit other profile types (e.g. if events profile encoding fails, but CPU and heap profiles succeeded, we should still send them.) This is the main aspect of this change.

Additional Notes

JIRA ticket: PROF-10717

Copy link

github-actions bot commented Oct 14, 2024

Overall package size

Self size: 7.47 MB
Deduped: 63.33 MB
No deduping: 63.61 MB

Dependency sizes | name | version | self size | total size | |------|---------|-----------|------------| | @datadog/native-appsec | 8.1.1 | 18.67 MB | 18.68 MB | | @datadog/native-iast-taint-tracking | 3.1.0 | 12.27 MB | 12.28 MB | | @datadog/pprof | 5.3.0 | 9.85 MB | 10.22 MB | | protobufjs | 7.2.5 | 2.77 MB | 5.16 MB | | @datadog/native-iast-rewriter | 2.5.0 | 2.51 MB | 2.59 MB | | @opentelemetry/core | 1.14.0 | 872.87 kB | 1.47 MB | | @datadog/native-metrics | 2.0.0 | 898.77 kB | 1.3 MB | | @opentelemetry/api | 1.8.0 | 1.21 MB | 1.21 MB | | jsonpath-plus | 10.0.0 | 659.84 kB | 1.12 MB | | import-in-the-middle | 1.11.2 | 112.74 kB | 826.22 kB | | msgpack-lite | 0.1.26 | 201.16 kB | 281.59 kB | | opentracing | 0.14.7 | 194.81 kB | 194.81 kB | | pprof-format | 2.1.0 | 111.69 kB | 111.69 kB | | @datadog/sketches-js | 2.1.0 | 109.9 kB | 109.9 kB | | semver | 7.6.3 | 95.82 kB | 95.82 kB | | lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB | | lru-cache | 7.14.0 | 74.95 kB | 74.95 kB | | ignore | 5.3.1 | 51.46 kB | 51.46 kB | | int64-buffer | 0.1.10 | 49.18 kB | 49.18 kB | | shell-quote | 1.8.1 | 44.96 kB | 44.96 kB | | istanbul-lib-coverage | 3.2.0 | 29.34 kB | 29.34 kB | | rfdc | 1.3.1 | 25.21 kB | 25.21 kB | | tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB | | limiter | 1.1.5 | 23.17 kB | 23.17 kB | | dc-polyfill | 0.1.4 | 23.1 kB | 23.1 kB | | retry | 0.13.1 | 18.85 kB | 18.85 kB | | jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB | | crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB | | koalas | 1.0.2 | 6.47 kB | 6.47 kB | | path-to-regexp | 0.1.10 | 6.38 kB | 6.38 kB | | module-details-from-path | 1.0.3 | 4.47 kB | 4.47 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

this.eventSerializer = new EventSerializer()
return profile
return () => thatEventSerializer.createProfile(startDate, endDate)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: why defer profiler creation here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It operates on the detached instance of EventSerializer, so any errors encountered should be nonfatal for the profiler operation. Since we no longer treat errors during encoding as fatal, moving it into encoding phase upholds the non-fatality of any errors that might arise. Yes, all it does is create a Profile object, and its constructor has not much in way of being able to throw errors, but on principle it's better it's there.

nsavoire
nsavoire previously approved these changes Oct 15, 2024
Copy link
Collaborator

@nsavoire nsavoire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

@pr-commenter
Copy link

pr-commenter bot commented Oct 15, 2024

Benchmarks

Benchmark execution time: 2024-10-15 14:29:15

Comparing candidate commit ab5e38f in PR branch szegedi/keep-on-profiling with baseline commit e453243 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 260 metrics, 6 unstable metrics.

@szegedi
Copy link
Contributor Author

szegedi commented Oct 15, 2024

Had to add another commit, the profiler must stop with error on first collect when there are no profilers configured otherwise the profiler/control benchmark does not terminate.

@szegedi szegedi merged commit 59eb9a7 into master Oct 16, 2024
198 checks passed
@szegedi szegedi deleted the szegedi/keep-on-profiling branch October 16, 2024 14:16
This was referenced Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants