[PROF-11003] Fix unsafe initialization when using profiler with otel tracing #4195

ivoanjo · 2024-12-05T12:26:29Z

What does this PR do?

This PR fixes two issues in the profiler's support for reading from the opentelemetry ("otel") context:

If during the initialization of our otel reading code an exception was raised, we tried to rescue it but did not properly clean it up so it could still confuse the app/cause weird behaviors
Initialization of the otel reading code could happen during an allocation sample, where it's not safe to run random Ruby code (including throwing exceptions)

Motivation:

I suspect these issues may be linked to a customer crash:

[BUG] unexpected situation - recordd:1 current:0

-- C level backtrace information -------------------------------------------
ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820
ruby(rb_vm_bugreport) vm_dump.c:1151
ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042
ruby(rb_bug_without_die) error.c:1042
ruby(die+0x0) [0x55ba03ac0998] error.c:1050
ruby(rb_bug) error.c:1052
ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226
ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144
ruby(rb_ec_tag_state) eval_intern.h:155
ruby(rb_vm_exec) vm.c:2484
ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509
ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728
ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598
ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615
ruby(thread_start_func_2) thread.c:672
ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187
/lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353]

...but I could not reproduce it myself. Nevertheless, the things being fixed were still definitely bugs ;)

Change log entry

Yes. Fix unsafe initialization when using profiler with otel tracing

Additional Notes:

N/A

How to test the change?

This change includes test coverage (took longer than the fixes to code :P)

The docs for `rb_protect` clearly say that we must call `rb_set_errinfo (Qnil)` if we want the exception to be cleanly ignored.

When called from inside an allocation (NEWOBJ tracepoint), it's not safe to allocate further new objects (including exceptions to be raised). One such example of an allocation is when calling `read_otel_current_span_key_const` to initialize the otel span key. Thus, this commit introduces the `is_safe_to_allocate_objects` flag (and plumbs it around a bunch of methods...) so that we can gate calls to `read_otel_current_span_key_const` and not perform them when they're not safe.

I was testing on 3.1 so I missed this one >_>

github-actions · 2024-12-05T12:26:42Z

Thank you for updating Change log entry section 👏

^{Visited at: 2024-12-05 14:06:54 UTC}

spec/datadog/profiling/collectors/thread_context_spec.rb

datadog-datadog-prod-us1 · 2024-12-05T12:37:19Z

Datadog Report

Branch report: ivoanjo/prof-11003-fix-profiler-otel-initialization
Commit report: 47b1706
Test service: dd-trace-rb

✅ 0 Failed, 22411 Passed, 1459 Skipped, 5m 51.73s Total Time

pr-commenter · 2024-12-05T12:49:02Z

Benchmarks

Benchmark execution time: 2024-12-05 12:48:45

Comparing candidate commit 47b1706 in PR branch ivoanjo/prof-11003-fix-profiler-otel-initialization with baseline commit f354358 in branch master.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 30 metrics, 2 unstable metrics.

scenario:profiler - sample timeline=false

🟥 throughput [-0.511op/s; -0.463op/s] or [-7.901%; -7.172%]

codecov-commenter · 2024-12-05T12:59:17Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.75%. Comparing base (f354358) to head (47b1706).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4195      +/-   ##
==========================================
- Coverage   97.76%   97.75%   -0.01%     
==========================================
  Files        1357     1357              
  Lines       81890    81914      +24     
  Branches     4164     4164              
==========================================
+ Hits        80060    80076      +16     
- Misses       1830     1838       +8

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This checker is used to detect accidental thread scheduling switching points happening during profiling sampling. See the bigger comment in unsafe_api_calls_check.h . I was able to check that this checker correctly triggers for the bug in #4195, and also the bug I'm going to fix next, which is the use of `rb_hash_lookup` in the otel context reading code.

`rb_hash_lookup` calls `#hash` on the key being looked up so it's safe to use unless during sampling. This can cause the same issue as we saw in #4195 leading to ``` [BUG] unexpected situation - recordd:1 current:0 -- C level backtrace information ------------------------------------------- ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820 ruby(rb_vm_bugreport) vm_dump.c:1151 ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042 ruby(rb_bug_without_die) error.c:1042 ruby(die+0x0) [0x55ba03ac0998] error.c:1050 ruby(rb_bug) error.c:1052 ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226 ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144 ruby(rb_ec_tag_state) eval_intern.h:155 ruby(rb_vm_exec) vm.c:2484 ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509 ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728 ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598 ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615 ruby(thread_start_func_2) thread.c:672 ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187 /lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353] ```

This checker is used to detect accidental thread scheduling switching points happening during profiling sampling. See the bigger comment in unsafe_api_calls_check.h . I was able to check that this checker correctly triggers for the bug in #4195, and also the bug I'm going to fix next, which is the use of `rb_hash_lookup` in the otel context reading code.

`rb_hash_lookup` calls `#hash` on the key being looked up so it's safe to use unless during sampling. This can cause the same issue as we saw in #4195 leading to ``` [BUG] unexpected situation - recordd:1 current:0 -- C level backtrace information ------------------------------------------- ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820 ruby(rb_vm_bugreport) vm_dump.c:1151 ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042 ruby(rb_bug_without_die) error.c:1042 ruby(die+0x0) [0x55ba03ac0998] error.c:1050 ruby(rb_bug) error.c:1052 ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226 ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144 ruby(rb_ec_tag_state) eval_intern.h:155 ruby(rb_vm_exec) vm.c:2484 ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509 ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728 ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598 ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615 ruby(thread_start_func_2) thread.c:672 ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187 /lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353] ```

* add supported version script and table * update script * rubocop lint * modify script locations, add description to md table * improve table output * wip * refactor code * wip * add supported versions * add branch for testing * remove json to avoid merge conflict issues * update PR body * Update .github/scripts/generate_table_versions.rb Co-authored-by: Steven Bouwkamp <stevenbouwkamp@gmail.com> * switch to use gem declarations instead of hardcoded mappings * linting checks * cleanup comments * refactor code * cleanup code * Combine duplicate option table rows The documentation for instrumenting rake had two rows for the same option key. This consolidates those entries into a single row. * Enable type checking for AgentSettingsResolver/AgentSettings Steep doesn't seem to be a big fan of Structs so I just went ahead and turned the `AgentSettings` into a regular class that's equivalent to the struct we had before. (In particular, I decided to still keep every field as optional). Ideally this would be a `Data` class, but we're far from dropping support for Rubies that don't have it. * Move url building behavior from `AgentBaseUrl` to `AgentSettings` This is preparation to also share this behavior with profiling. * Refactor crashtracking to use `AgentSettings#url` The behavior from the old `AgentBaseUrl` is now contained in `AgentSettings` so we can clean up the extra logic. * [PROF-11078] Fix profiling exception when agent url is an ipv6 address **What does this PR do?** This PR builds atop #4237 and fixes a similar-ish issue in the profiler caused by the same mishandling of ipv6 addresses. In particular, when provided with an ipv6 address in the agent url, the profiler would fail with an exception: ``` $ env DD_AGENT_HOST=2001:db8:1::2 DD_PROFILING_ENABLED=true \ bundle exec ddprofrb exec ruby -e "sleep 2" dd-trace-rb/lib/datadog/profiling/http_transport.rb:27:in `initialize': Failed to initialize transport: invalid authority (ArgumentError) ``` **Motivation:** Luckily we didn't have any customers using this, as it fails immediately and loudly, but it's still a bug on a configuration that should be supported. **Additional Notes:** Since we had similar buggy logic copy-pasted in crashtracking and profiling (crashtracking had been fixed in #4237) I chose to extract out the relevant logic into the `AgentSettings` class, so that both can reuse it. **How to test the change?** I've added unit test coverage for this issue to profiling, and the snippet above can be used to end-to-end test it's working fine. Here's how it looks on my machine now: ``` E, [2025-01-02T17:32:32.398756 #359317] ERROR -- datadog: [datadog] (dd-trace-rb/lib/datadog/profiling/http_transport.rb:68:in `export') Failed to report profiling data (agent: http://[2001:db8:1::2]:8126/): failed ddog_prof_Exporter_send: error trying to connect: tcp connect error: Network is unreachable (os error 101): tcp connect error: Network is unreachable (os error 101): Network is unreachable (os error 101) ``` E.g. we correctly try to connect to the dummy address, and fail :) (Note: The error message is a bit ugly AND repeats itself a bit. That's being tracked separately in DataDog/libdatadog#283 ) * Implement `==` for new `AgentSettings` class Forgot this one, some of our tests relied on it! * use Ruby 3.4.1 for test-memcheck GHA * Update exceptions file with another variant of thread creation memory leak Since our exceptions match on the stack, they are affected by internal naming changes, and it looks like a new `ruby_xcalloc_body` function is now showing up in the stack. * Introduce Ruby 3.5 gemfile variant for testing with dev builds This is waaay incomplete in terms of adding support for Ruby 3.5 but should get us going for ASAN testing for now. * Update list of files used to compute cache checksum In practice this shouldn't make a difference, since the final lockfiles are supposed to be a superset of the root-level gemfile BUT the `Appraisals` file no longer exists anyway and "just in case" let's have it anyway as it seems more correct. * Bump Ruby 3.4 integration image to stable version * Remove workaround for strscan issue This is not expected to be an issue in 3.5 (and is probably fixed for 3.4 as well, but I'll leave that for a separate PR to not affect the appraisals). * Add unsafe api calls checker to track down issues such as #4195 This checker is used to detect accidental thread scheduling switching points happening during profiling sampling. See the bigger comment in unsafe_api_calls_check.h . I was able to check that this checker correctly triggers for the bug in #4195, and also the bug I'm going to fix next, which is the use of `rb_hash_lookup` in the otel context reading code. * Fix going into Ruby code when looking up otel context `rb_hash_lookup` calls `#hash` on the key being looked up so it's safe to use unless during sampling. This can cause the same issue as we saw in #4195 leading to ``` [BUG] unexpected situation - recordd:1 current:0 -- C level backtrace information ------------------------------------------- ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820 ruby(rb_vm_bugreport) vm_dump.c:1151 ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042 ruby(rb_bug_without_die) error.c:1042 ruby(die+0x0) [0x55ba03ac0998] error.c:1050 ruby(rb_bug) error.c:1052 ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226 ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144 ruby(rb_ec_tag_state) eval_intern.h:155 ruby(rb_vm_exec) vm.c:2484 ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509 ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728 ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598 ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615 ruby(thread_start_func_2) thread.c:672 ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187 /lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353] ``` * Avoid trying to sample allocations when VM is raising exception During my experiments to reproduce issues around allocation profiling, I've noted that the VM is in an especially delicate state during exception raising, so let's just decline to sample in this situation. * Update tests with new signatures for test methods * Check if symbol is static before calling SYM2ID on it It occurs to me that if a symbol is dynamic, we were causing it to become a static symbol (e.g. making it never be able to be garbage collected). This can be very bad! And also, we know the symbol we're looking for must be a static symbol because if nothing else, our initialization caused it to become a static symbol. Thus, if we see a dynamic symbol, we can stop there, since by definition it won't be the symbol we're looking after. This is... really awkward to add a specific unit test for, so I've just relied on our existing test coverage to show that this has not affected the correctness of our otel code. * Document that unsafe api calls checker is only for test code * Add 3.4 support * Update DevelopmentGuide * Remove `racc` gem from 3.3 and 3.4 appraisal files * [🤖] Lock Dependency: https://github.com/DataDog/dd-trace-rb/actions/runs/12595964519 * Remove strscan specification in 3.4 gemfile * [🤖] Lock Dependency: https://github.com/DataDog/dd-trace-rb/actions/runs/12595969993 * add hardcoded update workflow file --------- Co-authored-by: Steven Bouwkamp <stevenbouwkamp@gmail.com> Co-authored-by: Bradley Schaefer <soulcutter@soulcutter.com> Co-authored-by: Ivo Anjo <ivo.anjo@datadoghq.com> Co-authored-by: Andrey Marchenko <andrey.marchenko@datadoghq.com> Co-authored-by: Sarah Chen <sarah.chen@datadoghq.com> Co-authored-by: ivoanjo <ivoanjo@users.noreply.github.com>

ivoanjo added 4 commits December 5, 2024 12:10

Bugfix: Clear pending exception after ignoring it

383afbb

The docs for `rb_protect` clearly say that we must call `rb_set_errinfo (Qnil)` if we want the exception to be cleanly ignored.

Minor rubocop fixes

10e4def

Fix missing argument in function that's only used on Ruby 3.2+

47b1706

I was testing on 3.1 so I missed this one >_>

ivoanjo requested review from a team as code owners December 5, 2024 12:26

github-actions bot added the profiling Involves Datadog profiling label Dec 5, 2024

datadog-datadog-prod-us1 bot reviewed Dec 5, 2024

View reviewed changes

spec/datadog/profiling/collectors/thread_context_spec.rb Show resolved Hide resolved

spec/datadog/profiling/collectors/thread_context_spec.rb Show resolved Hide resolved

ivoanjo added bug Involves a bug otel OpenTelemetry-related changes labels Dec 5, 2024

anmarchenko approved these changes Dec 5, 2024

View reviewed changes

ivoanjo merged commit 6f2057b into master Dec 5, 2024
348 checks passed

ivoanjo deleted the ivoanjo/prof-11003-fix-profiler-otel-initialization branch December 5, 2024 14:07

github-actions bot added this to the 2.8.0 milestone Dec 5, 2024

ivoanjo mentioned this pull request Dec 10, 2024

Bump to version 2.8.0 #4215

Merged

ivoanjo mentioned this pull request Dec 19, 2024

[PROF-11003] Fix allocation profiling + otel tracing causing Ruby crash #4240

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PROF-11003] Fix unsafe initialization when using profiler with otel tracing #4195

[PROF-11003] Fix unsafe initialization when using profiler with otel tracing #4195

ivoanjo commented Dec 5, 2024 •

edited

Loading

github-actions bot commented Dec 5, 2024 •

edited

Loading

datadog-datadog-prod-us1 bot commented Dec 5, 2024 •

edited

Loading

pr-commenter bot commented Dec 5, 2024

codecov-commenter commented Dec 5, 2024

[PROF-11003] Fix unsafe initialization when using profiler with otel tracing #4195

[PROF-11003] Fix unsafe initialization when using profiler with otel tracing #4195

Conversation

ivoanjo commented Dec 5, 2024 • edited Loading

github-actions bot commented Dec 5, 2024 • edited Loading

datadog-datadog-prod-us1 bot commented Dec 5, 2024 • edited Loading

Datadog Report

pr-commenter bot commented Dec 5, 2024

Benchmarks

scenario:profiler - sample timeline=false

codecov-commenter commented Dec 5, 2024

Codecov Report

ivoanjo commented Dec 5, 2024 •

edited

Loading

github-actions bot commented Dec 5, 2024 •

edited

Loading

datadog-datadog-prod-us1 bot commented Dec 5, 2024 •

edited

Loading