Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEBUG-3533 increase UDS transport timeout to 30 seconds #4411

Merged
merged 5 commits into from
Feb 20, 2025

Conversation

p-datadog
Copy link
Member

@p-datadog p-datadog commented Feb 19, 2025

What does this PR do?

Increases timeout for unix domain socket transport from 1 to 30 seconds.

Motivation:

Shopist uses UDS and dd-trace-rb is able to submit traces to the agent via UDS with the current 1 second timeout but is not able to receive remote configuration with this timeout:

E, [2025-02-13T19:54:57.661215 #31] ERROR -- datadog: [datadog] (/usr/local/bundle/bundler/gems/dd-trace-rb-85f2c0f6f276/lib/datadog/core/remote/component.rb:53:in `rescue in block in initialize') remote worker error: Datadog::Core::Remote::Client::TransportError #<Datadog::Core::Transport::InternalErrorResponse:0x00007d1d6496f740>, error_type:Net::ReadTimeout error:Net::ReadTimeout location: /usr/local/bundle/bundler/gems/dd-trace-rb-85f2c0f6f276/lib/datadog/core/remote/client.rb:34:in `sync'. reseting client state

Increasing UDS timeout fixes the error and permits remote configuration to be retrieved.

Change log entry
Yes: increase timeout when unix domain socket is used for datadog agent

Additional Notes:

I investigated the timeout values and discovered the following: initially, there was no timeout set at all. In 2021 a timeout was added to both HTTP and UDS transports and it was set to 1 second for both. In 2024 the HTTP timeout was increased to 30 seconds with UDS timeout remaining unchanged at 1 second.

I confirmed that the agent may issue a synchronous request to the backend (over the internet) when remote configuration is requested by the tracer. Therefore the unix domain socket timeout should be the same as the HTTP timeout, which currently is 30 seconds for Ruby.

I think good timeout values for UDS are 5 or 10 seconds. We expect to not need to deal with a network however the agent does need time to process the request. I think if the HTTP timeout is left at 30 seconds, a 10 second timeout is reasonable for UDS; alternatively, a sensible combination in my mind would be 5 second timeout for UDS and 15 or 20 second timeout for HTTP.

Note that Heroku for example used to have (still has?) a fixed 30 second timeout for processing the entire request, and exceeding this time generally is a Bad Thing therefore in production having a 30 second timeout for any single component operation in a web app is often not ideal. Hence my thought to reduce the HTTP timeout from 30 to 15 or 20 seconds.

How to test the change?
Tested manually against shopist

@p-datadog p-datadog requested a review from a team as a code owner February 19, 2025 15:29
@github-actions github-actions bot added the core Involves Datadog core libraries label Feb 19, 2025
Copy link
Member

@Strech Strech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the issue is in receive timeout (maybe the payload and latency). Do this setting increases SO_RCVTIMEO or all type of timeouts?

@TonyCTHsu
Copy link
Contributor

Question: this value is already configurable from user, right? Does that really need to change the default?

@datadog-datadog-prod-us1
Copy link
Contributor

datadog-datadog-prod-us1 bot commented Feb 19, 2025

Datadog Report

Branch report: di-uds-timeout
Commit report: a10f126
Test service: dd-trace-rb

✅ 0 Failed, 20407 Passed, 1374 Skipped, 3m 19.42s Total Time

@pr-commenter
Copy link

pr-commenter bot commented Feb 19, 2025

Benchmarks

Benchmark execution time: 2025-02-19 22:10:24

Comparing candidate commit a10f126 in PR branch di-uds-timeout with baseline commit f710dac in branch master.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 30 metrics, 2 unstable metrics.

scenario:profiler - gvl benchmark samples

  • 🟥 throughput [-709.521op/s; -701.130op/s] or [-6.066%; -5.995%]

@p-datadog
Copy link
Member Author

@Strech It is the read timeout on the socket. Set in Ruby so probably affects Ruby IO layer + everything under it.

I also found out while answering this question that the timeout is defined in two places for some reason, I changed the other one also.

@tonyhsu The timeout is user-configurable but the default value for UDS does not actually work for remote config (as discovered on shopist).

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.70%. Comparing base (30f6e21) to head (67be929).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4411      +/-   ##
==========================================
- Coverage   97.71%   97.70%   -0.02%     
==========================================
  Files        1366     1366              
  Lines       83402    83402              
  Branches     4231     4231              
==========================================
- Hits        81497    81488       -9     
- Misses       1905     1914       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@p-datadog p-datadog changed the title DEBUG-3533 increase UDS transport timeout to 10 seconds DEBUG-3533 increase UDS transport timeout to 30 seconds Feb 19, 2025
@p-datadog
Copy link
Member Author

I confirmed that the agent may issue a synchronous request to the backend (over the internet) when remote configuration is requested by the tracer. Therefore the unix domain socket timeout should be the same as the HTTP timeout, which currently is 30 seconds for Ruby.

@p-datadog p-datadog added this to the 2.11.0 milestone Feb 19, 2025
@p-datadog p-datadog merged commit 62d691e into master Feb 20, 2025
501 checks passed
@p-datadog p-datadog deleted the di-uds-timeout branch February 20, 2025 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants