-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add on-by-default option to use TCP keepalive in the REST client. #5319
Conversation
9d5a6b4
to
8059abe
Compare
8059abe
to
5808eeb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly from documentation (https://everything.curl.dev/transfers/conn/keepalive.html), using the defaults for the rest of KEEPALIVE
options can buy as 60 sec (Idle time) + 60 sec (Probe interval) * 9 (Probe count) = 10 minutes of IDLE processing time on REST Server side.
I am worried this might not be enough time for a large ingestion (that's where we typically hit those "Connection reset by peer" type of errors), and I think it'd make sense to set a larger number of probes. In fact I don't see any downside to setting Probe count to 120
for example (and buy us 2 hours instead), as our goal here is not to detect broken connections as soon as possible, but to put our connection on top of the list of active connections in potential intermediate devices such as Firewalls etc for as long as possible before our query is done processing.
Let's also add a text description to this PR, and consider backport the change to 2.26 when approved. |
I don't understand this. The REST server doesn't have to respond by itself when receiving a keepalive packet, does it? Isn't the response to the TCP keepalive sent by the operating system's network stack? |
Sorry I wasn't clear. I am referring to case 2.4 here: https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html . The connection might be alive, but inactivity can make intermediate proxies and LBs decide to delete it as inactive. So sending keepalives for longer can reduce that probability, and sending them for only 10 minutes might not be enough. I've heard of ingestions lasting 1 hour or so in the past. Actually I can't think of another use case that Keepalives would be helpful in our case other than what I just described. Please keep me honest as I might be missing some other scenario. |
I think this ten minute timer gets reset every time a side receives and acknowledges a keepalive. Therefore an ingestion can last for longer. |
Sorry I wasn't clear. While the REST server is processing (e.g. writing to s3) an incoming WRITE query with large data (say 2GB) from the client, the TCP connection remains without any traffic. So that TCP connection, that is not necessarily between the client and the server, it might be between the client and a load balancer, can remain inactive for, say 20 mins, that it takes to process/ingest the incoming data. So even if the server replies with ACK in the first 10 minutes, it might need
You are right, I got it wrong. The default settings look ok then indeed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but let's add a PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved config.h and config_api_external.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left two suggestions but LGTM 👍
131e416
to
237632f
Compare
SC-56102
This PR enables by default TCP keepalive in the REST client connections, and adds an option to disable them. TCP keepalive is expected to reduce failure rates on large remote queries.
TYPE: FEATURE
DESC: Connections to TileDB Cloud use TCP keepalive by default.
TYPE: CONFIG
DESC: Add
rest.curl.tcp_keepalive
config option that controls using TCP keepalive for TileDB Cloud connections. It is enabled by default.