shared: don't infinitely retry on I/O timeout #8

bimmlerd · 2023-11-16T09:21:38Z

After the introduction of shared clients, we received reports of excessive CPU usage. Using a build with debug logging, we traced the cause to mishandling of "i/o timeout" errors, which read (or write) calls return once the read (or write) deadline of a connection has been exceeded.

In the existing code, once we hit such an I/O timeout, the receiving goroutine called 'ReadMsg' in a busy loop: the connection was not closed, but reads no longer succeeded either. No useful work was performed until a further send attempt reset the read deadline (cf. the 'SendContext' method). That reset, however, only happens if the retry occurs on the same five-tuple, which isn't guaranteed. Indeed, it seems more than conceivable that a DNS client might retry with a different source port.

To remedy the issue, this patch adds a check for the exceeded deadline. Unceremoniously, we shut down the handling goroutine if we hit this I/O timeout. This then closes the connection, which also shuts down the receiving goroutine.

Reported-by: John Watson <john@dctrwatson.com>

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

bimmlerd force-pushed the pr/bimmlerd/respect-exceeded-deadline branch from ed3163f to 085befa Compare November 16, 2023 09:23

bimmlerd added a commit to bimmlerd/cilium that referenced this pull request Nov 16, 2023

DO NOT MERGE: pull in dns fix

0b04982

Pulls in cilium/dns#8 directly, to generate a CI image to run. Signed-off-by: David Bimmler <david.bimmler@isovalent.com>

bimmlerd mentioned this pull request Nov 16, 2023

[DO NOT MERGE]: pull in dns fix cilium/cilium#29220

Closed

bimmlerd closed this Nov 20, 2023

bimmlerd deleted the pr/bimmlerd/respect-exceeded-deadline branch November 20, 2023 17:55
