Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shared: don't infinitely retry on I/O timeout #8

Closed

Conversation

bimmlerd
Copy link
Member

After the introduction of shared clients, we received reports of excessive CPU usage. Through a debug logging build, the cause was identified to be mishandling of "i/o timeout" errors, which are returned from read (or write) calls when the read (or write) deadline of a connection are exceeded.

In the existing code, once we hit such an I/O timeout, the receiving goroutine was calling 'ReadMsg' in a busy loop, as the connection was not closed, but reads were not succeeding either. No useful work was performed anymore, until a further attempt at sending would reset the read deadline (c.f. the 'SendContext' method). This, however, depends on whether the retry occurs on the same five tuple, which isn't guaranteed. Indeed, it seems more than conceivable that a DNS client might retry with a different source port.

To remedy the issue, this patch adds a check for the exceeded deadline. Unceremoniously, we shut down the handling goroutine if we hit this I/O timeout. This then closes the connection, which also shuts down the receiving goroutine.

Reported-by: John Watson john@dctrwatson.com

After the introduction of shared clients, we received reports of
excessive CPU usage. Through a debug logging build, the cause was
identified to be mishandling of "i/o timeout" errors, which are returned
from read (or write) calls when the read (or write) deadline of a
connection are exceeded.

In the existing code, once we hit such an I/O timeout, the receiving
goroutine was calling 'ReadMsg' in a busy loop, as the connection was
not closed, but reads were not succeeding either. No useful work was
performed anymore, until a further attempt at sending would reset the
read deadline (c.f. the 'SendContext' method). This, however, depends on
whether the retry occurs on the same five tuple, which isn't guaranteed.
Indeed, it seems more than conceivable that a DNS client might retry
with a different source port.

To remedy the issue, this patch adds a check for the exceeded deadline.
Unceremoniously, we shut down the handling goroutine if we hit this I/O
timeout. This then closes the connection, which also shuts down the
receiving goroutine.

Reported-by: John Watson <john@dctrwatson.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/respect-exceeded-deadline branch from ed3163f to 085befa Compare November 16, 2023 09:23
bimmlerd added a commit to bimmlerd/cilium that referenced this pull request Nov 16, 2023
Pulls in cilium/dns#8 directly, to generate a
CI image to run.

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
@bimmlerd bimmlerd closed this Nov 20, 2023
@bimmlerd bimmlerd deleted the pr/bimmlerd/respect-exceeded-deadline branch November 20, 2023 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant