Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLS certificate reload during connection setup can block pending PN_PROACTOR_TIMEOUT processing #1572

Open
kgiusti opened this issue Jul 18, 2024 · 0 comments
Assignees

Comments

@kgiusti
Copy link
Contributor

kgiusti commented Jul 18, 2024

This issue is causing this failure #1551

The "fix" for #1551 was to remove a heartbeat. That simply removed one possible case where delayed timer processing can cause timeout errors. The issue still exists.

Root cause: with the new tcp adaptor we've modified how TLS is configured. The desired behavior is to have changes to the related TLS certificate files be adopted when a new connection is established. This means that the management plane can update TLS cert files and the changes will take effect on the next new connection.

The router implementation of this is flawed. Currently there is no mechanism to alert the router that the cert files have been updated. Instead I've implemented the brain-dead approach of simply reloading the cert files unconditionally every time a new TLS connection is established.

This is bad: loading a cert file is a time consuming (thread blocking) operation. On my bare metal laptop I've observed this operation taking nearly a second to complete. Since the reload is taking place on the I/O thread (new connection event) that I/O thread can be blocked "outside" of proactor for a long time.

In the case of the test_80_balancing failure what is happening is the test is bringing up nearly 100 short-lived TCP client connections simultaneously. This is not unreasonable behavior as high connection rates will occur in the field. The problem is that this results in blocking the few (default 4) I/O threads in the cert reload call. This means there are no available threads for proactor to schedule other work - specifically expiration of the proactor timer.

IIUC proactor does not implement a thread scheduling priority, which means timer events won't take precedence over I/O events. Therefore during a heavily loaded I/O event it's possible the timer handler won't run in a timely manner.

TL:DR - under a heavy connection load TLS processing can delay servicing the system timer long enough to fail timed operations.

@kgiusti kgiusti self-assigned this Jul 18, 2024
kgiusti added a commit to kgiusti/skupper-router that referenced this issue Jul 30, 2024
…art 1: raw conn only)

This does not solve skupperproject#1572. It is part of a series of changes that will
address skupperproject#1572.
kgiusti added a commit to kgiusti/skupper-router that referenced this issue Aug 12, 2024
…art 1: raw conn only)

This does not solve skupperproject#1572. It is part of a series of changes that will
address skupperproject#1572.
kgiusti added a commit to kgiusti/skupper-router that referenced this issue Aug 15, 2024
This does not solve skupperproject#1572. It is part of a series of changes that will
address skupperproject#1572.
kgiusti added a commit to kgiusti/skupper-router that referenced this issue Aug 29, 2024
This does not solve skupperproject#1572. It is part of a series of changes that will
address skupperproject#1572.
kgiusti added a commit to kgiusti/skupper-router that referenced this issue Sep 23, 2024
This does not solve skupperproject#1572. It is part of a series of changes that will
address skupperproject#1572.
kgiusti added a commit to kgiusti/skupper-router that referenced this issue Sep 30, 2024
This does not solve skupperproject#1572. It is part of a series of changes that will
address skupperproject#1572.
kgiusti added a commit that referenced this issue Oct 1, 2024
This does not solve #1572. It is part of a series of changes that will
address #1572.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant