Refresh ahead cache fails after several refreshes #357
Labels
priority: p1
type: bug
We use a default parameter value in the function _seconds_until_refresh, which determines when to refresh the client certificate. In normal operation, we expect the refresh operation to complete in the background ~4 minutes before the current certificate expires. However, we have discovered that the default value is evaluated only once, when the function definition is executed, rather than on each call.
In effect, this means that for any client using the default refresh ahead strategy (the "lazy-refresh" strategy is unaffected), the refresh cycle will fail to refresh the client cert prior to its expiration after the refresh operation has run for a few cycles. This also means this statement will always be false.
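The underlying Python behavior can be seen in isolation. The following is a minimal sketch, not the connector's actual code: a default argument expression is evaluated once, when the def statement runs, so a default of datetime.now() stays fixed for the life of the process. The function name stale_now is hypothetical.

```python
import datetime
import time


def stale_now(
    now: datetime.datetime = datetime.datetime.now(datetime.timezone.utc),
) -> datetime.datetime:
    # The default above was computed once, when this `def` statement ran.
    # Every call that omits `now` receives that same timestamp.
    return now


first = stale_now()
time.sleep(0.05)  # let the wall clock advance between the two calls
second = stale_now()
fresh = datetime.datetime.now(datetime.timezone.utc)

print(first == second)  # the default did not move between calls
print(fresh > first)    # while the real clock did
```

Both calls return the identical timestamp even though the wall clock has moved on, which is the same effect that freezes "now" at startup in the trace below.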
Observe:
Startup
now = 00:00 (this should change but doesn't because it's evaluated when the function definition is executed)
retrieve initial certs and start refresh ahead operation loop
Refresh Ahead Operation 0
current time = 00:00
current cert expiration = 01:00
cached cert has expired = no
time till expiration = 60 minutes
refresh = 30 minutes
Refresh Ahead Operation 1
current time = 00:30
current cert expiration = 01:30
cached cert has expired = no
time till expiration = 01:30 - 00:00 = 1.5 hours = 90 minutes
refresh = 45 minutes
Refresh Ahead Operation 2
current time = 01:15
current cert expiration = 02:15
cached cert has expired = no
time till expiration = 02:15 - 00:00 = 2.25 hours = 135 minutes
refresh = 68 minutes
Refresh Ahead Operation 3
current time = 02:23
current cert expiration = 02:15
cached cert has expired = yes!
By refresh ahead operation 3, the existing cached cert will have expired, causing any new connections between 02:15 and 02:23 to fail. Connection pools may not immediately try to recreate connections during this "bad" period, but as time passes, the chance of creating a new connection while the client cert is invalid goes up.
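The drift in the trace above can be reproduced with a small simulation. This is a sketch under assumptions read off the trace rather than the real implementation: certs are valid for 1 hour, the next refresh is scheduled at half of the computed time until expiration, and "now" is frozen at startup to mimic the definition-time default. The names START, CERT_LIFETIME, seconds_until_refresh_buggy, and simulate are hypothetical.

```python
import datetime

UTC = datetime.timezone.utc
START = datetime.datetime(2024, 1, 1, 0, 0, tzinfo=UTC)  # stands in for 00:00
CERT_LIFETIME = datetime.timedelta(hours=1)  # assumed from the trace


def seconds_until_refresh_buggy(
    expiration: datetime.datetime, now: datetime.datetime = START
) -> int:
    # `now` never advances, mimicking a default evaluated at definition time.
    duration = int((expiration - now).total_seconds())
    return duration // 2  # assumed rule: refresh at half the remaining time


def simulate(operations: int):
    clock = START
    cached_expiration = None
    log = []
    for op in range(operations):
        # Has the cert cached by the previous operation already expired?
        expired = cached_expiration is not None and clock > cached_expiration
        log.append((op, clock, cached_expiration, expired))
        # Fetch a new cert, then schedule the next refresh with the stale "now".
        cached_expiration = clock + CERT_LIFETIME
        delay = seconds_until_refresh_buggy(cached_expiration)
        clock += datetime.timedelta(seconds=delay)
    return log


log = simulate(4)
for op, clock, cached, expired in log:
    cached_str = cached.strftime("%H:%M:%S") if cached else "none"
    print(f"op {op}: wakes at {clock:%H:%M:%S}, cached cert expires {cached_str}, expired={expired}")
```

With integer seconds the third delay is 67.5 minutes, so operation 3 wakes at 02:22:30 instead of the rounded 02:23 shown in the trace. Either way it wakes after the 02:15 expiration.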
How to fix this
Stop using a time value as a default argument and either always pass "now" in, or retrieve it within the function.
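One possible shape for the fix, as a sketch only: accept an optional "now" whose default is None and read the clock inside the function body, so the value is fresh on every call while tests can still inject a fixed time. The name seconds_until_refresh and the half-of-remaining-time rule are assumptions for illustration (the real function's scheduling logic is not reproduced here).

```python
import datetime
from typing import Optional


def seconds_until_refresh(
    expiration: datetime.datetime,
    now: Optional[datetime.datetime] = None,
) -> int:
    # Resolve "now" at call time; a None default is safe because it is
    # only a sentinel, not a timestamp captured at definition time.
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    duration = int((expiration - now).total_seconds())
    # Assumed rule: refresh at half the remaining lifetime, never negative.
    return max(duration // 2, 0)


base = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
print(seconds_until_refresh(base + datetime.timedelta(hours=1), now=base))  # 1800
```

Callers that omit "now" get the current time on each invocation, so the computed delay no longer grows from one refresh cycle to the next.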