-
Notifications
You must be signed in to change notification settings - Fork 291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
internal/servers/controller: Worker failure connection cleanup #1340
Conversation
709704e
to
236fa0e
Compare
236fa0e
to
33929c2
Compare
@jefferai feedback all makes sense, will work, on applying it tomorrow! |
33929c2
to
5549b1a
Compare
5549b1a
to
7f0a27a
Compare
7f0a27a
to
c6b3047
Compare
c6b3047
to
db2a864
Compare
db2a864
to
5ed52f6
Compare
5ed52f6
to
c6f1158
Compare
@jefferai this is ready for review again. I'll check in a bit to see if tests passed. 🤞 |
c6f1158
to
e73e6d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if @jefferai wanted to review this again before merging but this LGTM
This commit adds the support to do the following: * Mark connections for non-reporting workers as closed. This is the controller counterpart to the worker functionality (see #1330). This is written as a scheduled job that does the work DB-side in a single atomic query. * Works to reconcile states if such a broken controller-worker connection resumes and a worker reports a connection as connected that should be disconnected. In this case, the controller will send an update request, and the worker will honor it and terminate the connection. * Further refinement of the grace period setting has been added here. We have converged on the current server "liveness" setting as our default here, which is half of the previous 30s (15 seconds, in other words). Additionally, this is now configurable on the controller and worker side, with the caveat that it's currently impossible to do so in config as the setting has been untagged in HCL. This is exposed so that we can run some sophisticated testing scenarios where we skew the grace period to either the controller or worker to ensure the aforementioned reconciliation works. * Some repository functions have been added to support the new functionality, in addition to some test code to the worker to allow querying of session state while testing.
…uilding" This reverts commit 0dc2d4606dcb12478906dc4d9b6a7cf6c2e0de14.
ebc0b96
to
b6ac810
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* internal/servers/controller: refactor WaitForNextWorkerStatusUpdate * internal/servers/worker: use connId instead of conn.GetConnectionId() * internal/servers/worker: remove Worker.logClose * internal/session: complement versus inclusive state search * internal/session: use ScanRows instead of Scan * internal/session: make closeConnectionsForDeadServersCte results gormable * internal/tests/cluster: rename TestWorkerSessionCleanup to TestSessionCleanup * Simplify SQL query for CloseConnectionsForDeadWorkers (#1410) Co-authored-by: Michael Gaffney <mgaffney@users.noreply.github.com>
This commit adds the support to do the following:
Mark connections for non-reporting workers as closed. This is the
controller counterpart to the worker functionality (see internal/servers/worker: Controller failure connection cleanup #1330). This is
written as a scheduled job that does most of the work DB-side, save some
rudimentary checking of individual workers' last update times.
Works to reconcile states if such a broken controller-worker
connection resumes and a worker reports a connection as connected that
should be disconnected. In this case, the controller will send an update
request, and the worker will honor it and terminate the connection.
Further refinement of the grace period setting has been added here.
We have converged on the current server "liveness" setting as our
default here, which is half of the previous 30s (15 seconds, in other
words). Additionally, this is now configurable on the controller and
worker side, with the caveat that it's currently impossible to do so in
config as the setting has been untagged in HCL. This is exposed so that
we can run some sophisticated testing scenarios where we skew the grace
period to either the controller or worker to ensure the aforementioned
reconciliation works.
Some repository functions have been added to support the new
functionality.