Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/servers/controller: Worker failure connection cleanup #1340

Merged
merged 10 commits into from
Jul 14, 2021

Conversation

vancluever
Copy link
Contributor

This commit adds the support to do the following:

  • Mark connections for non-reporting workers as closed. This is the
    controller counterpart to the worker functionality (see internal/servers/worker: Controller failure connection cleanup #1330). This is
    written as a scheduled job that does most of the work DB-side, save some
    rudimentary checking of individual workers' last update times.

  • Works to reconcile states if such a broken controller-worker
    connection resumes and a worker reports a connection as connected that
    should be disconnected. In this case, the controller will send an update
    request, and the worker will honor it and terminate the connection.

  • Further refinement of the grace period setting has been added here.
    We have converged on the current server "liveness" setting as our
    default here, which is half of the previous 30s (15 seconds, in other
    words). Additionally, this is now configurable on the controller and
    worker side, with the caveat that it's currently impossible to do so in
    config as the setting has been untagged in HCL. This is exposed so that
    we can run some sophisticated testing scenarios where we skew the grace
    period to either the controller or worker to ensure the aforementioned
    reconciliation works.

  • Some repository functions have been added to support the new
    functionality.

@vancluever vancluever requested a review from jefferai June 22, 2021 01:52
@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from 709704e to 236fa0e Compare June 22, 2021 17:27
Base automatically changed from vancluever/session-cleanup-worker to vancluever/session-cleanup June 22, 2021 19:18
@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from 236fa0e to 33929c2 Compare June 22, 2021 20:04
@vancluever
Copy link
Contributor Author

@jefferai feedback all makes sense, will work, on applying it tomorrow!

@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from 33929c2 to 5549b1a Compare June 23, 2021 17:45
@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from 5549b1a to 7f0a27a Compare June 23, 2021 22:11
@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from 7f0a27a to c6b3047 Compare June 23, 2021 23:52
@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from c6b3047 to db2a864 Compare June 24, 2021 02:05
@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from db2a864 to 5ed52f6 Compare June 24, 2021 18:25
@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from 5ed52f6 to c6f1158 Compare June 24, 2021 18:28
@vancluever vancluever requested a review from jefferai June 24, 2021 18:34
@vancluever
Copy link
Contributor Author

@jefferai this is ready for review again. I'll check in a bit to see if tests passed. 🤞

@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from c6f1158 to e73e6d4 Compare June 24, 2021 19:36
louisruch
louisruch previously approved these changes Jul 13, 2021
Copy link
Collaborator

@louisruch louisruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if @jefferai wanted to review this again before merging but this LGTM

This commit adds the support to do the following:

* Mark connections for non-reporting workers as closed. This is the
controller counterpart to the worker functionality (see #1330). This is
written as a scheduled job that does the work DB-side in a single atomic
query.

* Works to reconcile states if such a broken controller-worker
connection resumes and a worker reports a connection as connected that
should be disconnected. In this case, the controller will send an update
request, and the worker will honor it and terminate the connection.

* Further refinement of the grace period setting has been added here.
We have converged on the current server "liveness" setting as our
default here, which is half of the previous 30s (15 seconds, in other
words). Additionally, this is now configurable on the controller and
worker side, with the caveat that it's currently impossible to do so in
config as the setting has been untagged in HCL. This is exposed so that
we can run some sophisticated testing scenarios where we skew the grace
period to either the controller or worker to ensure the aforementioned
reconciliation works.

* Some repository functions have been added to support the new
functionality, in addition to some test code to the worker to allow
querying of session state while testing.
…uilding"

This reverts commit 0dc2d4606dcb12478906dc4d9b6a7cf6c2e0de14.
@vancluever vancluever force-pushed the vancluever/session-cleanup-controller branch from ebc0b96 to b6ac810 Compare July 13, 2021 16:51
@vancluever vancluever requested a review from louisruch July 13, 2021 21:54
Copy link
Collaborator

@louisruch louisruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vancluever vancluever merged commit 5a70875 into main Jul 14, 2021
@vancluever vancluever deleted the vancluever/session-cleanup-controller branch July 14, 2021 16:14
vancluever added a commit that referenced this pull request Jul 23, 2021
* internal/servers/controller: refactor WaitForNextWorkerStatusUpdate

* internal/servers/worker: use connId instead of conn.GetConnectionId()

* internal/servers/worker: remove Worker.logClose

* internal/session: complement versus inclusive state search

* internal/session: use ScanRows instead of Scan

* internal/session: make closeConnectionsForDeadServersCte results gormable

* internal/tests/cluster: rename TestWorkerSessionCleanup to TestSessionCleanup

* Simplify SQL query for CloseConnectionsForDeadWorkers (#1410)

Co-authored-by: Michael Gaffney <mgaffney@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants