internal/servers/controller: Worker failure connection cleanup #1340

vancluever · 2021-06-22T01:52:02Z

This commit adds support for the following:

* Mark connections for non-reporting workers as closed. This is the controller counterpart to the worker functionality (see internal/servers/worker: Controller failure connection cleanup #1330). It is written as a scheduled job that does most of the work DB-side, save some rudimentary checking of individual workers' last update times.
* Reconcile states if such a broken controller-worker connection resumes and a worker reports a connection as connected that should be disconnected. In this case, the controller sends an update request, and the worker honors it and terminates the connection.
* Further refine the grace period setting. We have converged on the current server "liveness" setting as our default, which is 15 seconds (half of the previous 30s). This is now also configurable on the controller and worker side, with the caveat that it's currently impossible to do so in config, as the setting has been untagged in HCL. It is exposed so that we can run some sophisticated testing scenarios where we skew the grace period toward either the controller or the worker to ensure the aforementioned reconciliation works.
* Some repository functions have been added to support the new functionality.

internal/cmd/config/config.go

internal/servers/controller/controller.go

internal/servers/controller/handlers/workers/worker_service.go

vancluever · 2021-06-23T00:41:10Z

@jefferai feedback all makes sense, will work on applying it tomorrow!

vancluever · 2021-06-24T18:34:35Z

@jefferai this is ready for review again. I'll check in a bit to see if tests passed. 🤞

internal/servers/controller/testing.go

louisruch

Not sure if @jefferai wanted to review this again before merging but this LGTM

This commit adds the support to do the following:

* Mark connections for non-reporting workers as closed. This is the controller counterpart to the worker functionality (see #1330). This is written as a scheduled job that does the work DB-side in a single atomic query.
* Works to reconcile states if such a broken controller-worker connection resumes and a worker reports a connection as connected that should be disconnected. In this case, the controller will send an update request, and the worker will honor it and terminate the connection.
* Further refinement of the grace period setting has been added here. We have converged on the current server "liveness" setting as our default here, which is half of the previous 30s (15 seconds, in other words). Additionally, this is now configurable on the controller and worker side, with the caveat that it's currently impossible to do so in config as the setting has been untagged in HCL. This is exposed so that we can run some sophisticated testing scenarios where we skew the grace period to either the controller or worker to ensure the aforementioned reconciliation works.
* Some repository functions have been added to support the new functionality, in addition to some test code to the worker to allow querying of session state while testing.
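The reconciliation step in the second bullet can be sketched as follows. This is an illustrative model only, under the assumption that both sides track a per-connection state keyed by connection ID. The types and functions (`connState`, `changeRequest`, `reconcile`, `applyRequests`) are hypothetical and are not taken from the Boundary code base.

```go
package main

import "sort"

// connState is a hypothetical connection state label.
type connState string

const (
	stateConnected connState = "connected"
	stateClosed    connState = "closed"
)

// changeRequest is a hypothetical update request the controller sends back
// to a worker, asking it to move a connection to WantState.
type changeRequest struct {
	ConnectionID string
	WantState    connState
}

// reconcile compares the states a worker reports against the controller's
// own records. For every connection the worker still reports as connected
// but the controller has recorded as closed, it emits a request to close it.
func reconcile(reported, recorded map[string]connState) []changeRequest {
	var reqs []changeRequest
	for id, rep := range reported {
		rec, known := recorded[id]
		if known && rep == stateConnected && rec == stateClosed {
			reqs = append(reqs, changeRequest{ConnectionID: id, WantState: stateClosed})
		}
	}
	// Sort for deterministic output, since map iteration order is random.
	sort.Slice(reqs, func(i, j int) bool {
		return reqs[i].ConnectionID < reqs[j].ConnectionID
	})
	return reqs
}

// applyRequests models the worker side: it honors each request by updating
// its local state for connections it knows about.
func applyRequests(local map[string]connState, reqs []changeRequest) {
	for _, r := range reqs {
		if _, ok := local[r.ConnectionID]; ok {
			local[r.ConnectionID] = r.WantState
		}
	}
}
```

In this model a worker that resumes reporting after the controller's cleanup job has run receives a request for each stale connection and closes it on its next status exchange.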

louisruch

LGTM

internal/servers/controller/testing.go

internal/servers/worker/status.go

internal/session/query.go

internal/session/repository_connection.go

* internal/servers/controller: refactor WaitForNextWorkerStatusUpdate
* internal/servers/worker: use connId instead of conn.GetConnectionId()
* internal/servers/worker: remove Worker.logClose
* internal/session: complement versus inclusive state search
* internal/session: use ScanRows instead of Scan
* internal/session: make closeConnectionsForDeadServersCte results gormable
* internal/tests/cluster: rename TestWorkerSessionCleanup to TestSessionCleanup
* Simplify SQL query for CloseConnectionsForDeadWorkers (#1410)

Co-authored-by: Michael Gaffney <mgaffney@users.noreply.github.com>

vancluever requested a review from jefferai June 22, 2021 01:52

github-actions bot added core core/servers core/session labels Jun 22, 2021

vancluever force-pushed the vancluever/session-cleanup-controller branch from 709704e to 236fa0e Compare June 22, 2021 17:27

vercel bot temporarily deployed to Preview June 22, 2021 17:27 Inactive

Base automatically changed from vancluever/session-cleanup-worker to vancluever/session-cleanup June 22, 2021 19:18

vancluever force-pushed the vancluever/session-cleanup-controller branch from 236fa0e to 33929c2 Compare June 22, 2021 20:04

vercel bot temporarily deployed to Preview June 22, 2021 20:04 Inactive

jefferai reviewed Jun 22, 2021

internal/cmd/config/config.go

jefferai reviewed Jun 22, 2021

internal/servers/controller/controller.go (outdated)

jefferai reviewed Jun 22, 2021

internal/servers/controller/controller.go (outdated)

jefferai reviewed Jun 22, 2021

internal/servers/controller/handlers/workers/worker_service.go (outdated)

vancluever force-pushed the vancluever/session-cleanup-controller branch from 33929c2 to 5549b1a Compare June 23, 2021 17:45

vercel bot temporarily deployed to Preview June 23, 2021 17:45 Inactive

vancluever force-pushed the vancluever/session-cleanup-controller branch from 5549b1a to 7f0a27a Compare June 23, 2021 22:11

vercel bot temporarily deployed to Preview June 23, 2021 22:11 Inactive

vancluever force-pushed the vancluever/session-cleanup-controller branch from 7f0a27a to c6b3047 Compare June 23, 2021 23:52

vercel bot temporarily deployed to Preview June 23, 2021 23:52 Inactive

vancluever force-pushed the vancluever/session-cleanup-controller branch from c6b3047 to db2a864 Compare June 24, 2021 02:05

vercel bot temporarily deployed to Preview June 24, 2021 02:05 Inactive

github-actions bot added the core/sql label Jun 24, 2021

vancluever force-pushed the vancluever/session-cleanup-controller branch from db2a864 to 5ed52f6 Compare June 24, 2021 18:25

vercel bot temporarily deployed to Preview June 24, 2021 18:25 Inactive

vancluever force-pushed the vancluever/session-cleanup-controller branch from 5ed52f6 to c6f1158 Compare June 24, 2021 18:28

vercel bot temporarily deployed to Preview June 24, 2021 18:28 Inactive

vancluever requested a review from jefferai June 24, 2021 18:34

vancluever force-pushed the vancluever/session-cleanup-controller branch from c6f1158 to e73e6d4 Compare June 24, 2021 19:36

louisruch reviewed Jul 13, 2021

internal/servers/controller/testing.go (outdated)

louisruch previously approved these changes Jul 13, 2021

vancluever added 10 commits July 13, 2021 09:51

Apply suggestion from @jefferai

095f4c6

Add tests for updated WaitForNextSuccessfulStatusUpdate

14f7ceb

internal/servers/controller: set status to total connections closed

b4d95e1

internal/servers: remove redundant liveness check for query building

6a39bcc

controller, worker: remove redundant StatusGracePeriodDuration testing opt

97bdf98

internal/session: add rows.Close() to a few DB calls

e2883c2

db: add timestamp subtraction functions

89dd6c0

Revert "internal/servers: remove redundant liveness check for query building"

c9e48c3

This reverts commit 0dc2d4606dcb12478906dc4d9b6a7cf6c2e0de14.

Use select for WaitForNextWorkerStatusUpdate

b6ac810

vancluever dismissed louisruch’s stale review via b6ac810 July 13, 2021 16:51

vancluever force-pushed the vancluever/session-cleanup-controller branch from ebc0b96 to b6ac810 Compare July 13, 2021 16:51

vercel bot temporarily deployed to Preview July 13, 2021 16:51 Inactive

vancluever requested a review from louisruch July 13, 2021 21:54

louisruch approved these changes Jul 14, 2021

vancluever merged commit 5a70875 into main Jul 14, 2021

vancluever deleted the vancluever/session-cleanup-controller branch July 14, 2021 16:14