Add support to stop TiDB gracefully #2810
Conversation
Can we make sure that TiDB still receives connections before the LB removes the backend?
In #2597, we want the tidb-server to continue to receive connections before it's removed from the LB backend. For example, fail the readiness probe first but don't send the termination signal immediately.
I think so. When a pod is going to be deleted (DeletionTimestamp != nil), its IP is ignored by the endpoints controller and will be completely removed from the endpoints. ref: #2597 (comment) So we don't need to fail the readiness probe explicitly. Before being removed from the endpoints, TiDB can receive new connections; after being removed from the endpoints, TiDB can't receive new connections, but the old connections keep working as expected.
I think the LB that @tennix mentioned is the real load balancer, e.g. the NLB on AWS. It has its own probe for the target backend, which with our current configuration is also port 4000. However, with this PR, I think new connections will still be sent to the TiDB Pod that is being terminated, which will cause request failures because the NLB will not have removed it from its backends at that point. I think the key point is: will the NLB stop sending new connections to the terminating Pod in time?
Is my understanding correct? @tennix
I don't know why this request would fail. TiDB will not close the connection until the user closes it themselves.
For this description, please see this comment about the endpoints removal: #2597 (comment). We can't control the LB behavior in TiDB Operator; all we can do is wait for the connections to be closed gracefully by the client.
The request flow from LB to Pod is: LB -> NodePort -> Service -> Pod. After the pod is removed from the Service, no new connections can be sent to the Pod anymore.
The events happen in the following order:
The node will be removed from the external LB if using
As the tidb-server closes the listener immediately after receiving termination signals (including the graceful QUIT), does it help to wait a few seconds before sending signals to the tidb-server process? This gives the endpoints controller (and the LB operator) time to react.
For a Service that is …, this is not guaranteed to happen before stopping tidb-server. But we can sleep a few seconds before sending the signal. Therefore, almost no connections will fail during the rolling update.
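A minimal sketch of that idea as a container preStop hook, assuming tidb-server runs as PID 1 in the container; the 10-second delay and the shell wrapper are illustrative, not what this PR hardcodes:

```yaml
# Sketch of a preStop hook on the tidb container (durations are illustrative).
lifecycle:
  preStop:
    exec:
      command:
        - "/bin/sh"
        - "-c"
        # Sleep first so the endpoints controller / external LB has time to
        # stop routing new connections, then ask tidb-server (PID 1) to shut
        # down gracefully with QUIT.
        - "sleep 10 && kill -QUIT 1"
```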
pkg/apis/pingcap/v1alpha1/types.go
// events. For the PostStart and PreStop lifecycle handlers, management of the container blocks
// until the action is complete, unless the container process fails, in which case the handler is aborted.
// +optional
Lifecycle *corev1.Lifecycle `json:"lifecycle,omitempty"`
How about moving these two fields to the ComponentSpec
struct? Though it may not normally be necessary to configure these fields for other components, it would be strange for users that only TiDB has these pod fields.
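If the two fields do move into ComponentSpec, a hypothetical manifest fragment could set them for any component, for example PD; this is illustrative only, since this PR only wires the fields up in the TiDB controller:

```yaml
# Hypothetical fragment, assuming the ComponentSpec fields are honored for PD
# (not implemented for non-TiDB components in this PR).
spec:
  pd:
    terminationGracePeriodSeconds: 300          # illustrative value
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]  # illustrative action
```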
good idea
if tc.Spec.TiDB.TerminationGracePeriodSeconds != nil {
	tidbSet.Spec.Template.Spec.TerminationGracePeriodSeconds = tc.Spec.TiDB.TerminationGracePeriodSeconds
}
didn't update other components' controller logic
implement in a separate PR?
ok, #2847
I suggest changing the release note to a more user-facing one: stopping TiDB gracefully is the main goal, but it's not the actual change in this PR; any action can be injected to achieve different purposes.
LGTM
LGTM
/merge
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
cherry pick to release-1.1 in PR #2848
* graceful stop tidb
* address comment
* add v1.1.2 release notes
* Apply suggestions from code review
  Co-authored-by: DanielZhangQD <36026334+DanielZhangQD@users.noreply.github.com>
  Co-authored-by: Ran <huangran@pingcap.com>
* Apply suggestions from code review
  Co-authored-by: Ran <huangran@pingcap.com>
* add release notes of #2810
  Co-authored-by: DanielZhangQD <36026334+DanielZhangQD@users.noreply.github.com>
  Co-authored-by: Ran <huangran@pingcap.com>
What problem does this PR solve?
This PR adds `terminationGracePeriodSeconds` and `lifecycle` to the `TidbCluster` `spec.tidb` attribute. Users can use these two attributes to stop the TiDB server gracefully:

1. Set `spec.tidb.terminationGracePeriodSeconds` to a proper value, for example `600` (10 minutes).
2. Set `spec.tidb.lifecycle` to send `kill -QUIT 1`.

The TiDB server will then shut down gracefully, waiting for all clients to close their connections, until Kubernetes force-kills the TiDB container (10 minutes later).
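Put together, a TidbCluster manifest using the two new attributes might look like the sketch below; the cluster name and the shell wrapper around the QUIT signal are illustrative assumptions:

```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic                              # illustrative name
spec:
  tidb:
    # Give TiDB up to 10 minutes to drain before Kubernetes force-kills it.
    terminationGracePeriodSeconds: 600
    # Ask tidb-server (PID 1) to enter graceful shutdown when the Pod stops.
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "kill -QUIT 1"]
```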
fixes: #2597
What is changed and how does it work?
Check List
Tests
Code changes
Side effects
Related changes
Does this PR introduce a user-facing change?: