Bug Report: [`vtgate`] `RemoveTablet` during `TabletExternallyReparented` event can lead to healthcheck corruption #16373
Overview of the Issue
During external reparent events, a shard can be in a state where two `vttablet` processes are running as `PRIMARY` tablets. `vtgate`s consult a so-called `PrimaryTermStartTimestamp` to break ties in this situation, and will exclusively prefer the `PRIMARY` tablet with the highest `PrimaryTermStartTimestamp` to send queries to.

The duration for which multiple `vttablet` processes can be seen in the `PRIMARY` role is influenced by the `--shutdown_grace_period` flag, which until version 20.0 was `0`, meaning potentially "indefinite" (but in practice it is usually limited by the transaction timeout).

If, during the time that a `vtgate` is seeing two `PRIMARY` tablets running, a tablet deletion is recognized (this can be any other tablet in the shard), the `vtgate` healthcheck would go into an invalid state where both the demoted and promoted primary tablets are seen as valid targets for `@primary` queries. This could lead to queries being silently sent to the wrong tablet (and silently being retried, so most of the time this error state was completely invisible).

But we've also seen cases where `vtgate` processes ended up trying to send queries to the demoted primary exclusively, causing all DML queries processed by these `vtgate`s to fail.

Once in this invalid state, I don't think the `vtgate` can leave it until the affected tablet is deleted from the topology. Restarting the `vttablet` process on the demoted primary would cause it to start back up as a `REPLICA`, but it would still be seen as a valid candidate for `@primary` queries. Restarting / replacing an affected `vtgate` process would be another option to get the affected `vtgate` processes into a healthy state again.

Reproduction Steps
N/A
Binary Version
Operating System and Environment details
Log Fragments