Fencing a failed primary #49
Conversation
pkg/flypg/node.go
Outdated
```go
		if err := n.PGBouncer.ConfigurePrimary(ctx, primary, true); err != nil {
			return fmt.Errorf("failed to reconfigure pgbouncer: %s", err)
		}
	}
	// Create a zombie.lock file containing the resolved primary.
	// This will be an empty string if we are unable to resolve the real primary.
	if err := writeZombieLock(primary); err != nil {
		return fmt.Errorf("failed to set zombie lock: %s", err)
	}

	fmt.Println("Setting all existing tables to read-only")
	if err := admin.SetReadOnly(ctx, conn); err != nil {
		return fmt.Errorf("failed to set read-only: %s", err)
	}
```
I think we shouldn't return if we can't reconfigure pgbouncer here, since we should still set existing tables to read-only.
Yeah, I'll need to re-evaluate the returns a bit more. Good call.
```go
	"testing"
)

func TestZombieDiagnosis(t *testing.T) {
```
this is excellent. this makes it easier for me to continue to threaten to make real integration tests :)
```go
mConn, err := repmgr.NewRemoteConnection(ctx, standby.Hostname)
if err != nil {
	fmt.Printf("failed to connect to %s", standby.Hostname)
	totalInactive++
	continue
}
```
Should these `mConn` connections be closed? It doesn't look like they're used after passing to `PrimaryMember()`.
Yes they should!
pkg/flypg/node.go
Outdated
```go
if err != nil {
	if errors.Is(err, ErrZombieDiscovered) {
		fmt.Println("Unable to confirm we are the real primary!")
		fmt.Printf("Registered members: %d, Active member(s): %d, Inactive member(s): %d, Conflicts detected: %d\n",
			totalMembers,
			totalActive,
			totalInactive,
			totalConflicts,
		)

		fmt.Println("Identifying ourself as a Zombie")

		// If primary is non-empty we were able to build a consensus on who the real primary is
		// and should be recoverable on reboot.
		if primary != "" {
			fmt.Printf("Majority of members agree that %s is the real primary\n", primary)
			fmt.Println("Reconfiguring PGBouncer to point to the real primary")
			if err := n.PGBouncer.ConfigurePrimary(ctx, primary, true); err != nil {
				return fmt.Errorf("failed to reconfigure pgbouncer: %s", err)
			}
		}
		// Create a zombie.lock file containing the resolved primary.
		// This will be an empty string if we are unable to resolve the real primary.
		if err := writeZombieLock(primary); err != nil {
			return fmt.Errorf("failed to set zombie lock: %s", err)
		}

		fmt.Println("Setting all existing tables to read-only")
		if err := admin.SetReadOnly(ctx, conn); err != nil {
			return fmt.Errorf("failed to set read-only: %s", err)
		}

		return fmt.Errorf("zombie primary detected. Use `fly machines restart <machine-id>` to rejoin the cluster or consider removing this node")
	}

	return fmt.Errorf("failed to run zombie diagnosis: %s", err)
}
```
You can reduce nesting a bit by moving the `errors.Is()` outside the outer `if`:

```go
if errors.Is(err, ErrZombieDiscovered) {
	...
} else if err != nil {
	...
}
```
pkg/flypg/node.go
Outdated
```diff
@@ -322,6 +446,11 @@ func (n *Node) configure(store *state.Store) error {
 		fmt.Println(err.Error())
 	}

+	// Clear target and wait for primary resolution
+	if err := n.PGBouncer.ConfigurePrimary(ctx, "", false); err != nil {
+		fmt.Println(err.Error())
```
Does this not need to be bubbled up?
pkg/flypg/zombie.go
Outdated
```go
func writeZombieLock(hostname string) error {
	if err := ioutil.WriteFile("/data/zombie.lock", []byte(hostname), 0644); err != nil {
		return err
	}

	return nil
}
```
Minor nit. `ioutil` is deprecated. You can use `io.WriteFile()` instead.
Oh, right. Are you talking about `os.WriteFile`?
pkg/flypg/zombie_test.go
Outdated
```go
t.Logf("test case: %d failed. Wasn't expecting to be a Zombie", i)
t.Fail()
```
These can be combined into a `t.Fatalf()`.
pkg/flypg/zombie_test.go
Outdated
```go
	},
}

for i, c := range tests.Cases {
```
I'm not personally a fan of table tests for anything but really simple tests. I find they get really complicated as you start adding edge cases. I'd probably rework these into subtests using `t.Run()`.

Although, if you want to keep the table tests, you can still wrap each of these in a `t.Run()` and then you can filter on them with `go test -run=XXX`.
```diff
@@ -322,6 +445,11 @@ func (n *Node) configure(store *state.Store) error {
 		fmt.Println(err.Error())
 	}
```
here too I suppose
This seems to work well overall. However, I was able to discover a race condition that can happen in the event the failed primary comes back up in the middle of a failover.

Example: 3 member setup

Node A: Primary
Node A - Fails

This being the case, I think we need to take a slightly more conservative approach and fence the primary in the event any primary conflicts exist, even when quorum is met.

UPDATE
This PR aims to ensure that a booting primary that has diverged from the cluster is properly isolated.
Note: There's still some cleanup to do, but the core logic is there.
What this doesn't solve
Split-brains as a result of a network partition.
How do we verify the real primary?
We start out evaluating the cluster state by checking each registered standby for connectivity and asking who their primary is.
The "cluster's state" is represented across a few different dimensions:
Total members
Number of registered members, including the primary.

Total active members
Number of members that are responsive. This includes the primary we are evaluating, so this will never be less than one.

Total inactive members
Number of registered members that are non-responsive.

Conflict map
The conflict map is a `map[string]int` that tracks the conflicting primaries reported by our standbys and the number of times each was referenced. As an example, say we have a 3 member cluster and both of the standbys indicate their registered primary does not match. This will be recorded as:
The real primary is resolvable so long as the majority of members can agree on who it is, with quorum defined as `total_members / 2 + 1`.

There is one exception to note here: even when self meets quorum, we will still fence ourselves in the event a conflict is found. This is to protect against a possible race condition if we are coming up during an active failover.
Tests can be found here: https://github.com/fly-apps/postgres-flex/pull/49/files#diff-3d71960ff7855f775cb257a74643d67d2636b354c9d485d10c2ded2426a7f362
What if the real primary can't be resolved or doesn't match the booting primary?
In both of these instances the primary member will be fenced.
If the real primary is resolvable

The cluster will be made read-only, PGBouncer will be reconfigured to target the "real" primary, and that primary's IP address will be written to a `zombie.lock` file. The PGBouncer reconfiguration ensures that any connections hitting this member are routed to the real primary in order to minimize interruptions. Once completed, we panic to force a full member restart. When the member is restarted, we read the IP address from the `zombie.lock` file and use it to attempt to rejoin the cluster we diverged from. If we are successful, the `zombie.lock` file is cleared and we boot as a standby.

Note: We will not attempt to rejoin a cluster if the resolved primary resides in a region that differs from the `PRIMARY_REGION` environment variable set on self. The `PRIMARY_REGION` will need to be updated before a rejoin will be attempted.

If the real primary is NOT resolvable
The cluster will be made read-only, PGBouncer will remain disabled, and a `zombie.lock` file will be created without a value. When the member reboots, we will read the `zombie.lock` file and see that it's empty. This indicates that we've entered a failure mode that can't be recovered automatically. This could be an issue where previously deleted members were not properly unregistered, or where the primary's state has diverged to a point where its registered members have been cycled out.

How to address these situations will be a part of future documentation.