Limit voting members to those residing within the primary region (#193)

* Limit voting members to those residing in the primary region

* Require in-region quorum

* Update docs to reflect changes

* Typo
davissp14 authored Apr 25, 2023
1 parent 215ed8b commit adb02b7
Showing 4 changed files with 8 additions and 11 deletions.
4 changes: 1 addition & 3 deletions README.md
@@ -15,9 +15,7 @@ fly pg create --name <app-name> --initial-cluster-size 3 --region ord --flex
```

## High Availability
-For HA, it's recommended that you run at least 3 members.
-
-Automatic failovers will only consider members residing within your primary region. The primary region is represented as an environment variable defined within the `fly.toml` file. That being said, if you're running a 3 member setup at least 2 of your members should reside within your primary region.
+For HA, it's recommended that you run at least 3 members within your primary region. Automatic failovers will only consider members residing within your primary region. The primary region is represented as an environment variable defined within the `fly.toml` file.

## Horizontal scaling
Use the clone command to scale up your cluster.
10 changes: 5 additions & 5 deletions docs/fencing.md
@@ -1,7 +1,7 @@
# Fencing

## How do we verify the real primary?
-We start out evaluating the cluster state by checking each registered standby for connectivity and asking who their primary is.
+We start out by evaluating the cluster state by checking each registered standby within the primary region for connectivity and asking who their primary is.

The "clusters state" is represented across a few different dimensions:

@@ -24,7 +24,7 @@ map[string]int{
}
```

-The real primary is resolvable so long as the majority of members can agree on who it is. Quorum being defined as `total_members / 2 + 1`.
+The real primary is resolvable so long as the majority of members can agree on who it is. Quorum being defined as `total_members_in_region / 2 + 1`.

**Note: When the primary being evaluated meets quorum, it will still be fenced in the event a conflict is found. This is to protect against a possible race condition where an old primary comes back up in the middle of an active failover.**
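
To make the quorum rule and the note above concrete, here is a minimal Go sketch. Only the formula `total_members_in_region / 2 + 1` comes from the document; the function names `quorum` and `shouldFence` and their parameters are hypothetical and are not taken from the repository. Whether the member count includes the primary itself is not stated in this excerpt, so the sketch leaves that to the caller.

```go
package main

import "fmt"

// quorum applies the documented formula total_members_in_region / 2 + 1.
// Go's integer division truncates, so 3 members need 2 votes and 4 need 3.
func quorum(totalMembersInRegion int) int {
	return totalMembersInRegion/2 + 1
}

// shouldFence is a hypothetical helper (an assumption, not repository code).
// It mirrors the note above: a primary is fenced when any conflict is
// reported, even if it still has quorum, and also when quorum is lost.
func shouldFence(agreeing, totalMembersInRegion, conflicts int) bool {
	if conflicts > 0 {
		return true
	}
	return agreeing < quorum(totalMembersInRegion)
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members in region: %d, quorum: %d\n", n, quorum(n))
	}
	fmt.Println(shouldFence(2, 3, 0)) // quorum met, no conflict
	fmt.Println(shouldFence(2, 3, 1)) // quorum met, but a conflict was reported
	fmt.Println(shouldFence(1, 3, 0)) // quorum lost
}
```

Because the division truncates, an even-sized region needs a strict majority (4 members need 3 votes), which is why an odd member count is the usual choice.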

@@ -45,11 +45,11 @@ The cluster will be made read-only and the `zombie.lock` file will be created wi

## Monitoring cluster state

-In order to mitigate possible split-brain scenarios, it's important that cluster state is evaluated regularly and when specific events/actions take place.
+In order to mitigate possible split-brain scenarios, it's important that cluster state is evaluated regularly and when specific events/actions take place.

### On boot
This is to ensure the booting primary is not a primary coming back from the dead.

### During standby connect/reconnect/disconnect events
There are a myriad of reasons why a standby might disconnect, but we have to assume the possibility of a network partition. In either case, if quorum is lost, the primary will be fenced.

@@ -60,7 +60,7 @@ Cluster state is monitored in the background at regular intervals. This acts as
## Split-brain detection window
**This pertains to v0.0.36+**

-When a network partition is initiated, the following steps are performed:
+When a network partition is initiated, the following steps are performed:

1. Repmgr will attempt to ping registered members with a 5s connect timeout.
2. Repmgr will wait up to 30 seconds for the standby to reconnect before issuing a `child_node_disconnect` event.
2 changes: 1 addition & 1 deletion internal/flypg/repmgr.go
@@ -388,7 +388,7 @@ func (r *RepMgr) VotingMembers(ctx context.Context, conn *pgx.Conn) ([]Member, e

     var voters []Member
     for _, member := range members {
-        if member.Role == StandbyRoleName || member.Role == WitnessRoleName {
+        if (member.Role == StandbyRoleName || member.Role == WitnessRoleName) && member.Region == r.PrimaryRegion {
             voters = append(voters, member)
         }
     }
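
The effect of the changed condition can be seen in isolation with a small, self-contained sketch. The `Member` struct below is a simplified stand-in, the role constant values are assumed, and `votingMembers` is a hypothetical free function. The repository's method reads members through a `pgx` connection, which is omitted here.

```go
package main

import "fmt"

// Member is a simplified stand-in for the repository's Member type;
// only the fields the filter reads are modeled here.
type Member struct {
	Hostname string
	Role     string
	Region   string
}

// Assumed values for illustration; the real constants live in the repository.
const (
	PrimaryRoleName = "primary"
	StandbyRoleName = "standby"
	WitnessRoleName = "witness"
)

// votingMembers keeps standbys and witnesses that reside in the primary
// region, matching the condition introduced by this commit.
func votingMembers(members []Member, primaryRegion string) []Member {
	var voters []Member
	for _, m := range members {
		isVoterRole := m.Role == StandbyRoleName || m.Role == WitnessRoleName
		if isVoterRole && m.Region == primaryRegion {
			voters = append(voters, m)
		}
	}
	return voters
}

func main() {
	members := []Member{
		{Hostname: "pg-1", Role: PrimaryRoleName, Region: "ord"},
		{Hostname: "pg-2", Role: StandbyRoleName, Region: "ord"},
		{Hostname: "pg-3", Role: StandbyRoleName, Region: "lax"},
		{Hostname: "pg-4", Role: WitnessRoleName, Region: "ord"},
	}
	for _, v := range votingMembers(members, "ord") {
		fmt.Println(v.Hostname, v.Role, v.Region)
	}
}
```

In this example the standby in `lax` is excluded, so only the in-region standby and witness count toward quorum.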
3 changes: 1 addition & 2 deletions internal/flypg/zombie.go
@@ -122,7 +122,6 @@ func TakeDNASample(ctx context.Context, node *Node, standbys []Member) (*DNASamp
sample.totalConflicts++
sample.conflictMap[primary.Hostname]++
}
-
}

return sample, nil
@@ -182,7 +181,7 @@ func Quarantine(ctx context.Context, n *Node, primary string) error {
}

 func DNASampleString(s *DNASample) string {
-    return fmt.Sprintf("Registered members: %d, Active member(s): %d, Inactive member(s): %d, Conflicts detected: %d",
+    return fmt.Sprintf("Voting member(s): %d, Active: %d, Inactive: %d, Conflicts: %d",
         s.totalMembers,
         s.totalActive,
         s.totalInactive,