Skip to content

Commit

Permalink
fixes pg replica error: "canceling statement due to conflict with rec…
Browse files Browse the repository at this point in the history
…overy"

A customer observed less frequent but constant errors returned by SpiceDB when using Postgres strict replicas mode.

When googling the error, it typically points to a conflict when applying WAL changes to the replicas that conflicted with in-flight queries.

Two parameters could be tweaked to reduce the likelihood:
max_standby_archive_delay max_standby_streaming_delay.
Postgres docs also describe a bit what is going on in
https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT

A relevant part from the docs:

>On the primary server, these cases simply result in waiting;
>and the user might choose to cancel either of the conflicting actions.
>However, on the standby there is no choice: the WAL-logged action
>already occurred on the primary so the standby must not fail to apply
>it. Furthermore, allowing WAL application to wait indefinitely may be
>very undesirable, because the standby's state will become increasingly
>far behind the primary's. Therefore, a mechanism is provided to forcibly
>cancel standby queries that conflict with to-be-applied WAL records.

These are similar to serialization errors; in fact, we observed pgx reporting them as SQL error code 40001.

There are at least two strategies we could take:
- retry the request
- redirect the request to the primary

I don't believe retrying is a good idea here. If I understood correctly, Postgres gives you those flags as a grace period for a query to complete before being forcefully canceled. I suspect retrying could have an undesirable side-effect: it increases the odds that WAL changes are delayed before application, which in turn increases lag and would make SpiceDB fall back more often to the primary.

Therefore, to avoid adding pressure that could increase the lag, I suggest not retrying and directly falling back to the primary.
  • Loading branch information
vroldanbet committed Feb 4, 2025
1 parent f5c4424 commit 654e634
Show file tree
Hide file tree
Showing 3 changed files with 109 additions and 6 deletions.
2 changes: 1 addition & 1 deletion go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@ require (
go.opentelemetry.io/otel v1.33.0
go.opentelemetry.io/otel/sdk v1.33.0
go.opentelemetry.io/otel/trace v1.33.0
go.uber.org/atomic v1.11.0
go.uber.org/atomic v1.11.0 // indirect
go.uber.org/goleak v1.3.0
golang.org/x/exp v0.0.0-20240909161429-701f63a606c0
golang.org/x/mod v0.22.0
Expand Down
17 changes: 12 additions & 5 deletions internal/datastore/postgres/strictreader.go
Original file line number Diff line number Diff line change
Expand Up @@ -43,15 +43,22 @@ func (srqf strictReaderQueryFuncs) rewriteError(err error) error {
return nil
}

if srqf.isReplicationLagError(err) {
return common.NewRevisionUnavailableError(fmt.Errorf("revision %s is not available on the replica", srqf.revision.String()))
}

return err
}

func (srqf strictReaderQueryFuncs) isReplicationLagError(err error) bool {
var pgerr *pgconn.PgError
if errors.As(err, &pgerr) {
if (pgerr.Code == pgInvalidArgument && strings.Contains(pgerr.Message, "is in the future")) ||
strings.Contains(pgerr.Message, "replica missing revision") {
return common.NewRevisionUnavailableError(fmt.Errorf("revision %s is not available on the replica", srqf.revision.String()))
}
return (pgerr.Code == pgInvalidArgument && strings.Contains(pgerr.Message, "is in the future")) ||
strings.Contains(pgerr.Message, "replica missing revision") ||
pgxcommon.IsSerializationError(err)
}

return err
return false
}

func (srqf strictReaderQueryFuncs) addAssertToSelectSQL(sql string) string {
Expand Down
96 changes: 96 additions & 0 deletions internal/datastore/postgres/strictreader_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
package postgres

Check failure on line 1 in internal/datastore/postgres/strictreader_test.go

View workflow job for this annotation

GitHub Actions / Lint Go

Please run go run mage.go lint:go. diff --git a/internal/datastore/postgres/strictreader_test.go b/internal/datastore/postgres/strictreader_test.go index 82a7334..7cdb16f 100644 --- a/internal/datastore/postgres/strictreader_test.go +++ b/internal/datastore/postgres/strictreader_test.go @@ -3,12 +3,14 @@ package postgres import ( "context" "fmt" - "github.com/jackc/pgx/v5" "testing" - "github.com/authzed/spicedb/internal/datastore/common" + "github.com/jackc/pgx/v5" + "github.com/jackc/pgx/v5/pgconn" "github.com/stretchr/testify/require" + + "github.com/authzed/spicedb/internal/datastore/common" ) type mockQuerier struct { @@ -65,7 +67,6 @@ func TestStrictReaderDetectsLagErrors(t *testing.T) { for _, tt := range cases { t.Run(tt.name, func(t *testing.T) { - received := tt.in expected := tt.want

import (
"context"
"fmt"
"github.com/jackc/pgx/v5"
"testing"

"github.com/authzed/spicedb/internal/datastore/common"
"github.com/jackc/pgx/v5/pgconn"
"github.com/stretchr/testify/require"
)

type mockQuerier struct {
err error
}

func (mc mockQuerier) QueryFunc(ctx context.Context, rowsFunc func(ctx context.Context, rows pgx.Rows) error, sql string, optionsAndArgs ...any) error {
return mc.err
}

func (mc mockQuerier) QueryRowFunc(ctx context.Context, rowFunc func(ctx context.Context, row pgx.Row) error, sql string, optionsAndArgs ...any) error {
return mc.err
}

func (mc mockQuerier) ExecFunc(_ context.Context, _ func(ctx context.Context, tag pgconn.CommandTag, err error) error, _ string, _ ...interface{}) error {
return mc.err
}

func TestStrictReaderDetectsLagErrors(t *testing.T) {
mc := mockQuerier{}
reader := strictReaderQueryFuncs{
wrapped: mc,
}

cases := []struct {
name string
in error
want error
}{
{
name: "no error is bubbledtest",
},
{
name: "missing revision on replica - invalid argument",
in: &pgconn.PgError{Code: pgInvalidArgument, Message: "is in the future"},
want: common.RevisionUnavailableError{},
},
{
name: "missing revision on replica",
in: &pgconn.PgError{Message: "replica missing revision"},
want: common.RevisionUnavailableError{},
},
{
name: "serialization error due to conflicting WAL changes on replica",
in: &pgconn.PgError{Code: "40001"},
want: common.RevisionUnavailableError{},
},
{
name: "bubbles up unrelated error",
in: fmt.Errorf("unrelated error"),
want: fmt.Errorf("unrelated error"),
},
}

for _, tt := range cases {
t.Run(tt.name, func(t *testing.T) {

received := tt.in
expected := tt.want

mc.err = received
reader.wrapped = mc
err := reader.ExecFunc(context.Background(), nil, "SELECT 1")
if expected != nil {
require.ErrorAs(t, err, &expected)
} else {
require.NoError(t, err)
}

err = reader.QueryFunc(context.Background(), nil, "SELECT 1")
if expected != nil {
require.ErrorAs(t, err, &expected)
} else {
require.NoError(t, err)
}

err = reader.QueryRowFunc(context.Background(), nil, "SELECT 1")
if expected != nil {
require.ErrorAs(t, err, &expected)
} else {
require.NoError(t, err)
}
})
}
}

0 comments on commit 654e634

Please sign in to comment.