Epic: stabilize physical replication #6211

vadim2404 · 2023-12-21T10:54:21Z

Summary

Original issue we hit was

page server returned error: tried to request a page version that was garbage collected. requested at C/1E923DE0 gc cutoff C/23B3DF00

but then the scope grew up quickly. This is the Epic to track main physical replication work

Tasks

Give feedback

deal with running_xacts to hot standby replica #7236

c/compute t/bug
Remove primary_is_running #8162
Shutdown records, or do we need to do a checkpoint at compute shutdown?
https://github.com/neondatabase/cloud/issues/14903
https://github.com/neondatabase/cloud/issues/10800
Internal replica health monitoring #8249

c/compute
https://github.com/neondatabase/cloud/issues/16671
Retrospective-RFC / document about how physical replication works #8322
Bug: walreceiver did not restart after erroring out #8172

c/compute t/bug
Investigate endpoints with >30s lag
hot_standby_feedback -- no consensus, but can be a reason of replica lag -- we keep it self-server, user can set it via API
Hot standby GUCs missmatch problem #9023
Do not copy logical replicaiton slots to replica #9458
https://github.com/neondatabase/cloud/issues/19655
max_standby_archive_delay / max_standby_streaming_delay -- https://neondb.slack.com/archives/C07AY3FB45Q
Force-restart if replica signifficantly lagged behind the primary
Do not wait running-xacts record from primary if checkoint doesn't contain oldestActiveXid #8484
Options

Follow-ups:

AccessExclusiveLocks not acquired in replica #8325

Related Epics:

https://github.com/neondatabase/cloud/issues/13390

The text was updated successfully, but these errors were encountered:

knizhnik · 2023-12-22T14:28:48Z

I am thinking now how it can be done.

Replica receives WAL from safekeeper.
Master compute knows nothing about presence of replica - there is no replication slot at master.
Replica can be arbitrary lagged, suspended, ... It may not access nether SK, neither SK for arbitrary long time
There are also no replication slots at SK, so SK has no knowladge about all existed replicas and their WAL poisition.

So what can we do?

We can create replication slot at master. This slot will be persisteted using AUX_KEY mechanism (right now it works only for logical slots, but it an be changed) and applying this wal record PS will know about position of replica. It is not clear who will advance this slot, if replication is performed from SK. In principle. SK can send this position in some feedback message to PS. But looks pretty ugly.
SK should explicitly notify PS about current position of all replicas. Not so obvious how to report this position to PS which now just receives WAL stream from SK. Should it be some special message in SK<->PS protocol? Or should SK generate WAL record with replica position (not clear which LSN this record should be assigned to be included in stream of existed WAL records). As it was mentioned above SK has no information about all replicas, so lack of such message doesn;t mean that there is no replica wityh eom old LSN.
Replica should notify PS itself (by means of some special message). The problem is that replica can be offline and do not send any requests to PS.
In addition to PITR we can also have max_replica_lag parameter. If replica exceeds this value, then it is disabled.

kelvich · 2024-01-02T06:43:23Z

https://neondb.slack.com/archives/C04DGM6SMTM/p1703154820552359

kelvich · 2024-01-02T06:59:43Z

So basically we need to delay PITR for some amount of time for lagging replicas when they are enabled.

Replica should notify PS itself (by means of some special message). The problem is that replica can be offline and do not send any requests to PS.

That could be done with time lease. Replica sends message each 10 minutes, when pageserver don't receive 3 messages in a row it considers replica to be disabled.

SK should explicitly notify PS about current position of all replicas. Not so obvious how to report this position to PS which now just receives WAL stream from SK. Should it be some special message in SK<->PS protocol? Or should SK generate WAL record with replica position (not clear which LSN this record should be assigned to be included in stream of existed WAL records). As it was mentioned above SK has no information about all replicas, so lack of such message doesn;t mean that there is no replica wityh eom old LSN.

Won't usual feedback message help? IIRC we already have it for backpressure, also pageserver knows that LSN's via storage broker.

knizhnik · 2024-01-02T07:01:29Z

PiTR is enforced at PS and information about replica flush/apply position is avaiable only at SK. The problem is that PS can be connected to one SK1, and replica - some some other SK2. The only components which knows about all SKs are compute and broker. But compute may we inactive (suspended) at the moment when GC is performed by PS. And involving broker in the process of garbage collection on PS seems to be overkill. Certainly SK can somehow interact with each other or through wal proposer. But it also seems to be too complicated and fragile.

kelvich · 2024-01-02T11:51:09Z

PiTR is enforced at PS and information about replica flush/apply position is avaiable only at SK. The problem is that PS can be connected to one SK1, and replica - some some other SK2. The only components which knows about all SKs are compute and broker. But compute may we inactive (suspended) at the moment when GC is performed by PS. And involving broker in the process of garbage collection on PS seems to be overkill. Certainly SK can somehow interact with each other or through wal proposer. But it also seems to be too complicated and fragile.

Through broker pageserver has information about LSNs on all safekeepers. That is how pageserver decides which one to connect to. So safekeeper can advertise min feedback lsn out of all replicas connected to it (if any).

Also, most likely, we should use information from broker when deciding which safekeeper to connect to on replica. @arssher what do you think?

arssher · 2024-01-02T12:44:02Z

Through broker pageserver has information about LSNs on all safekeepers. That is how pageserver decides which one to connect to. So safekeeper can advertise min feedback lsn out of all replicas connected to it (if any).'

Yes, this seems to be the easiest way.

Also, most likely, we should use information from broker when deciding which safekeeper to connect to on replica. @arssher what do you think?

Not necessarily. Replica here is different from pageserver because it costs something, so we're ok to keep the standby -> safekeeper connection all the time as long as standby as alive, which means standby can be initiator of the connection. So what we do currently is just wire all safekeepers into primary_conninfo; if some is down, libpq will try another etc. If set of safekeepers changes we need to update the setting, but this is not hard (though this is not automated yet).

With pageserver we can't do similar because we don't want to keep live connections from all existing attached timelines, and safekeeper learns about new data first, so it should be initiator of the connection. Usage of broker gives another advantage: pageserver concurrently can have active connection and at the same time up to date info about other safekeeper positions, so can choose better where to connect in complicated scenarios like when connection to current sk is good, but it is very slow for whatever reason. But similar heuristics though less powerful can be implemented without broker data (e.g. restart connection if no new data arrives within some period).

Also using broker on standby likely would be quite untrivial because it is grpc, I'm not even sure C grpc library exists. So looks like a significant work without much gain.

arssher · 2024-01-02T12:47:27Z

On a related note, I'm also very suspicious that original issue is caused by this -- "doubt that.replica lags for 7 days" -- me too. Looking at metrics to understand standby position would be very useful, but likely pg_last_wal_replay_lsn is not collected :(

knizhnik · 2024-01-02T13:04:35Z

Ok, so to summarise all above:

Information about replica apply position can be obtained by PS from broker (still not quite clear to me how frequent this information is updated)
The problem most likely is not caused by replication lag, but by some bug in tracking VM updates either at compute, either at PS side. As far as the problem is reproduced only on replica, then most likely it is bug in compute, particularly in performing redo in compute. PS knows nothing if get_page request comes from master or replica, so unlikely the problem is here. But there is one important difference: master does get_page request with latest option (takes latest LSN), while replica uses latest=false.

knizhnik · 2024-01-02T13:54:28Z

One of the problems with requesting information about replica position from broker is that it is available only as far as replica is connected to one of SK. But if it is suspended, then this information is not available. As far as I understand only control plane has information. about all replicas. But it is not desirable to:

to involve control plane in GC process
block GC until all replicas are online
remember current state of all replicas in some shared storage

vadim2404 · 2024-01-02T17:00:14Z

under investigation (most probably slip to the next week)

arssher · 2024-01-02T18:25:23Z

One of the problems with requesting information about replica position from broker is that it is available only as far as replica is connected to one of SK.

Yes, but as Stas wrote somewhere it's mostly ok to keep data only as long as replica is around. Newly seeded replica shouldn't lag significantly. Well, there is probably also standby pinned to LSN, but it can be addresses separately.

knizhnik · 2024-01-02T19:29:10Z

Newly seeded replica shouldn't lag significantly.

My concern is that replica can be suspended because of inactivity.
I wonder how we are protecting replica fro scale to zero now (if there are no active requests to replica).

vadim2404 · 2024-01-09T10:21:46Z

Recently, @arssher turned off the suspension for computes, which has logical replication subscribers.
a41c412

@knizhnik, you can adjust this part for RO endpoints. In compute_ctl the compute type (R/W or R/O) is known

vadim2404 · 2024-01-09T16:51:35Z

@knizhnik to check why replica requires to download the WAL.

kelvich · 2024-01-09T20:39:42Z

My concern is that replica can be suspended because of inactivity.
Do not suspend read-only replica if it applies some WAL within some time interval (i.e. 5 minutes). It can be checked using last_flush_lsn.
Periodically wakeup read-only node to make it possible to connect to master and get updates. Wakeup period should be several times larger than suspend interval (otherwise it has not sense to suspend replica at all). It may be also useful periodically wakeup not only read-only replicas, but any other suspended nodes. Such computes will have a chance to perform some bookkeeping work, i.e. autovacuum. I do not think that if node will be awaken once per hour for 5 minutes, then it can some significantly affect cost (for users).

Hm, how we did end up here? Replica should be suspended due to inactivity. New start will start with latest LSN, so not sure why replica suspend is relevant.

There are two open questions now:

why replica lags a lot, that shouldn't happen and that is the most pressing issue
how we delay GC in case of legitimately lagging replica. Approach with broker sounds reasonable (no replica == no need to hold GC). Control plane doesn't know about replica LSN and shouldn't know about them.

knizhnik · 2024-01-10T07:34:56Z

Sorry. my concerns about read-only replica suspension (when there are not active queries) seems to be irrelevant.
Unlike "standard" read-only replica in Vanilla Postgres, we do not need to replay all WAL when activating suspended replica. Page server should just create basebackup with most recent LSN for launching this replica. And I have tested that it is really done now in this way.

So lagged replica can not be caused by replica suspension. Quite opposite: suspend and restart of replica should cause replica to "catch up" with master. Large replication lag between master and replica should be caused by some other reasons. Actually I see only two reasons:

Replica apply WAL slowly than master is producing it. For example replica use less powerful VM than master.
There was some error with processing WAL at replica which stuck replication. I can be related with the problem recently fixed by @arssher (alignment of segments sent to replica on page boundary).

Are there links to the projects suffering for this problem? Can we include them in this ticket?

Concerning approach described above: take information about replica LSN from broker and use it to restrict PiTR boundary to prevent GC from removing layers which may be accessed by replica. There are two kind of LSNs maintained by SK: last committed LSN returned in the response to happen requests and triple of LSNs (write/flush/apply) included in hot-stanndby feedback and collected by SK as min from all subscribers (PS and replicas). I wonder of broker can provide now access to both of this LSNs. @arssher ?

arssher · 2024-01-10T09:50:43Z

I wonder of broker can provide now access to both of this

Not everything is published right now, but this is trivial to add, see LSNsSafekeeperTimelineInfo

vadim2404 · 2024-01-17T09:13:37Z

status update: in review

vadim2404 · 2024-01-23T17:00:22Z

to review it with @MMeent

vadim2404 · 2024-01-30T16:48:24Z

@arssher to review the PR

ItsWadams · 2024-01-30T21:54:40Z

Hey All - a customer just asked about this in an email thread with me about pricing. Are there any updates we can provide them?

vadim2404 · 2024-01-31T08:57:47Z

The problem was identified, and @knizhnik is working on fixing it.

But the fix requires time because it affects compute, safekeeper, and pageserver. I suppose in February, we will merge it and ship it.

YanicNeon · 2024-01-31T17:25:45Z

We got a support case about this problem today (ZD #2219)

Keeping an eye on this thread

acervantes23 · 2024-02-07T02:03:13Z

@knizhnik what's the latest status on this issue?

knizhnik · 2024-02-07T07:06:48Z

@knizhnik what's the latest status on this issue?

There is PR #6357 waiting for one more round of review.
There was also some problems with e2e tests: https://neondb.slack.com/archives/C03438W3FLZ/p1706868624273839
which are not yet resolved and where I need some help for somebody familiar with e2e tests.

It also creates a shutdown checkpoint, which is important for ROs to get a list of running xacts faster instead of going through the CLOG. See https://www.postgresql.org/docs/current/server-shutdown.html for the list of modes and signals. Related to #6211

## Problem We currently use 'immediate' mode in the most commonly used shutdown path, when the control plane calls a `compute_ctl` API to terminate Postgres inside compute without waiting for the actual pod / VM termination. Yet, 'immediate' shutdown doesn't create a shutdown checkpoint and ROs have bad times figuring out the list of running xacts during next start. ## Summary of changes Use 'fast' mode, which creates a shutdown checkpoint that is important for ROs to get a list of running xacts faster instead of going through the CLOG. On the control plane side, we poll this `compute_ctl` termination API for 10s, it should be enough as we don't really write any data at checkpoint time. If it times out, we anyway switch to the slow k8s-based termination. See https://www.postgresql.org/docs/current/server-shutdown.html for the list of modes and signals. The default VM shutdown hook already uses `fast` mode, see [1] [1] https://github.com/neondatabase/neon/blob/c9fd8d76937c2031fd4fea1cdf661d6cf4f00dc3/vm-image-spec.yaml#L30-L31 Related to #6211

ololobus · 2024-07-09T15:51:16Z

This week:

Heikki will work on RFC Retrospective-RFC / document about how physical replication works #8322
Alexey to check how we start PITC (we likely now create a branch + RO)
Review and merge: Add neon.running_xacts_overflow_policy to make it possible for RO replica to startup without primary even in case running xacts overflow #8323 Previously replica didn't restore running xacts, with this PR we will only affect a small number (presumably)
Tristan: confirm that replica lag is reported
Tristan: dashboard for replication lag and alert
@ololobus figure out some easy fix for avoiding the RO misconfiguration (see Stas comment in the issue), likely we will try to sync the RO GUCs with RW

save-buffer · 2024-07-09T17:23:32Z

Regarding hot standby feedback (one of the items in the original checklist), we recently allowed people to enable it in their pg_settings on the control plane side.

## Problem We currently use 'immediate' mode in the most commonly used shutdown path, when the control plane calls a `compute_ctl` API to terminate Postgres inside compute without waiting for the actual pod / VM termination. Yet, 'immediate' shutdown doesn't create a shutdown checkpoint and ROs have bad times figuring out the list of running xacts during next start. ## Summary of changes Use 'fast' mode, which creates a shutdown checkpoint that is important for ROs to get a list of running xacts faster instead of going through the CLOG. On the control plane side, we poll this `compute_ctl` termination API for 10s, it should be enough as we don't really write any data at checkpoint time. If it times out, we anyway switch to the slow k8s-based termination. See https://www.postgresql.org/docs/current/server-shutdown.html for the list of modes and signals. The default VM shutdown hook already uses `fast` mode, see [1] [1] https://github.com/neondatabase/neon/blob/c9fd8d76937c2031fd4fea1cdf661d6cf4f00dc3/vm-image-spec.yaml#L30-L31 Related to #6211

ololobus · 2024-07-15T17:53:34Z

Alexey to check how we start PITC (we likely now create a branch + RO)

TWIMC, I did that, and it's pretty complicated. There is a flow-chart here https://www.notion.so/neondatabase/Ephemeral-Endpoints-6388264bf28142e79d3b6f6bb6986fe8

TL;DR, it's currently a valid flow to create a normal RO on some branch without running RW. Basically, in proxy and cplane there is a generic mechanism to spin-up various ROs and RWs, it's not necessarily a PITC, it can be just 'at this branch HEAD'

For ephemeral endpoints, i.e. compute @LSN on some branch, we currently should create a temporary (ephemeral) branch and start RO on it. Yet, later it's going to be switched to fully static ROs

knizhnik · 2024-07-16T05:33:43Z

Yet, later it's going to be switched to fully static ROs

What actually you mean by "fully static ROs"?
Right now it is possible to start static replica by means of CLI, but not through UI.
It still requires branch. The main differences with with normal (hot-standby) replica are:

they do no have connection to primary (safekeeper)
they do not need to get information about running xacts at startup
As far as I understand it is not possible too through UI.

This "ephemeral endpoints" or "static replicas" still require separate Postgres instance (POD/VM) and separate timeline/task at PS. In principle, creating temporary branch for static replicas is not strictly needed. Its get_page@lsn requests can be served by PS for original timeline. But branch creation allows to pin particular LSN horizon and protect this data fro GC.
Also looks like having extra tokio task at PS is cheap, so there is no string motivation to avoid branch creation.

What IMHO will be really useful is to allow time travel without spawning of separate compute. In this case we can access different time slices in the same Postgres cluster. But it seems to be non-trivial because CLOG and other SLUs are now access locally and so it is hard to provide versioning for them.

ololobus · 2024-07-16T11:49:20Z

What actually you mean by "fully static ROs"?

I meant that we will start static computes pinned to specific LSN. Right now, it's turned off in cplane, so for some branch@LSN compute, we first need to create a temporary branch at this LSN and then start a 'normal' RO on it. IIRC, the problem was with the races with GC. Once we have leases, we can turn on static computes usage in cplane again

ololobus · 2024-07-16T15:45:42Z

This week:

Alexey: Check and release new compute image region by region
Alexey: Pass review and merge https://github.com/neondatabase/cloud/pull/15349
Tristan: add alerts and dashboard panels for Internal replica health monitoring #8249

ololobus · 2024-07-30T15:42:29Z

This week:

Tristan: deal with negative lag values, Fix negative replication delay metric #8520
Tristan: change lag dashboard to expose lag in seconds
Heikki: finish RO RFC Add retroactive RFC about physical replication #8546
Decide on Do not wait running-xacts record from primary if checkoint doesn't contain oldestActiveXid #8484
Alexey: check-in with storage team about leases

Heikki proposal for RO starts and pageserver GC races -- we can create a new 'ephemeral' branch + static endpoint

stepashka · 2024-07-31T08:08:19Z

once the lag metric looks good, please ping the DBaaS team, e.g. on #proj-observability-for-users about the metrics we can add to the UI? 🙏
cc @lpetkov @seymourisdead

ololobus · 2024-08-06T15:38:12Z

This week:

Tristan: change lag dashboard to expose lag in seconds
Heikki: finish RO RFC Add retroactive RFC about physical replication #8546

For #8484 we can postpone it. The most recent case https://neondb.slack.com/archives/C03H1K0PGKH/p1722631550388579

Side note for #8484: oldestActiveXid wasn't persisted on pageserver, now it's. Fast shutdown + check availability help over time. Also we have a clear recovery path -- start/restart RW. Currently it doesn't look like we need to rush with #8484

tristan957 · 2024-08-07T23:16:47Z

I have changed the dashboard to also expose lag in seconds.

ololobus · 2024-08-13T15:41:13Z

This week:

Heikki: finish RO RFC Add retroactive RFC about physical replication #8546
Tristan: add filters to dashboard with size and lag cutoff
Tristan: fitler also all logs with ERROR + replication
Alexey: propose Polina to add UI for hot_standby_feedback
Alexey: consider adding max_standby_archive_delay / max_standby_streaming_delay to allow list

Protocol version 2 has been the default for a while now, and we no longer have any computes running in production that used protocol version 1. This completes the migration by removing support for v1 in both the pageserver and the compute. See issue #6211.

ololobus · 2024-08-20T15:41:47Z

This week:

Add new receive/replay metrics Add compute_receive_lsn metric #8750
Add new panels to per endpoint dashboard
Investigate replication-related errors like ERROR: cannot advance replication slot to 0/7B56EA8, minimum is 0/84544E0

Protocol version 2 has been the default for a while now, and we no longer have any computes running in production that used protocol version 1. This completes the migration by removing support for v1 in both the pageserver and the compute. See issue #6211.

ololobus · 2024-09-17T15:36:30Z

This week:

Konstantin: Hot standby GUCs missmatch problem #9023
Tristan: add panel for 'conflict with replication' and close Internal replica health monitoring #8249

ololobus · 2024-09-24T15:45:43Z

This week:

Konstantin: GUCs missmatch Hot standby GUCs missmatch problem #9023
Alexey: add GUCs sync to cplane at get spec?

ololobus · 2024-10-15T15:36:46Z

This week:

@knizhnik and @MMeent reach some consensus on Hot standby GUCs missmatch problem #9023

ololobus · 2024-10-22T15:40:14Z

This week:

@knizhnik and @MMeent reach some consensus on Hot standby GUCs missmatch problem #9023 (@hlinnaka to facilitate)
Finalize and pass review Do not copy logical replicaiton slots to replica #9458

ololobus · 2024-10-29T16:38:04Z

This week:

@knizhnik finalize Hot standby GUCs missmatch problem #9023 (@hlinnaka to facilitate)
@knizhnik finalize and pass review Do not copy logical replicaiton slots to replica #9458 (now split into two PRs)

vadim2404 added t/bug Issue Type: Bug c/compute Component: compute, excluding postgres itself labels Dec 21, 2023

vadim2404 assigned knizhnik Dec 21, 2023

knizhnik mentioned this issue Jan 15, 2024

Propagate apply_lsn from SK to PS to prevent GC from collecting objects which may be still requested by replica #6357

Closed

5 tasks

knizhnik mentioned this issue Feb 16, 2024

Send LSN range in getpage request #6718

Closed

5 tasks

ololobus mentioned this issue Jul 5, 2024

compute_ctl: Use 'fast' shutdown for Postgres termination #8289

Merged

5 tasks

ololobus assigned hlinnaka Jul 9, 2024

stepashka changed the title ~~Epic: physical replication~~ Epic: stabilize physical replication Jul 30, 2024

ololobus mentioned this issue Aug 1, 2024

Add retroactive RFC about physical replication #8546

Merged

hlinnaka mentioned this issue Aug 20, 2024

Remove support for pageserver <-> compute protocol version 1 #8774

Merged

ololobus assigned tristan957 and unassigned hlinnaka Sep 17, 2024

ololobus unassigned tristan957 Oct 17, 2024

Epic: stabilize physical replication #6211

Epic: stabilize physical replication #6211

Comments

vadim2404 commented Dec 21, 2023 • edited by ololobus Loading

Summary

Tasks

knizhnik commented Dec 22, 2023

kelvich commented Jan 2, 2024

kelvich commented Jan 2, 2024

knizhnik commented Jan 2, 2024

kelvich commented Jan 2, 2024 • edited Loading

arssher commented Jan 2, 2024

arssher commented Jan 2, 2024

knizhnik commented Jan 2, 2024

knizhnik commented Jan 2, 2024

vadim2404 commented Jan 2, 2024

arssher commented Jan 2, 2024

knizhnik commented Jan 2, 2024

vadim2404 commented Jan 9, 2024

vadim2404 commented Jan 9, 2024

kelvich commented Jan 9, 2024

knizhnik commented Jan 10, 2024

arssher commented Jan 10, 2024

vadim2404 commented Jan 17, 2024

vadim2404 commented Jan 23, 2024

vadim2404 commented Jan 30, 2024

ItsWadams commented Jan 30, 2024

vadim2404 commented Jan 31, 2024

YanicNeon commented Jan 31, 2024

acervantes23 commented Feb 7, 2024

knizhnik commented Feb 7, 2024

ololobus commented Jul 9, 2024 • edited Loading

save-buffer commented Jul 9, 2024

ololobus commented Jul 15, 2024 • edited Loading

knizhnik commented Jul 16, 2024

ololobus commented Jul 16, 2024

ololobus commented Jul 16, 2024 • edited Loading

ololobus commented Jul 30, 2024 • edited Loading

stepashka commented Jul 31, 2024

ololobus commented Aug 6, 2024 • edited by tristan957 Loading

tristan957 commented Aug 7, 2024

ololobus commented Aug 13, 2024 • edited by tristan957 Loading

ololobus commented Aug 20, 2024 • edited Loading

ololobus commented Sep 17, 2024 • edited by tristan957 Loading

ololobus commented Sep 24, 2024

ololobus commented Oct 15, 2024

ololobus commented Oct 22, 2024

ololobus commented Oct 29, 2024

vadim2404 commented Dec 21, 2023 •

edited by ololobus

Loading

kelvich commented Jan 2, 2024 •

edited

Loading

ololobus commented Jul 9, 2024 •

edited

Loading

ololobus commented Jul 15, 2024 •

edited

Loading

ololobus commented Jul 16, 2024 •

edited

Loading

ololobus commented Jul 30, 2024 •

edited

Loading

ololobus commented Aug 6, 2024 •

edited by tristan957

Loading

ololobus commented Aug 13, 2024 •

edited by tristan957

Loading

ololobus commented Aug 20, 2024 •

edited

Loading

ololobus commented Sep 17, 2024 •

edited by tristan957

Loading