Epic: stabilize physical replication #6211
Comments
I am now thinking about how this can be done.
So what can we do?
|
So basically we need to delay PITR for some amount of time for lagging replicas when they are enabled.
That could be done with a time lease: the replica sends a message every 10 minutes, and when the pageserver doesn't receive 3 messages in a row, it considers the replica to be disabled.
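A minimal sketch of what such a lease could look like on the pageserver side; the `ReplicaLease` name is hypothetical, and the 10-minute / 3-miss thresholds are simply the numbers from this comment, not an actual implementation:

```rust
use std::time::{Duration, Instant};

/// Illustrative lease tracker: refreshed on every heartbeat from a replica,
/// and treated as expired after three consecutive 10-minute intervals
/// without a message.
struct ReplicaLease {
    last_heartbeat: Instant,
    heartbeat_interval: Duration,
    missed_allowed: u32,
}

impl ReplicaLease {
    fn new() -> Self {
        Self {
            last_heartbeat: Instant::now(),
            heartbeat_interval: Duration::from_secs(10 * 60),
            missed_allowed: 3,
        }
    }

    /// Called whenever a feedback/heartbeat message arrives from the replica.
    fn refresh(&mut self) {
        self.last_heartbeat = Instant::now();
    }

    /// GC would consult this: while the lease is live, hold back PITR/GC at
    /// the replica's position; once it expires, ignore the replica.
    fn is_live(&self) -> bool {
        self.last_heartbeat.elapsed() < self.heartbeat_interval * self.missed_allowed
    }
}

fn main() {
    let mut lease = ReplicaLease::new();
    lease.refresh();
    println!("replica live: {}", lease.is_live());
}
```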
Won't the usual feedback message help? IIRC we already have it for backpressure, and the pageserver also knows those LSNs via the storage broker. |
PITR is enforced at the PS, and information about the replica flush/apply position is available only at the SK. The problem is that the PS can be connected to one SK1, and the replica to some other SK2. The only components which know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment when GC is performed by the PS, and involving the broker in the process of garbage collection on the PS seems to be overkill. Certainly the SKs can somehow interact with each other or through the walproposer, but that also seems too complicated and fragile. |
Through the broker, the pageserver has information about LSNs on all safekeepers; that is how the pageserver decides which one to connect to. So a safekeeper can advertise the min feedback LSN over all replicas connected to it (if any). Also, most likely, we should use information from the broker when deciding which safekeeper the replica connects to. @arssher what do you think? |
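A sketch of the aggregation a safekeeper could do before publishing to the broker, assuming a simplified `Lsn` type and hypothetical struct names; the real hot-standby feedback message carries write/flush/apply positions:

```rust
/// Simplified stand-in for Neon's Lsn type.
type Lsn = u64;

/// Hot-standby feedback as reported by one replica connected to this safekeeper.
#[derive(Clone, Copy)]
struct StandbyFeedback {
    write_lsn: Lsn,
    flush_lsn: Lsn,
    apply_lsn: Lsn,
}

/// The value the safekeeper would advertise to the broker: the minimum apply
/// position across all replicas currently connected to it, or None if no
/// replica is connected (in which case it imposes no GC horizon).
fn min_standby_apply_lsn(feedbacks: &[StandbyFeedback]) -> Option<Lsn> {
    feedbacks.iter().map(|f| f.apply_lsn).min()
}

fn main() {
    let feedbacks = [
        StandbyFeedback { write_lsn: 0x30, flush_lsn: 0x28, apply_lsn: 0x20 },
        StandbyFeedback { write_lsn: 0x50, flush_lsn: 0x48, apply_lsn: 0x40 },
    ];
    // The pageserver would clamp its GC/PITR cutoff to this LSN.
    println!("advertise {:?}", min_standby_apply_lsn(&feedbacks));
}
```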
Yes, this seems to be the easiest way.
Not necessarily. A replica here is different from the pageserver because it costs something, so we're OK with keeping the standby -> safekeeper connection open all the time as long as the standby is alive, which means the standby can be the initiator of the connection. So what we do currently is just wire all safekeepers into primary_conninfo; if one is down, libpq will try another, etc. If the set of safekeepers changes we need to update the setting, but this is not hard (though it is not automated yet). With the pageserver we can't do the same because we don't want to keep live connections for all existing attached timelines, and the safekeeper learns about new data first, so it should be the initiator of the connection. Using the broker gives another advantage: the pageserver can have an active connection and, at the same time, up-to-date info about other safekeepers' positions, so it can choose better where to connect in complicated scenarios, e.g. when the connection to the current SK is good but it is very slow for whatever reason. Similar, though less powerful, heuristics can be implemented without broker data (e.g. restart the connection if no new data arrives within some period). Also, using the broker on the standby would likely be quite non-trivial because it is gRPC, and I'm not even sure a C gRPC library exists. So it looks like significant work without much gain. |
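For illustration, this is roughly what "wire all safekeepers into primary_conninfo" relies on: libpq accepts a comma-separated host list and falls back to the next host when one is unreachable. The hostnames and port below are placeholders, and Neon-specific connection options are omitted:

```rust
/// Build a primary_conninfo value that lists every safekeeper, so libpq on the
/// standby can fall back to the next host if the current one is down.
/// Hostnames and port are placeholders, not real endpoints.
fn build_primary_conninfo(safekeepers: &[&str], port: u16) -> String {
    // libpq tries the hosts left to right; a single port applies to all of them.
    format!("host={} port={}", safekeepers.join(","), port)
}

fn main() {
    let conninfo = build_primary_conninfo(&["sk-0.local", "sk-1.local", "sk-2.local"], 5454);
    // The standby's postgresql.conf would then contain: primary_conninfo = '<this value>'
    println!("primary_conninfo = '{conninfo}'");
}
```

When the set of safekeepers changes, only this generated string needs to be updated, which matches the "not hard, but not automated yet" remark above.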
On a related note, I'm also very suspicious that the original issue is caused by this -- "doubt that replica lags for 7 days" -- me too. Looking at metrics to understand the standby position would be very useful, but pg_last_wal_replay_lsn is likely not collected :( |
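A minimal sketch of such a probe, assuming the standby can be queried directly; the crate choice (`tokio-postgres`) and the connection string are illustrative, not part of any existing monitoring code:

```rust
// Sketch of a lag probe against a standby. Assumes the tokio and tokio-postgres
// crates and a reachable connection string; both are illustrative only.
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    let (client, connection) =
        tokio_postgres::connect("host=standby.local user=monitor dbname=postgres", NoTls).await?;
    // The connection object performs the actual I/O; drive it in the background.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    // Last WAL position replayed by this standby; comparing it with the
    // primary's flush LSN over time gives the replication lag.
    let row = client
        .query_one("SELECT pg_last_wal_replay_lsn()::text", &[])
        .await?;
    let replay_lsn: String = row.get(0);
    println!("standby replay LSN: {replay_lsn}");
    Ok(())
}
```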
OK, so to summarise all of the above:
|
One of the problems with requesting information about the replica position from the broker is that it is available only as long as the replica is connected to one of the SKs. If the replica is suspended, this information is not available. As far as I understand, only the control plane has information about all replicas. But it is not desirable to:
|
under investigation (will most probably slip to next week) |
Yes, but as Stas wrote somewhere, it's mostly OK to keep data only as long as the replica is around. A newly seeded replica shouldn't lag significantly. Well, there is probably also the standby pinned to an LSN, but that can be addressed separately. |
My concern is that a replica can be suspended because of inactivity. |
@knizhnik to check why the replica needs to download the WAL. |
Hm, how did we end up here? A replica should be suspended due to inactivity. A new start will begin at the latest LSN, so I am not sure why replica suspension is relevant. There are two open questions now:
|
Sorry, my concerns about read-only replica suspension (when there are no active queries) seem to be irrelevant. So a lagged replica cannot be caused by replica suspension. Quite the opposite: suspend and restart of the replica should cause it to "catch up" with the master. A large replication lag between master and replica should be caused by some other reason. Actually I see only two reasons:
Are there links to the projects suffering from this problem? Can we include them in this ticket? Concerning the approach described above: take information about the replica LSN from the broker and use it to restrict the PITR boundary, to prevent GC from removing layers which may be accessed by the replica. There are two kinds of LSNs maintained by the SK: the last committed LSN returned in the response to append requests, and the triple of LSNs (write/flush/apply) included in hot-standby feedback and collected by the SK as the min over all subscribers (PS and replicas). I wonder if the broker can provide access to both of these LSNs now. @arssher ? |
Not everything is published right now, but this is trivial to add; see the LSNs in SafekeeperTimelineInfo |
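To make the suggestion concrete, here is a sketch of the shape of the change being discussed; the struct and field names are hypothetical stand-ins for the real `SafekeeperTimelineInfo` protobuf, and the pageserver-side clamp below is only a model of how the published value could be used:

```rust
/// Simplified stand-in for Neon's Lsn type.
type Lsn = u64;

/// Hypothetical sketch of the broker payload; the real message is the
/// SafekeeperTimelineInfo protobuf, and these field names are illustrative.
struct SafekeeperTimelineInfoSketch {
    /// Positions the safekeeper already knows about itself.
    commit_lsn: Lsn,
    backup_lsn: Lsn,
    remote_consistent_lsn: Lsn,
    /// The addition under discussion: hot-standby feedback aggregated over
    /// replicas connected to this safekeeper (None if there are none).
    min_standby_apply_lsn: Option<Lsn>,
}

/// The pageserver would take the minimum over all safekeepers that report a
/// standby position and use it as an extra lower bound for its GC cutoff.
fn gc_cutoff_from_broker(infos: &[SafekeeperTimelineInfoSketch], planned_cutoff: Lsn) -> Lsn {
    infos
        .iter()
        .filter_map(|i| i.min_standby_apply_lsn)
        .fold(planned_cutoff, |cutoff, standby| cutoff.min(standby))
}

fn main() {
    let infos = [SafekeeperTimelineInfoSketch {
        commit_lsn: 0x100,
        backup_lsn: 0x80,
        remote_consistent_lsn: 0x90,
        min_standby_apply_lsn: Some(0x60),
    }];
    println!("gc cutoff: {:#x}", gc_cutoff_from_broker(&infos, 0x100));
}
```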
status update: in review |
to review it with @MMeent |
@arssher to review the PR |
Hey All - a customer just asked about this in an email thread with me about pricing. Are there any updates we can provide them? |
The problem was identified, and @knizhnik is working on fixing it. But the fix requires time because it affects compute, safekeeper, and pageserver. I suppose in February, we will merge it and ship it. |
We got a support case about this problem today (ZD #2219). Keeping an eye on this thread. |
@knizhnik what's the latest status on this issue? |
There is PR #6357 waiting for one more round of review. |
It also creates a shutdown checkpoint, which is important for ROs to get a list of running xacts faster instead of going through the CLOG. See https://www.postgresql.org/docs/current/server-shutdown.html for the list of modes and signals. Related to #6211
## Problem

We currently use 'immediate' mode in the most commonly used shutdown path, when the control plane calls a `compute_ctl` API to terminate Postgres inside compute without waiting for the actual pod / VM termination. Yet, 'immediate' shutdown doesn't create a shutdown checkpoint, and ROs have a bad time figuring out the list of running xacts during the next start.

## Summary of changes

Use 'fast' mode, which creates a shutdown checkpoint that is important for ROs to get a list of running xacts faster instead of going through the CLOG.

On the control plane side, we poll this `compute_ctl` termination API for 10s; that should be enough, as we don't really write any data at checkpoint time. If it times out, we switch to the slow k8s-based termination anyway.

See https://www.postgresql.org/docs/current/server-shutdown.html for the list of modes and signals. The default VM shutdown hook already uses `fast` mode, see [1].

[1] https://github.com/neondatabase/neon/blob/c9fd8d76937c2031fd4fea1cdf661d6cf4f00dc3/vm-image-spec.yaml#L30-L31

Related to #6211
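For reference, the shutdown modes map to postmaster signals as the linked docs describe (smart = SIGTERM, fast = SIGINT, immediate = SIGQUIT). Below is a sketch of how a supervisor like `compute_ctl` could request a fast shutdown; the use of the `nix` crate and the hard-coded pid are illustrative only, not the actual termination code:

```rust
// Sketch of mapping shutdown modes to postmaster signals, per the linked
// PostgreSQL docs (smart=SIGTERM, fast=SIGINT, immediate=SIGQUIT). The `nix`
// crate and the hard-coded pid are placeholders.
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;

enum ShutdownMode {
    Smart,
    Fast,      // writes a shutdown checkpoint, which is what ROs need
    Immediate, // no shutdown checkpoint; recovery must rebuild running-xacts info
}

fn shutdown_postgres(postmaster_pid: i32, mode: ShutdownMode) -> nix::Result<()> {
    let sig = match mode {
        ShutdownMode::Smart => Signal::SIGTERM,
        ShutdownMode::Fast => Signal::SIGINT,
        ShutdownMode::Immediate => Signal::SIGQUIT,
    };
    kill(Pid::from_raw(postmaster_pid), sig)
}

fn main() {
    // 'fast' is the mode the PR switches to for the compute_ctl termination path.
    if let Err(e) = shutdown_postgres(12345, ShutdownMode::Fast) {
        eprintln!("failed to signal postmaster: {e}");
    }
}
```

With this mapping, the 10s poll on the control-plane side only has to wait for the shutdown checkpoint to complete.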
This week:
|
Regarding hot standby feedback (one of the items in the original checklist), we recently allowed people to enable it in their pg_settings on the control plane side. |
TWIMC, I did that, and it's pretty complicated. There is a flow-chart here https://www.notion.so/neondatabase/Ephemeral-Endpoints-6388264bf28142e79d3b6f6bb6986fe8 TL;DR, it's currently a valid flow to create a normal RO on some branch without a running RW. Basically, in proxy and cplane there is a generic mechanism to spin up various ROs and RWs; it's not necessarily a PITC, it can be just 'at this branch HEAD'. For ephemeral endpoints, i.e. compute |
What do you actually mean by "fully static ROs"?
This "ephemeral endpoints" or "static replicas" still require separate Postgres instance (POD/VM) and separate timeline/task at PS. In principle, creating temporary branch for static replicas is not strictly needed. Its What IMHO will be really useful is to allow time travel without spawning of separate compute. In this case we can access different time slices in the same Postgres cluster. But it seems to be non-trivial because CLOG and other SLUs are now access locally and so it is hard to provide versioning for them. |
I meant that we will start static computes pinned to a specific LSN. Right now, it's turned off in cplane, so for some |
This week:
|
This week:
Heikki's proposal for RO starts and pageserver GC races -- we can create a new 'ephemeral' branch + static endpoint |
once the lag metric looks good, please ping the DBaaS team, e.g. on #proj-observability-for-users about the metrics we can add to the UI? 🙏 |
This week:
For #8484, we can postpone it. The most recent case: https://neondb.slack.com/archives/C03H1K0PGKH/p1722631550388579 Side note for #8484:
I have changed the dashboard to also expose lag in seconds. |
This week:
|
Protocol version 2 has been the default for a while now, and we no longer have any computes running in production that used protocol version 1. This completes the migration by removing support for v1 in both the pageserver and the compute. See issue #6211.
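A sketch of the kind of compatibility check this implies (accept v2, reject v1); the names are hypothetical and not the pageserver's actual negotiation code:

```rust
// Sketch only: illustrates the shape of the compatibility check described
// above (accept v2, reject v1). Names are hypothetical.
#[derive(Debug, PartialEq)]
enum ProtocolVersion {
    V2,
}

fn parse_protocol_version(requested: u32) -> Result<ProtocolVersion, String> {
    match requested {
        2 => Ok(ProtocolVersion::V2),
        1 => Err("protocol version 1 is no longer supported; upgrade the compute".to_string()),
        other => Err(format!("unknown protocol version {other}")),
    }
}

fn main() {
    assert!(parse_protocol_version(2).is_ok());
    assert!(parse_protocol_version(1).is_err());
}
```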
This week:
|
This week:
|
This week:
|
This week:
|
This week:
|
This week:
|
Summary
Original issue we hit was
but then the scope grew quickly. This is the Epic to track the main physical replication work
Tasks
walreceiver did not restart after erroring out #8172
Follow-ups:
Related Epics: