Implement "timelines" in page server #43
Conversation
So, this is what I've been hacking this week. It grew into a much bigger patch than I intended; sorry for the massive PR. The main point is to:
I started this originally as a CLI project, but there are actually only a few changes to the CLI. Most of the changes are in the page server. My next plan is to make the corresponding changes in the walkeeper so that the walkeeper tests pass again. But since I've been hacking on this for a while, I wanted to share it and get feedback now.
Thanks for unifying things, I quickly read diff -- looks good. some random notes:
On 18 April 2021 11:13:17 EEST, Stas Kelvich ***@***.***> wrote:
Thanks for unifying things, I quickly read diff -- looks good.
some random notes:
* why do we need timeline_id in branch string ***@***.***/15D3DD8". Isn't branch == timeline and ***@***.***D3DD8 is enough?
Ah, that "2"" is not a timeline id. It's just part of the LSN, the high bits, in the standard Postgres notation.
* with this format we are bringing pg-style data directories to the pageserver. We should be cautious not to use any non-relation files directly; if we come to depend on that, it would cut off paths to sharding.
Agreed. We should perhaps store the non-relation files in a .tar.gz file in the snapshots dir, to make it more clear that the page server is not supposed to look at them, except when it regurgitates them back to a client that asks for a base backup.
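To make that concrete, here is a rough sketch of the idea, not the actual implementation: the helper name, the archive name, and the choice of directories are made up for illustration, using the tar and flate2 crates.

use flate2::{write::GzEncoder, Compression};
use std::fs::File;
use std::path::Path;

// Hypothetical helper: bundle the non-relation files of a snapshot into a
// single .tar.gz, so the page server never reads them individually.
fn archive_nonrel_files(snapshot_dir: &Path) -> std::io::Result<()> {
    let out = File::create(snapshot_dir.join("nonrelfiles.tar.gz"))?;
    let enc = GzEncoder::new(out, Compression::default());
    let mut ar = tar::Builder::new(enc);
    // The exact set of non-relation directories is TBD; "global" is just an example.
    ar.append_dir_all("global", snapshot_dir.join("global"))?;
    let enc = ar.into_inner()?; // finish writing the tar archive
    enc.finish()?;              // flush the gzip stream
    Ok(())
}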
* system_id was a handy concept =) I'm now working on db creation on the cloud side and was planning to use pageserver for storing databases. I can use timelines for different users/databases (technically they all inherit the initdb-created database), but that feels weird. We can create different "repos" in the pageserver on the fly -- but does this degree of freedom bring anything to us? We could just create different branches as databases with different system_ids. Or am I missing some technical aspect of why that is a bad idea?
Each systemid should have a separate namespace of timelineids. So if there is timeline "1234abcd" in systemid 1, it should be possible to also have timelineid "1234abcd" in systemid 2. Perhaps treat the systemid as a prefix of the timelineid. But for multi-tenancy, that's not quite enough, it must be possible for multiple tenants to have a cluster with the same systemid.
Maybe we need to support having two clusters with the same systemid even within one tenant. Not sure. That situation could arise out of bad luck, since the systemid isn't all that random, and not very long. Or more likely, if you make a copy of the data dir at some point to create a new cluster.
At the end of the day, we can't depend on any user- or client-generated ID to distinguish clusters across tenants. We need to have a server-generated "tenant id" as part of the key.
- Heikki
Force-pushed from 1f6dea9 to f57d1c7.
Rebased, and made the required change to the WAL safekeeper and safekeeper_proxy, to make them also work with the timelines. All the tests pass on my laptop now. Please take a look. Barring objections, I'll push this later today. Some notes on what to do next:
Force-pushed from 5631954 to c3dd1f3.
pub fn from(b: [u8; 16]) -> ZTimelineId {
    ZTimelineId(b)
}
This should probably be inside impl From<[u8; 16]> for ZTimelineId { ... }
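For reference, a minimal sketch of that, reusing the existing body unchanged:

impl From<[u8; 16]> for ZTimelineId {
    fn from(b: [u8; 16]) -> ZTimelineId {
        ZTimelineId(b)
    }
}

Callers can then write ZTimelineId::from(buf) or buf.into().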
return Err(io::Error::new(
    io::ErrorKind::InvalidInput,
    "no null-terminator in string",
));
anyhow::bail!?
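Roughly what that could look like, assuming the surrounding function already returns anyhow::Result (the condition name is made up for illustration):

if !found_null_terminator {
    anyhow::bail!("no null-terminator in string");
}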
let mut last_snapshot_lsn = 0;
for direntry in fs::read_dir(&snapshotspath).unwrap() {
    let filename = direntry.unwrap().file_name().to_str().unwrap().to_owned();
Should we propagate these errors with context instead of panicking on unwrap?
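A possible shape of that, assuming the enclosing function returns anyhow::Result; the messages are placeholders, and to_string_lossy() stands in for the unwrap on the file name:

use anyhow::Context;

for direntry in fs::read_dir(&snapshotspath)
    .with_context(|| format!("could not read snapshots dir {:?}", snapshotspath))?
{
    let direntry = direntry.context("could not read directory entry")?;
    let filename = direntry.file_name().to_string_lossy().into_owned();
    // ...
}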
#[allow(non_snake_case)]
pub fn XLogSegmentOffset(xlogptr: XLogRecPtr, wal_segsz_bytes: usize) -> u32 {
    return (xlogptr as u32) & (wal_segsz_bytes as u32 - 1);
}
Why the non_snake_case names?
These functions are reimplementations of PostgreSQL functions/macros with the same names.
Ah. I'd vote to snake_case them anyway, but I have been the style police enough for this week, so I'll leave it up to you :)
Ah. I'd vote to snake_case them anyway,
Yeah, I could go either way. We could also name them along the lines of pg_XLogSegmentOffset. Although that kind of prefixing isn't very idiomatic in Rust either, because you can use the module name as part of the calls instead, like xlog_utils::XLogSegmentOffset. In #50, I'm moving this so that it becomes postgres_ffi::xlog_utils::XLogSegmentOffset. I hope the postgres_ffi prefix makes it less surprising; the 'ffi' is a hint that there is foreign stuff in it. Then again, that's really long, so we're probably going to use `use` and not spell out the whole module path.
but I have been the style police enough for this week, so I'll leave it up to you :)
Hey, it's only Tuesday, not casual Friday yet :-).
I was going to suggest that some of these might move inside the types they operate on, e.g.
impl XLogSegment {
    pub fn get_offset(lsn: Lsn) -> u32 { ... }
}
pub const XLOG_FNAME_LEN: usize = 24;
pub type XLogRecPtr = u64;
pub type XLogSegNo = u64;
pub type TimeLineID = u32;
Should we make these newtypes so they can't be mixed up with one another?
E.g.
pub struct XlogRecPtr(pub u64);
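A quick sketch of how the newtypes prevent mix-ups; the helper function is only illustrative:

pub struct XLogRecPtr(pub u64);
pub struct XLogSegNo(pub u64);

// Passing an XLogSegNo where an XLogRecPtr is expected is now a compile error.
fn segment_containing(ptr: XLogRecPtr, wal_segsz_bytes: u64) -> XLogSegNo {
    XLogSegNo(ptr.0 / wal_segsz_bytes)
}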
Yeah. XLogRecPtr is another name for LSN, so #49 applies here. These are copy-pasted from xlog_utils.rs, so we should refactor these into some generic module. Perhaps just move xlog_utils.rs to a common crate. I'll push this as it is for now, but this certainly needs cleanup.
I like the idea of a crate where we can drop useful functions/types/etc. that are likely to be shared.
Err(e) => {
    panic!(
    io_error!(
        "Control file {:?} is locked by some other process: {}",
        &control_file_path, e
    );
}
I think we should retire io_error!. .with_context() seems like it handles this case well.
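Something along these lines, assuming the caller returns anyhow::Result; try_lock_control_file is a hypothetical stand-in for whatever currently produces the Err(e):

use anyhow::Context;

let control_file = try_lock_control_file(&control_file_path).with_context(|| {
    format!(
        "Control file {:?} is locked by some other process",
        &control_file_path
    )
})?;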
lazy_static! {
    pub static ref SYSTEMS: Mutex<HashMap<SystemId, Arc<System>>> = Mutex::new(HashMap::new());
    pub static ref TIMELINES: Mutex<HashMap<ZTimelineId, Arc<Timeline>>> = Mutex::new(HashMap::new());
}
Why do we want timelines to be global?
Sorry if this is nitpicky. I don't think I know enough yet to comment on the substance.
}

#[allow(non_snake_case)]
pub fn XLogFileName(tli: TimeLineID, logSegNo: XLogSegNo, wal_segsz_bytes: usize) -> String {
I have implemented this and other WAL utility functions in xlog_utils.rs. Should we use it here? And maybe move other WAL-related stuff into this module?
Yeah, we really should gather these functions into a common module. Perhaps 'wal_decoder.rs' too, although that's currently only used in the page server.
Force-pushed from 4b838d9 to e7b883c.
The PostgreSQL FE/BE RowDescription message was built incorrectly: the columns were sent in the wrong order, and the command tag was missing its NULL terminator. With these fixes, 'psql' understands the reply and shows it correctly.
Force-pushed from e7b883c to 3e0b960.
Force-pushed from 3e0b960 to 63d3db7.
Pushed. I intended to squash the last commit here, "Fixes, per Eric's and Konstantin's comments" before pushing, but forgot. Oops. This left behind some loose ends that I will work on next:
    continue;
}

let archive_fname = relpath.to_str().unwrap().clone();
clippy says:
error: using `clone` on a double-reference; this will copy the reference of type `&str` instead of cloning the inner type
--> pageserver/src/basebackup.rs:69:29
|
69 | let archive_fname = relpath.to_str().unwrap().clone();
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[deny(clippy::clone_double_ref)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#clone_double_ref
The .clone() can be safely deleted. There are several instances of this error in this PR.
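Concretely, the fix is just dropping the call (or using .to_owned() if an owned String is actually wanted):

// before: the .clone() only copies the reference
let archive_fname = relpath.to_str().unwrap().clone();
// after
let archive_fname = relpath.to_str().unwrap();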
use lazy_static::lazy_static;
use tar;
You never need to `use` the name of a crate -- that is always available.
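For example, assuming the 2018 edition, the crate name is always in scope, so this works without any `use tar;` line (the writer here is arbitrary):

let mut ar = tar::Builder::new(Vec::new());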
This replaces the page server's "datadir" concept. The Page Server now
always works with a "Zenith Repository". When you initialize a new
repository with "zenith init", it runs initdb and loads an initial
basebackup of the freshly-created cluster into the repository, on the "main"
branch. A repository can hold multiple "timelines", which can be given
human-friendly names, making them "branches". One page server simultaneously
serves all timelines stored in the repository, and you can have multiple
Postgres compute nodes connected to the page server, as long as they each
operate on a different timeline.

There is a new command "zenith branch", which can be used to fork off
new branches from existing branches.

The repository uses the directory layout described as Repository format
v1 in neondatabase/rfcs#5. It is *highly* inefficient:

- we never create new snapshots. So in practice, it's really just a base
  backup of the initial empty cluster, and everything else is reconstructed
  by redoing all WAL
- when you create a new timeline, the base snapshot and *all* the WAL are
  copied from the old timeline to the new one. There are no smarts about
  referencing the old snapshots/WAL from the ancestor timeline.

To support all this, this commit includes a bunch of other changes:

- Implement "basebackup" functionality in the page server. When you
  initialize a new compute node with "zenith pg create", it connects to the
  page server and requests a base backup of the Postgres data directory on
  that timeline. (The base backup excludes user tables, so it's not as bad
  as it sounds.)
- Have the page server's WAL receiver write the WAL into the timeline dir.
  This allows running a Page Server and Compute Nodes without a WAL
  safekeeper, until we get around to integrating that properly into the
  system. (Even after we integrate the WAL safekeeper, this is perhaps how
  this will operate when you want to run the system on your laptop.)
- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily
  modified to use the new format. It now also restores all WAL.
- The page server no longer scans and restores everything into memory at
  startup. Instead, when the first request is made for a timeline, the
  timeline is slurped into memory at that point.
- The responsibility for telling the page server to "callmemaybe" was moved
  into the Postgres libpqpagestore code. Also, the WAL producer connstring
  can no longer be specified on the pageserver's command line.
- Having multiple "system identifiers" in the same page server is no longer
  supported. I repurposed much of that code to support multiple timelines,
  instead.
- Implemented very basic, incomplete support for PostgreSQL's Extended
  Query Protocol in page_service.rs. It turns out that rust-postgres'
  copy_out() function always uses the extended query protocol to send out
  the command, and I'm using that to stream the base backup from the page
  server.

TODO: I haven't fixed the WAL safekeeper for this scheme, so all the
integration tests involving safekeepers are failing. My plan is to modify
the safekeeper to know about Zenith timelines, too, and modify it to work
with the same Zenith repository format. It only needs to care about the
'.zenith/timelines/<timeline>/wal' directories.
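For orientation, here is a rough sketch of the per-timeline layout this implies. Only the wal directory is named above and the snapshots directory is mentioned earlier in the thread; the rest is an assumption, and the authoritative description is Repository format v1 in neondatabase/rfcs#5.

.zenith/
  timelines/
    <timelineid>/
      snapshots/   # base snapshot(s); currently only the initial basebackup
      wal/         # WAL for this timeline (what the safekeeper needs to care about)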