Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement "timelines" in page server #43

Closed
wants to merge 7 commits into from
Closed

Conversation

hlinnaka
Copy link
Contributor

This replaces the page server's "datadir" concept. The Page Server now
always works with a "Zenith Repository". When you initialize a new
repository with "zenith init", it runs initdb and loads an initial
basebackup of the freshly-created cluster into the repository, on "main"
branch. Repository can hold multiple "timelines", which can be given
human-friendly names, making them "branches". One page server simultaneously
serves all timelines stored in the repository, and you can have multiple
Postgres compute nodes connected to the page server, as long they all
operate on a different timeline.

There is a new command "zenith branch", which can be used to fork off
new branches from existing branches.

The repository uses the directory layout desribed as Repository format
v1 in neondatabase/rfcs#5. It it highly inefficient:

  • we never create new snapshots. So in practice, it's really just a base
    backup of the initial empty cluster, and everything else is reconstructed
    by redoing all WAL

  • when you create a new timeline, the base snapshot and all WAL is copied
    from the new timeline to the new one. There is no smarts about
    referencing the old snapshots/wal from the ancestor timeline.

To support all this, this commit includes a bunch of other changes:

  • Implement "basebackup" funtionality in page server. When you initialize
    a new compute node with "zenith pg create", it connects to the page
    server, and requests a base backup of the Postgres data directory on
    that timeline. (the base backup excludes user tables, so it's not
    as bad as it sounds).

  • Have page server's WAL receiver write the WAL into timeline dir. This
    allows running a Page Server and Compute Nodes without a WAL safekeeper,
    until we get around to integrate that properly into the system. (Even
    after we integrate WAL safekeeper, this is perhaps how this will operate
    when you want to run the system on your laptop.)

  • restore_datadir.rs was renamed to restore_local_repo.rs, and heavily
    modified to use the new format. It now also restores all WAL.

  • Page server no longer scans and restores everything into memory at startup.
    Instead, when the first request is made for a timeline, the timeline is
    slurped into memory at that point.

  • The responsibility for telling page server to "callmemaybe" was moved
    into Postgres libpqpagestore code. Also, WAL producer connstring cannot
    be specified in the pageserver's command line anymore.

  • Having multiple "system identifiers" in the same page server is no
    longer supported. I repurposed much of that code to support multiple
    timelines, instead.

  • Implemented very basic, incomplete, support for PostgreSQL's Extended
    Query Protocol in page_service.rs. Turns out that rust-postgres'
    copy_out() function always uses the extended query protocol to send
    out the command, and I'm using that to stream the base backup from the
    page server.

TODO: I haven't fixed the WAL safekeeper for this scheme, so all the
integration tests involving safekeepers are failing. My plan is to modify
the safekeeper to know about Zenith timelines, too, and modify it to work
with the same Zenith repository format. It only needs to care about the
'.zenith/timelines//wal' directories.

@hlinnaka
Copy link
Contributor Author

So, this is what I've been hacking this week. It grew into a much bigger patch than I intended; sorry for the massive PR.

The main point is to:

  1. Introduce the concept of a "Zenith repository". The repository is the persistent storage format used by the page server.
  2. Introduce the concept of "zenith timelines". One repository holds one postgres cluster, but it can contain multiple timeline of the same cluster. One page server serves them all.

I started this originally as a CLI project, but there is actually only few changes to the CLI. Most of the changes are in the page server.

My plan next is to make the changes in the walkeeper so that the walkeeper tests would pass again. But since I've been hacking on this or a while, I wanted to share and get any feedback now.

@hlinnaka hlinnaka mentioned this pull request Apr 16, 2021
@kelvich
Copy link
Contributor

kelvich commented Apr 18, 2021

Thanks for unifying things, I quickly read diff -- looks good.

some random notes:

  • why do we need timeline_id in branch string "mybranch@2/15D3DD8". Isn't branch == timeline and mybranch@15D3DD8 is enough?

  • with this format we are bringing pg-style data directories to the pageserver. We should be cautious not to use any non-relation files directly, as if we will depend on that -- that would cut out paths to sharding.

  • system_id was a handy concept =) I'm now working on db creation on the cloud side and was planning to use pageserver for storing databases. I can use timelines for different users/databases (technically they all inherit initdb-created database), but that feels weird. We can create different "repos" in the pageserver on the fly -- but does this degree of freedom brings anything to us? We could just create different branches as a databases with different system_id. Or am I missing some technical aspect of why is that a bad idea?

@hlinnaka
Copy link
Contributor Author

hlinnaka commented Apr 18, 2021 via email

@hlinnaka hlinnaka force-pushed the repository-format-v1 branch from 1f6dea9 to f57d1c7 Compare April 19, 2021 10:55
@hlinnaka
Copy link
Contributor Author

Rebased, and made the required change to the WAL safekeeper and safekeeper_proxy, to make them also work with the timelines. All the tests pass on my laptop now.

Please take a look. Barring objections, I'll push this later today.

Some notes on what to do next:

  • This removed the support for restoring from S3. It's not urgent, but we probably want to add that back at some point, but working with the new repository format
  • This repository format V1 is very dumb, it's just a postgres base backup (called snapshot here) + WAL archive. And there isn't even any code to take new snapshots. And when you create a new timeline, all the history is copied to the new timeline, rather than just referencing the history. So there is a lot of work to be done there. We had a slightly smarter format in the old zenith_pull, zenith_push, zenith_slicedice' commands in the vendor/postgres/src/bin/` directory. Those are now obsolete, and probably should be removed, but I'll leave that for another commit.

@hlinnaka hlinnaka requested review from kelvich, lubennikovaav, knizhnik and ericseppanen and removed request for lubennikovaav April 19, 2021 11:02
@hlinnaka hlinnaka force-pushed the repository-format-v1 branch 3 times, most recently from 5631954 to c3dd1f3 Compare April 19, 2021 12:26
pageserver/src/lib.rs Outdated Show resolved Hide resolved
pageserver/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines +36 to +44
pub fn from(b: [u8; 16]) -> ZTimelineId {
ZTimelineId(b)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be inside impl From<[u8; 16]> for ZTimelineId { ... }

Comment on lines +180 to +184
return Err(io::Error::new(
io::ErrorKind::InvalidInput,
"no null-terminator in string",
));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyhow::bail! ?


let mut last_snapshot_lsn = 0;
for direntry in fs::read_dir(&snapshotspath).unwrap() {
let filename = direntry.unwrap().file_name().to_str().unwrap().to_owned();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we propagate these errors with context instead of panicking on unwrap?

Comment on lines +324 to +316
#[allow(non_snake_case)]
pub fn XLogSegmentOffset(xlogptr: XLogRecPtr, wal_segsz_bytes: usize) -> u32 {
return (xlogptr as u32) & (wal_segsz_bytes as u32 - 1);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the non_snake_case names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are reimplementations of PostgreSQL functions/macros with same names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I'd vote to snake_case them anyway, but I have been the style police enough for this week, so I'll leave it up to you :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I'd vote to snake_case them anyway,

yeah, I could go either way. We could also name them along the lines of pg_XLogSegmentOffset. Although that kind of prefixing isn't very idiomatic in Rust either, because you can use the module name as part of the calls instead, like xlog_utils::XLogSegmentOffset. In #50, I'm moving this so that it becomes postgres_ffi::xlog_utils::XLogSegmentOffset. I hope the postgres_ffi prefix makes it less surprising, the 'ffi' is a hint that there is foreign stuff in it. Then again, that's really long, so we're probably going to use use and not spell out the whole module path.

but I have been the style police enough for this week, so I'll leave it up to you :)

Hey, it's only Tuesday, not casual Friday yet :-).

Copy link
Contributor

@ericseppanen ericseppanen Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest that some of these might move inside the types they operate on, e.g.

impl XLogSegment {
    pub fn get_offset(lsn: Lsn) -> u32 { ... }
}

Comment on lines +319 to +311
pub const XLOG_FNAME_LEN: usize = 24;
pub type XLogRecPtr = u64;
pub type XLogSegNo = u64;
pub type TimeLineID = u32;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make these newtypes so they can't be mixed up with one another?

E.g.

pub struct XlogRecPtr(pub u64);

Copy link
Contributor Author

@hlinnaka hlinnaka Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. XLogRecPtr is another name for LSN, so #49 applies here. These are copy-pasted from xlog_utils.rs, so we should refactor these into some generic module. Perhaps just move xlog_utils.rs to a common crate. I'll push this as it is for now, but this certainly needs cleanup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of a crate where we can drop useful functions/types/etc. that are likely to be shared.

Comment on lines 476 to 481
Err(e) => {
panic!(
io_error!(
"Control file {:?} is locked by some other process: {}",
&control_file_path, e
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should retire io_error!. .with_context() seems like it handles this case well.

Comment on lines 356 to 358
lazy_static! {
pub static ref SYSTEMS: Mutex<HashMap<SystemId, Arc<System>>> = Mutex::new(HashMap::new());
pub static ref TIMELINES: Mutex<HashMap<ZTimelineId, Arc<Timeline>>> = Mutex::new(HashMap::new());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want timelines to be global?

zenith/src/main.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@ericseppanen ericseppanen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if this is nitpicky. I don't think I know enough yet to comment on the substance.

pageserver/src/page_cache.rs Show resolved Hide resolved
walkeeper/src/bin/wal_acceptor.rs Show resolved Hide resolved
zenith/src/main.rs Outdated Show resolved Hide resolved
}

#[allow(non_snake_case)]
pub fn XLogFileName(tli: TimeLineID, logSegNo: XLogSegNo, wal_segsz_bytes: usize) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented this and other WAL utility functions in xlog_utils.rs.
Should we use it here? And may be move other WAL-related stuff in this module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we really should gather these functions to a common module. The 'wal_decoder.rs' too, perhaps, although that's currently only used in the page server.

@hlinnaka hlinnaka force-pushed the repository-format-v1 branch from 4b838d9 to e7b883c Compare April 20, 2021 14:16
The PostgreSQL FE/BE RowDescription message was built incorrectly,
the colums were sent in wrong order, and the command tag was missing
NULL-terminator. With these fixes, 'psql' understands the reply and
shows it correctly.
This replaces the page server's "datadir" concept. The Page Server now
always works with a "Zenith Repository". When you initialize a new
repository with "zenith init", it runs initdb and loads an initial
basebackup of the freshly-created cluster into the repository, on "main"
branch. Repository can hold multiple "timelines", which can be given
human-friendly names, making them "branches". One page server simultaneously
serves all timelines stored in the repository, and you can have multiple
Postgres compute nodes connected to the page server, as long they all
operate on a different timeline.

There is a new command "zenith branch", which can be used to fork off
new branches from existing branches.

The repository uses the directory layout desribed as Repository format
v1 in neondatabase/rfcs#5. It it *highly* inefficient:
- we never create new snapshots. So in practice, it's really just a base
  backup of the initial empty cluster, and everything else is reconstructed
  by redoing all WAL

- when you create a new timeline, the base snapshot and *all* WAL is copied
  from the new timeline to the new one. There is no smarts about
  referencing the old snapshots/wal from the ancestor timeline.

To support all this, this commit includes a bunch of other changes:

- Implement "basebackup" funtionality in page server. When you initialize
  a new compute node with "zenith pg create", it connects to the page
  server, and requests a base backup of the Postgres data directory on
  that timeline. (the base backup excludes user tables, so it's not
  as bad as it sounds).

- Have page server's WAL receiver write the WAL into timeline dir. This
  allows running a Page Server and Compute Nodes without a WAL safekeeper,
  until we get around to integrate that properly into the system. (Even
  after we integrate WAL safekeeper, this is perhaps how this will operate
  when you want to run the system on your laptop.)

- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily
  modified to use the new format. It now also restores all WAL.

- Page server no longer scans and restores everything into memory at startup.
  Instead, when the first request is made for a timeline, the timeline is
  slurped into memory at that point.

- The responsibility for telling page server to "callmemaybe" was moved
  into Postgres libpqpagestore code. Also, WAL producer connstring cannot
  be specified in the pageserver's command line anymore.

- Having multiple "system identifiers" in the same page server is no
  longer supported. I repurposed much of that code to support multiple
  timelines, instead.

- Implemented very basic, incomplete, support for PostgreSQL's Extended
  Query Protocol in page_service.rs. Turns out that rust-postgres'
  copy_out() function always uses the extended query protocol to send
  out the command, and I'm using that to stream the base backup from the
  page server.

TODO: I haven't fixed the WAL safekeeper for this scheme, so all the
integration tests involving safekeepers are failing. My plan is to modify
the safekeeper to know about Zenith timelines, too, and modify it to work
with the same Zenith repository format. It only needs to care about the
'.zenith/timelines/<timeline>/wal' directories.
@hlinnaka hlinnaka force-pushed the repository-format-v1 branch from e7b883c to 3e0b960 Compare April 20, 2021 14:35
@hlinnaka hlinnaka force-pushed the repository-format-v1 branch from 3e0b960 to 63d3db7 Compare April 20, 2021 14:57
@hlinnaka
Copy link
Contributor Author

Pushed. I intended to squash the last commit here, "Fixes, per Eric's and Konstantin's comments" before pushing, but forgot. Oops.

This left behind some loose ends that I will work on next:

  • Refactor xlog_utils.rs and the snippets that this PR copy-pasted from xlog_utils.rs into a separate module.
  • Re-implement multi-tenancy in page server and WAL safekeepers

@hlinnaka hlinnaka closed this Apr 20, 2021
continue;
}

let archive_fname = relpath.to_str().unwrap().clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clippy says:

error: using `clone` on a double-reference; this will copy the reference of type `&str` instead of cloning the inner type
  --> pageserver/src/basebackup.rs:69:29
   |
69 |         let archive_fname = relpath.to_str().unwrap().clone();
   |                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[deny(clippy::clone_double_ref)]` on by default
   = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#clone_double_ref

The .clone() can be safely deleted.

There are several instances of this error in this PR.

use lazy_static::lazy_static;
use tar;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You never need to use the name of a crate-- that is always available.

@hlinnaka hlinnaka deleted the repository-format-v1 branch April 27, 2021 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants