Implement "timelines" in page server #43
Conversation
So, this is what I've been hacking this week. It grew into a much bigger patch than I intended; sorry for the massive PR. The main point is to:
I started this originally as a CLI project, but there are actually only a few changes to the CLI. Most of the changes are in the page server. My next plan is to make the corresponding changes in the walkeeper so that the walkeeper tests pass again. But since I've been hacking on this for a while, I wanted to share it and get feedback now.
Thanks for unifying things, I quickly read diff -- looks good. some random notes:
On 18 April 2021 11:13:17 EEST, Stas Kelvich ***@***.***> wrote:
Thanks for unifying things, I quickly read diff -- looks good.
some random notes:
* why do we need timeline_id in branch string ***@***.***/15D3DD8". Isn't branch == timeline and ***@***.***D3DD8 is enough?
Ah, that "2"" is not a timeline id. It's just part of the LSN, the high bits, in the standard Postgres notation.
* with this format we are bringing pg-style data directories to the pageserver. We should be cautious not to use any non-relation files directly; if we come to depend on that, it would cut off paths to sharding.
Agreed. We should perhaps store the non-relation files in a .tar.gz file in the snapshots dir, to make it more clear that the page server is not supposed to look at them, except when it regurgitates them back to a client that asks for a base backup.
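To make that concrete, here is a rough sketch of the idea, not the actual implementation: the helper name, the archive name, and the choice of directories are made up for illustration, using the tar and flate2 crates.

use flate2::{write::GzEncoder, Compression};
use std::fs::File;
use std::path::Path;

// Hypothetical helper: bundle the non-relation files of a snapshot into a
// single .tar.gz, so the page server never reads them individually.
fn archive_nonrel_files(snapshot_dir: &Path) -> std::io::Result<()> {
    let out = File::create(snapshot_dir.join("nonrelfiles.tar.gz"))?;
    let enc = GzEncoder::new(out, Compression::default());
    let mut ar = tar::Builder::new(enc);
    // The exact set of non-relation directories is TBD; "global" is just an example.
    ar.append_dir_all("global", snapshot_dir.join("global"))?;
    let enc = ar.into_inner()?; // finish writing the tar archive
    enc.finish()?;              // flush the gzip stream
    Ok(())
}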
* system_id was a handy concept =) I'm now working on db creation on the cloud side and was planning to use pageserver for storing databases. I can use timelines for different users/databases (technically they all inherit the initdb-created database), but that feels weird. We can create different "repos" in the pageserver on the fly -- but does this degree of freedom bring anything to us? We could just create different branches as databases with different system_ids. Or am I missing some technical aspect of why that is a bad idea?
Each systemid should have a separate namespace of timelineids. So if there is timeline "1234abcd" in systemid 1, it should be possible to also have timelineid "1234abcd" in systemid 2. Perhaps treat the systemid as a prefix of the timelineid. But for multi-tenancy, that's not quite enough, it must be possible for multiple tenants to have a cluster with the same systemid.
Maybe we need to support having two clusters with the same systemid even within one tenant. Not sure. That situation could arise out of bad luck, since the systemid isn't all that random, and not very long. Or more likely, if you make a copy of the data dir at some point to create a new cluster.
At the end of the day, we can't depend on any user- or client-generated ID to distinguish clusters across tenants. We need to have a server-generated "tenant id" as part of the key.
- Heikki
Force-pushed from 1f6dea9 to f57d1c7.
Rebased, and made the required change to the WAL safekeeper and safekeeper_proxy, to make them also work with the timelines. All the tests pass on my laptop now. Please take a look. Barring objections, I'll push this later today. Some notes on what to do next:
Force-pushed from 5631954 to c3dd1f3.
pub fn from(b: [u8; 16]) -> ZTimelineId {
    ZTimelineId(b)
}
This should probably be inside impl From<[u8; 16]> for ZTimelineId { ... }
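For reference, a minimal sketch of that, reusing the existing body unchanged:

impl From<[u8; 16]> for ZTimelineId {
    fn from(b: [u8; 16]) -> ZTimelineId {
        ZTimelineId(b)
    }
}

Callers can then write ZTimelineId::from(buf) or buf.into().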
return Err(io::Error::new(
    io::ErrorKind::InvalidInput,
    "no null-terminator in string",
));
anyhow::bail!?
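Roughly what that could look like, assuming the surrounding function already returns anyhow::Result (the condition name is made up for illustration):

if !found_null_terminator {
    anyhow::bail!("no null-terminator in string");
}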
let mut last_snapshot_lsn = 0;
for direntry in fs::read_dir(&snapshotspath).unwrap() {
    let filename = direntry.unwrap().file_name().to_str().unwrap().to_owned();
Should we propagate these errors with context instead of panicking on unwrap?
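A possible shape of that, assuming the enclosing function returns anyhow::Result; the messages are placeholders, and to_string_lossy() stands in for the unwrap on the file name:

use anyhow::Context;

for direntry in fs::read_dir(&snapshotspath)
    .with_context(|| format!("could not read snapshots dir {:?}", snapshotspath))?
{
    let direntry = direntry.context("could not read directory entry")?;
    let filename = direntry.file_name().to_string_lossy().into_owned();
    // ...
}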
#[allow(non_snake_case)]
pub fn XLogSegmentOffset(xlogptr: XLogRecPtr, wal_segsz_bytes: usize) -> u32 {
    return (xlogptr as u32) & (wal_segsz_bytes as u32 - 1);
}
Why the non_snake_case names?
These functions are reimplementations of PostgreSQL functions/macros with the same names.
Ah. I'd vote to snake_case them anyway, but I have been the style police enough for this week, so I'll leave it up to you :)
Ah. I'd vote to snake_case them anyway,
Yeah, I could go either way. We could also name them along the lines of pg_XLogSegmentOffset. Although that kind of prefixing isn't very idiomatic in Rust either, because you can use the module name as part of the calls instead, like xlog_utils::XLogSegmentOffset. In #50, I'm moving this so that it becomes postgres_ffi::xlog_utils::XLogSegmentOffset. I hope the postgres_ffi prefix makes it less surprising; the 'ffi' is a hint that there is foreign stuff in it. Then again, that's really long, so we're probably going to use `use` and not spell out the whole module path.
but I have been the style police enough for this week, so I'll leave it up to you :)
Hey, it's only Tuesday, not casual Friday yet :-).
I was going to suggest that some of these might move inside the types they operate on, e.g.
impl XLogSegment {
    pub fn get_offset(lsn: Lsn) -> u32 { ... }
}
pub const XLOG_FNAME_LEN: usize = 24;
pub type XLogRecPtr = u64;
pub type XLogSegNo = u64;
pub type TimeLineID = u32;
Should we make these newtypes so they can't be mixed up with one another?
E.g.
pub struct XlogRecPtr(pub u64);
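A quick sketch of how the newtypes prevent mix-ups; the helper function is only illustrative:

pub struct XLogRecPtr(pub u64);
pub struct XLogSegNo(pub u64);

// Passing an XLogSegNo where an XLogRecPtr is expected is now a compile error.
fn segment_containing(ptr: XLogRecPtr, wal_segsz_bytes: u64) -> XLogSegNo {
    XLogSegNo(ptr.0 / wal_segsz_bytes)
}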
Yeah. XLogRecPtr is another name for LSN, so #49 applies here. These are copy-pasted from xlog_utils.rs, so we should refactor these into some generic module. Perhaps just move xlog_utils.rs to a common crate. I'll push this as it is for now, but this certainly needs cleanup.
I like the idea of a crate where we can drop useful functions/types/etc. that are likely to be shared.
Err(e) => {
    panic!(
    io_error!(
        "Control file {:?} is locked by some other process: {}",
        &control_file_path, e
    );
}
I think we should retire io_error!. .with_context() seems like it handles this case well.
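Something along these lines, assuming the caller returns anyhow::Result; try_lock_control_file is a hypothetical stand-in for whatever currently produces the Err(e):

use anyhow::Context;

let control_file = try_lock_control_file(&control_file_path).with_context(|| {
    format!(
        "Control file {:?} is locked by some other process",
        &control_file_path
    )
})?;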
lazy_static! {
    pub static ref SYSTEMS: Mutex<HashMap<SystemId, Arc<System>>> = Mutex::new(HashMap::new());
    pub static ref TIMELINES: Mutex<HashMap<ZTimelineId, Arc<Timeline>>> = Mutex::new(HashMap::new());
}
Why do we want timelines to be global?
Sorry if this is nitpicky. I don't think I know enough yet to comment on the substance.
}

#[allow(non_snake_case)]
pub fn XLogFileName(tli: TimeLineID, logSegNo: XLogSegNo, wal_segsz_bytes: usize) -> String {
I have implemented this and other WAL utility functions in xlog_utils.rs. Should we use it here? And maybe move other WAL-related stuff into this module?
Yeah, we really should gather these functions into a common module. Perhaps 'wal_decoder.rs' too, although that's currently only used in the page server.
Force-pushed from 4b838d9 to e7b883c.
The PostgreSQL FE/BE RowDescription message was built incorrectly: the columns were sent in the wrong order, and the command tag was missing its NULL terminator. With these fixes, 'psql' understands the reply and shows it correctly.
Force-pushed from e7b883c to 3e0b960.
Force-pushed from 3e0b960 to 63d3db7.
Pushed. I intended to squash the last commit here, "Fixes, per Eric's and Konstantin's comments" before pushing, but forgot. Oops. This left behind some loose ends that I will work on next:
    continue;
}

let archive_fname = relpath.to_str().unwrap().clone();
clippy says:
error: using `clone` on a double-reference; this will copy the reference of type `&str` instead of cloning the inner type
--> pageserver/src/basebackup.rs:69:29
|
69 | let archive_fname = relpath.to_str().unwrap().clone();
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[deny(clippy::clone_double_ref)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#clone_double_ref
The .clone() can be safely deleted. There are several instances of this error in this PR.
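Concretely, the fix is just dropping the call (or using .to_owned() if an owned String is actually wanted):

// before: the .clone() only copies the reference
let archive_fname = relpath.to_str().unwrap().clone();
// after
let archive_fname = relpath.to_str().unwrap();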
use lazy_static::lazy_static;
use tar;
You never need to `use` the name of a crate -- that is always available.
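For example, assuming the 2018 edition, the crate name is always in scope, so this works without any `use tar;` line (the writer here is arbitrary):

let mut ar = tar::Builder::new(Vec::new());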
This replaces the page server's "datadir" concept. The Page Server now
always works with a "Zenith Repository". When you initialize a new
repository with "zenith init", it runs initdb and loads an initial
basebackup of the freshly-created cluster into the repository, on the "main"
branch. A repository can hold multiple "timelines", which can be given
human-friendly names, making them "branches". One page server simultaneously
serves all timelines stored in the repository, and you can have multiple
Postgres compute nodes connected to the page server, as long as they each
operate on a different timeline.

There is a new command "zenith branch", which can be used to fork off
new branches from existing branches.

The repository uses the directory layout described as Repository format
v1 in neondatabase/rfcs#5. It is *highly* inefficient:

- we never create new snapshots. So in practice, it's really just a base
  backup of the initial empty cluster, and everything else is reconstructed
  by redoing all WAL
- when you create a new timeline, the base snapshot and *all* the WAL are
  copied from the old timeline to the new one. There are no smarts about
  referencing the old snapshots/WAL from the ancestor timeline.

To support all this, this commit includes a bunch of other changes:

- Implement "basebackup" functionality in the page server. When you
  initialize a new compute node with "zenith pg create", it connects to the
  page server and requests a base backup of the Postgres data directory on
  that timeline. (The base backup excludes user tables, so it's not as bad
  as it sounds.)
- Have the page server's WAL receiver write the WAL into the timeline dir.
  This allows running a Page Server and Compute Nodes without a WAL
  safekeeper, until we get around to integrating that properly into the
  system. (Even after we integrate the WAL safekeeper, this is perhaps how
  this will operate when you want to run the system on your laptop.)
- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily
  modified to use the new format. It now also restores all WAL.
- The page server no longer scans and restores everything into memory at
  startup. Instead, when the first request is made for a timeline, the
  timeline is slurped into memory at that point.
- The responsibility for telling the page server to "callmemaybe" was moved
  into the Postgres libpqpagestore code. Also, the WAL producer connstring
  can no longer be specified on the pageserver's command line.
- Having multiple "system identifiers" in the same page server is no longer
  supported. I repurposed much of that code to support multiple timelines,
  instead.
- Implemented very basic, incomplete support for PostgreSQL's Extended
  Query Protocol in page_service.rs. It turns out that rust-postgres'
  copy_out() function always uses the extended query protocol to send out
  the command, and I'm using that to stream the base backup from the page
  server.

TODO: I haven't fixed the WAL safekeeper for this scheme, so all the
integration tests involving safekeepers are failing. My plan is to modify
the safekeeper to know about Zenith timelines, too, and modify it to work
with the same Zenith repository format. It only needs to care about the
'.zenith/timelines/<timeline>/wal' directories.
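For orientation, here is a rough sketch of the per-timeline layout this implies. Only the wal directory is named above and the snapshots directory is mentioned earlier in the thread; the rest is an assumption, and the authoritative description is Repository format v1 in neondatabase/rfcs#5.

.zenith/
  timelines/
    <timelineid>/
      snapshots/   # base snapshot(s); currently only the initial basebackup
      wal/         # WAL for this timeline (what the safekeeper needs to care about)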