The traditional way to take backups in an RDBMS is to dump the data across all (or a desired subset of) the tables in the database by performing a scan operation. However, in YugabyteDB, a distributed SQL database often chosen for its massive scalability, the data set can become quite large, making a scan-based backup practically infeasible. The distributed backup and restore feature is aimed at making backups and restores efficient even on large data sets.
Note: An in-cluster snapshot refers to a read-only copy of the database created by producing a list of immutable files. A backup is generally an off-cluster copy of a snapshot of the database. For the purposes of this design, we will use the term backup to refer to a snapshot, with the understanding that copying a snapshot off the cluster is a user-driven exercise.
- A snapshot should be consistent with ongoing transactions. A transaction could affect rows across multiple tablets (potentially across multiple nodes). A snapshot should either completely include all the changes made by such a transaction or fully exclude them.
- If any transaction `t` is included in a snapshot, then all transactions that committed prior to `t` should also be included in the snapshot.
- All updates/transactions committed before the snapshot request is initiated should be included in the snapshot. Transactions in progress at the time of issuing the snapshot request may or may not be included in the snapshot.
- Time to perform an in-cluster snapshot should not depend on the dataset size. This implies that the database should not need to scan the data stored in order to perform a snapshot.
- It should be possible to restore snapshots from a cluster into the same cluster or any other target cluster (with different number/type of nodes).
- Authentication and authorization - support backing up users, roles and permissions
- Encryption of data at rest - support backing up data with encryption preserved as well as un-encrypted
- This functionality should work across all the YugabyteDB APIs (YSQL and YCQL)
- Extend this functionality to the YSQL grammar to simplify this process
The goals described above are for performing a base backup of the table. A point-in-time restore (PITR) refers to being able to restore to a specified hybrid timestamp (which correlates with the physical time, but could differ from it by up to the max clock skew). This could be achieved by one of the following techniques:
- Starting from a base backup, apply WAL updates forward in time till the desired hybrid timestamp. This would be the same as starting from an older state of the cluster and moving it forward by applying updates.
- Starting from the data in the cluster (or a later in-cluster snapshot), remove all updates applied after the desired hybrid timestamp. This is semantically the same as applying the undo log to abort select transactions.
Note: The former technique lends itself to implementing incremental backups by saving the WAL files on each node. The latter technique only allows point in time restores, and cannot be used to implement incremental backups.
Support for point in time restores is planned, and will depend on the ability to perform base backups.
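To make the two approaches concrete, here is a minimal, purely illustrative sketch (not YugabyteDB code) of restoring to a desired hybrid timestamp either by rolling a base backup forward through WAL records, or by rolling current data back by dropping newer updates. The `(commit_ts, op)` record shape and the `apply_op` helper are hypothetical.

```python
# Illustrative sketch only: the two PITR strategies, over simplified lists of
# (commit_ts, op) records where op is a (key, value) pair.

def apply_op(state, op):
    key, value = op
    state[key] = value

def pitr_roll_forward(base_backup_ops, wal_ops, restore_ts):
    """Start from a base backup, then re-apply WAL updates up to restore_ts."""
    state = {}
    for _, op in base_backup_ops:            # state as of the base backup
        apply_op(state, op)
    for commit_ts, op in sorted(wal_ops):    # move forward in commit order
        if commit_ts <= restore_ts:
            apply_op(state, op)
    return state

def pitr_roll_back(current_ops, restore_ts):
    """Start from the current data (or a later snapshot) and drop newer updates."""
    state = {}
    for commit_ts, op in sorted(current_ops):
        if commit_ts <= restore_ts:          # semantically: undo anything newer
            apply_op(state, op)
    return state
```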
The snapshot will be initiated from the YSQL or the YCQL query layer, because this would have a dependency on the schema. In the case of either API, snapshots can be performed at the following granularities:
- An entire database
- A set of tables and all their relations (indexes, unique constraints, `SERIAL` types for auto-generated ids, etc.)
- A single table
Note that the current plan is to support the following levels of atomicity in a backup:
- YSQL snapshots: An entire database would get backed up. This would include all relations such as indexes, unique constraints, `SERIAL` types for auto-generated ids, foreign keys, etc.
- YCQL snapshots: A table and its direct relations would be backed up.
The request to perform a point-in-time snapshot of the cluster is initiated via the following API call to the YB-Master leader.
CreateSnapshot(<list of tablets>)
This API request specifies the set of tablets that need to be backed up. The set of tablets determines the scope of what is being backed up - for example, snapshots can be performed at the following granularities:
- An entire database
- A set of tables and all their relations (indexes, unique constraints, `SERIAL` types for auto-generated ids, etc.)
- A single table
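As a rough illustration (not the actual YB-Master API or catalog structures), all of the granularities above reduce to a single flat list of tablet IDs in the request. The `Table` shape below is hypothetical.

```python
# Hypothetical sketch: expressing the snapshot granularities as one tablet list.
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    tablet_ids: list
    relations: list = field(default_factory=list)  # indexes, unique constraints, ...

def tablets_for_tables(tables):
    """A set of tables plus all their relations -> flat list of tablet IDs."""
    tablet_ids = []
    for table in tables:
        tablet_ids.extend(table.tablet_ids)
        for relation in table.relations:
            tablet_ids.extend(relation.tablet_ids)
    return tablet_ids

def tablets_for_database(tables_in_db):
    """An entire database is simply every table (and relation) it contains."""
    return tablets_for_tables(tables_in_db)

# A single table is just tablets_for_tables([table]); the resulting list is then
# passed to CreateSnapshot(<list of tablets>).
```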
In order to ensure that a snapshot is consistent, it has to be performed as of a fixed, cluster-wide timestamp, referred to subsequently as the snapshot hybrid timestamp, or the `snapshot-timestamp` for short. Upon receiving a `CreateSnapshot()` API call, the YB-Master leader first picks a suitable `snapshot-timestamp`.
It is important to ensure that there are no changes to the data in the database older than the `snapshot-timestamp` after a snapshot is taken, while ensuring that all the updates made before the snapshot request are included. Based on how the distributed transactions algorithm of YugabyteDB works, we already know that in order for all recently written transactions to be included, we would need a timestamp computed as follows:
snapshot-timestamp = current_physical_time_on_node() + max_clock_skew_in_cluster
Further, in order for the database to ensure that no updates older than the `snapshot-timestamp` will be applied, we would need to wait till the hybrid logical time on the YB-Master leader advances to at least the following:
snapshot-start-time = snapshot-timestamp + max_clock_skew_in_cluster
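A minimal sketch of this timestamp selection and the subsequent wait, assuming a simple microsecond physical clock as a stand-in for YugabyteDB's hybrid clock (the constant and helper names below are illustrative, not the real implementation):

```python
import time

MAX_CLOCK_SKEW_US = 500_000                  # illustrative bound, e.g. 500 ms

def current_physical_time_on_node():
    """Microseconds since the epoch; stand-in for the node's physical clock."""
    return int(time.time() * 1_000_000)

def pick_snapshot_timestamp():
    # Include all updates committed before the request, even those stamped by
    # nodes whose clocks run ahead of this one by up to the max skew.
    return current_physical_time_on_node() + MAX_CLOCK_SKEW_US

def wait_for_snapshot_start(snapshot_timestamp):
    # Once the local time passes snapshot-timestamp + max skew, no node in the
    # cluster can produce a commit timestamp <= snapshot-timestamp.
    snapshot_start_time = snapshot_timestamp + MAX_CLOCK_SKEW_US
    while current_physical_time_on_node() < snapshot_start_time:
        time.sleep(0.001)
    return snapshot_start_time
```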
Note: It is possible to choose the `snapshot-timestamp` as `current_physical_time_on_node() - max_clock_skew_in_cluster` and the `snapshot-start-time` as `current_physical_time_on_node()`. However, this choice violates the recency design goal, because some of the updates performed in the `(current_physical_time_on_node() - max_clock_skew_in_cluster, current_physical_time_on_node() + max_clock_skew_in_cluster)` time window might not make it into the snapshot.
The system catalog would get backed up in the YB-Master, which includes the schema of the database at the time of the backup. This backup would be very similar to what is captured by the output of:
ysql_dump --schema-only
The YB-Master leader sends RPCs to all YB-TServers asking them to create a consistent snapshot at the hybrid logical time represented by the `snapshot-timestamp`. These API calls would look as follows:
YB-Master leader --> each of the YB-TServers (per tablet):
SnapshotTabletAtHybridTimestamp(snapshot-timestamp, tablet)
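The fan-out could look roughly like the sketch below; `rpc_call` and the tablet-to-leader map are hypothetical stand-ins for the real RPC and catalog layers.

```python
from concurrent.futures import ThreadPoolExecutor

def rpc_call(tserver, method, **kwargs):
    """Placeholder for the real RPC layer; pretend every call succeeds."""
    print(f"{method} -> {tserver}: {kwargs}")
    return True

def snapshot_all_tablets(tablet_leaders, snapshot_timestamp):
    """tablet_leaders: dict mapping tablet_id -> YB-TServer hosting its leader."""
    with ThreadPoolExecutor() as pool:
        futures = {
            tablet_id: pool.submit(
                rpc_call, tserver, "SnapshotTabletAtHybridTimestamp",
                snapshot_timestamp=snapshot_timestamp, tablet=tablet_id)
            for tablet_id, tserver in tablet_leaders.items()
        }
    # The overall snapshot is marked successful only once every tablet responds.
    return all(f.result() for f in futures.values())
```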
Upon receiving this RPC call, the YB-TServers would resolve all provisional writes (writes that are a part of recent transactions) by consulting the transaction status table. Note that, by virtue of waiting out the max clock skew, none of the currently `PENDING` transactions can get a commit timestamp lower than the `snapshot-timestamp`.
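A simplified sketch of this resolution step, assuming an in-memory transaction status table keyed by transaction ID (the structures here are illustrative, not DocDB's actual intent or status-table format):

```python
def resolve_provisional_writes(provisional_writes, txn_status_table):
    """provisional_writes: list of (txn_id, key, value) intents.
    txn_status_table: dict of txn_id -> (status, commit_ts)."""
    applied = []
    for txn_id, key, value in provisional_writes:
        status, commit_ts = txn_status_table[txn_id]
        if status == "COMMITTED":
            applied.append((key, value, commit_ts))  # keep, stamped with commit time
        elif status == "ABORTED":
            continue                                 # discard the intent
        # PENDING transactions can safely be left out of the snapshot: having
        # waited out the max clock skew, they cannot commit with a timestamp
        # lower than the snapshot-timestamp.
    return applied
```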
Once all the `APPLY` records for the provisional writes are replicated across the tablet peers, the tablet leader replicates a `SNAPSHOT` record containing the `snapshot-timestamp`. Applying this record to the tablet Raft log creates a DocDB checkpoint, which involves the following steps (see the sketch after this list):
- A hardlink based snapshot of all the files (which are immutable) for the tablet. This snapshot is created in a temporary directory.
- The temporary checkpoint above is updated with the necessary metadata, such as the `snapshot-timestamp` above which all updates should be ignored when reading these files.
- The temporary checkpoint is finalized and moved into the snapshot folder.
- Once these steps are done, a response is sent to the YB-Master indicating the success of the tablet snapshot.
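Here is a minimal sketch of those steps using plain filesystem operations; the directory layout and metadata format are hypothetical, but the key point is that data files are hardlinked rather than copied or rewritten.

```python
import json
import os

def checkpoint_tablet(tablet_data_dir, snapshot_dir, snapshot_timestamp):
    tmp_dir = snapshot_dir + ".tmp"
    os.makedirs(tmp_dir, exist_ok=True)

    # 1. Hardlink every (immutable) data file into a temporary directory;
    #    no data is copied or rewritten.
    for name in os.listdir(tablet_data_dir):
        os.link(os.path.join(tablet_data_dir, name), os.path.join(tmp_dir, name))

    # 2. Record the metadata needed to read the checkpoint consistently, such as
    #    the snapshot-timestamp above which updates must be ignored.
    with open(os.path.join(tmp_dir, "snapshot_metadata.json"), "w") as f:
        json.dump({"snapshot_timestamp": snapshot_timestamp}, f)

    # 3. Finalize: move the temporary checkpoint into the snapshot folder.
    os.rename(tmp_dir, snapshot_dir)
    # 4. The YB-TServer then reports success for this tablet to the YB-Master.
```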
The above steps ensure that the data does not have to be rewritten, making the operation efficient for larger data sets.
In order to successfully snapshot data at a given hybrid time denoted by the `snapshot-timestamp`, we need to guarantee that partitioned-away nodes do not clean up necessary information. For example, we would need to retain the hybrid timestamps of the actual updates in the data files so that entries later than the `snapshot-timestamp` are ignored. The time window in the past for which these commit timestamps are retained is called the `flashback-read-cutoff-timestamp`. In order to ensure that these timestamps are retained, the `flashback-read-cutoff-timestamp` is propagated from the tablet leaders to the followers, and DocDB anchors are used while taking the snapshot.
- An anchor on a tablet follower is taken the same way as a regular read operation on the tablet leader.
- The anchor should be held for the duration of replication of the snapshot Raft record.
- The tablet leader is the primary authority over assigning the `flashback-read-cutoff-timestamp`. No tablet follower should start a compaction with a higher `flashback-read-cutoff-timestamp` than the committed `flashback-read-cutoff-timestamp` (see the sketch after this list).
- The `flashback-read-cutoff-timestamp` value is recorded in the Raft log.
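The rule in the last two bullets can be sketched as a simple guard on the follower, where the committed cutoff comes from the replicated Raft record (class and method names here are illustrative):

```python
class TabletFollower:
    def __init__(self):
        # Latest flashback-read-cutoff-timestamp committed via the Raft log.
        self.committed_flashback_read_cutoff = 0
        # Hybrid timestamps pinned by in-flight snapshots (DocDB anchors).
        self.anchors = []

    def on_cutoff_replicated(self, flashback_read_cutoff):
        # The tablet leader is the authority; followers only adopt its value.
        self.committed_flashback_read_cutoff = flashback_read_cutoff

    def can_start_compaction(self, proposed_cutoff):
        # Never discard history above the committed cutoff or any held anchor.
        limit = min([self.committed_flashback_read_cutoff] + self.anchors)
        return proposed_cutoff <= limit
```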
Once the YB-Master hears from all the tablets with a success message for their snapshots, it marks the overall snapshot as a success in the system catalog. The snapshot is now ready to use.
Note the following:
- The snapshot represents a distributed, immutable set of files across the various nodes in the cluster.
- The replication factor of the snapshot is the same as that of the table that was snapshotted. This is done to ensure the fault tolerance of the snapshot.