Description
Feature Request
Problem
Currently, users (and admins) have no simple, intuitive way to perform restricted backup and restore operations. Workarounds typically place a large burden on the user to parse the pipeline, or require server-side support. A possible solution could be to define methods such as:
dj.backup(backup_root_path, table)
This would define a working directory for the backup and a `table` as the 'anchor' for the backup. The `table` may carry a restriction condition that limits the records in `table` and its descendants. From these records, the method would determine all child and parent dependencies (along with any forks resulting from Master-Part relationships). Once all records in the lineage of `table` are identified, they would be read and compressed into an appropriate file format, e.g. HDF5, NPZ, Parquet, etc. Additionally, a `restore.py` script could be written that defines the DataJoint table classes, with a last step to decompress and ingest the resulting backup.
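The dependency resolution described above could be sketched roughly as follows. This is only an illustration using a plain dictionary as the dependency graph; the table names, the `PARENTS` mapping, and `collect_lineage` are hypothetical and not part of DataJoint. It gathers the anchor and everything downstream of it, then adds every upstream table those records depend on, and orders the result so that parents come before children.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency graph: each table maps to the set of its parents.
PARENTS = {
    "Subject": set(),
    "Equipment": set(),
    "Session": {"Subject"},
    "Scan": {"Session", "Equipment"},
    "Scan.Frame": {"Scan"},  # part table of the master Scan
    "Lab": set(),            # unrelated to the anchor used below
}


def collect_lineage(parents, anchor):
    """Return the tables needed to back up `anchor`, parents first."""
    # Invert the parent mapping to get a child mapping.
    children = {name: set() for name in parents}
    for child, deps in parents.items():
        for parent in deps:
            children[parent].add(child)

    def reach(start, edges):
        # Depth-first walk returning every node reachable from `start`.
        seen, stack = set(), list(start)
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(edges[node])
        return seen

    # The anchor and everything downstream of it.
    downstream = reach([anchor], children)
    # Every upstream table that the downstream tables reference.
    lineage = reach(downstream, parents)
    # Order the lineage so each table follows all of its parents.
    subgraph = {name: parents[name] & lineage for name in lineage}
    return list(TopologicalSorter(subgraph).static_order())
```

Returning a parents-first order is a deliberate choice here: the same order could be reused at restore time so that foreign key references are satisfied as each table is ingested.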
dj.restore(backup_root_path, database_prefix, connection=None)
This would define a working directory and a namespace (i.e. `database_prefix`) under which to 'load' all of the backup data. Specifying `connection` would set the target server location; if set to `None`, it would default to `dj.conn()`.
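As a rough sketch of how these arguments might be resolved, the snippet below maps schema names found in a backup onto prefixed target names and falls back to a default connection when none is passed. `plan_restore` and its `default_connection` parameter are hypothetical; in the proposal the fallback would be `dj.conn()`, and plain concatenation of the prefix is an assumption about the naming scheme.

```python
def plan_restore(source_schemas, database_prefix, connection=None,
                 default_connection=None):
    """Resolve the target connection and the prefixed schema names."""
    # Fall back to the default only when no connection was supplied.
    if connection is None:
        if default_connection is None:
            raise ValueError("no connection given and no default available")
        connection = default_connection()
    # Assumption: target names are the prefix followed by the source name.
    targets = {name: f"{database_prefix}{name}" for name in source_schemas}
    return connection, targets
```

For example, `plan_restore(["lab", "imaging"], "restored_", default_connection=dj.conn)` would be one way to wire in the default described above.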
These two routines would also provide a mechanism for exporting or publishing data from any given DataJoint pipeline.
Requirements
- Create a compressed representation of a DataJoint pipeline that can be restricted to a particular subset of the source data
- The saved data must be self-describing and accessible by standard tools
- Load data into a target database server under a specific schema prefix
- Loading must work if the data is already partially loaded, allowing for simple synchronization of new data.
- Maintain performance comparable to (or better than) 70% of `mysqldump`'s runtime.
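To make the self-describing and partial-load requirements concrete, here is a minimal sketch that uses only the Python standard library. It swaps in gzip-compressed JSON for the formats suggested earlier (HDF5, NPZ, Parquet) and SQLite for the target MySQL server, so it illustrates the idea rather than the proposed implementation. `write_backup` and `load_backup` are hypothetical names.

```python
import gzip
import json
import sqlite3


def write_backup(path, table, primary_key, columns, rows):
    """Write one table as gzip-compressed JSON that carries its own header."""
    payload = {
        "table": table,              # source table name
        "primary_key": primary_key,  # list of key column names
        "columns": columns,          # all column names, in row order
        "rows": rows,                # list of row value lists
    }
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)


def load_backup(path, conn, prefix):
    """Load a backup file and return how many rows were newly inserted."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        payload = json.load(f)
    # Note: names are interpolated into SQL here for brevity; a real
    # implementation would need to validate them first.
    name = f"{prefix}{payload['table']}"
    cols = ", ".join(payload["columns"])
    key = ", ".join(payload["primary_key"])
    marks = ", ".join("?" for _ in payload["columns"])
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{name}" ({cols}, PRIMARY KEY ({key}))'
    )
    before = conn.total_changes
    # Rows whose primary key already exists are skipped, not overwritten.
    conn.executemany(
        f'INSERT OR IGNORE INTO "{name}" ({cols}) VALUES ({marks})',
        payload["rows"],
    )
    conn.commit()
    return conn.total_changes - before
```

Because existing keys are skipped, loading the same file twice inserts nothing the second time, which is the behaviour the synchronization requirement asks for. In DataJoint itself, `insert(..., skip_duplicates=True)` plays a similar role.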
Justification
- Exposes functionality to a typical user looking to 'copy' a pipeline as a local, workable version.
- Allows DataJoint to offer admin-level functionality for automating backups, defining disaster recovery processes, etc.
- Provides an additional method for sharing data outside the data pipeline.
Alternative Considerations
The current workaround involves manual routines written by the user or server-side support (via `mysqldump` or volume-based backups). Both present significant challenges for a typical user.