Provide a backup and restore utility for DataJoint pipelines #864

@guzman-raphael

Description

Feature Request

Problem

Currently, users (along with admins) do not have a simple, intuitive means to perform restricted backup and restore operations. Workarounds typically place a large burden on the user to parse the pipeline, or require server-side support. A possible solution could be to define methods such as:

dj.backup(backup_root_path, table)

This would define a working directory for the backup and a table as the 'anchor' for the backup. The table may carry a restriction condition that restricts the records in table and its descendants. From these records, the method would determine all child and parent dependencies (along with any forks resulting from Master-Part relationships). Once all records in the lineage are associated with table, they would be read and compressed into an appropriate file format, e.g. HDF5, NPZ, Parquet, etc. Additionally, a restore.py script could be generated that specifies the DataJoint table classes, with a final step that decompresses and ingests the resulting backup.
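The lineage step described above could be sketched as a plain graph walk. This is a hypothetical illustration only: the adjacency-dict representation and the `collect_lineage` helper are assumptions for this sketch, not part of the DataJoint API.

```python
# Hypothetical sketch of the lineage step in the proposed dj.backup:
# given the pipeline's dependency graph (parent -> children), collect
# the anchor table together with all of its descendants (which carry
# the restricted records) and ancestors (which supply the rows those
# records reference via foreign keys).

def collect_lineage(children, anchor):
    """Return the set of tables whose records must be backed up."""
    # Build the reverse (child -> parents) view of the graph.
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)

    def walk(graph, start):
        # Iterative depth-first traversal from the anchor.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, []))
        return seen

    return walk(children, anchor) | walk(parents, anchor)

# Example pipeline: subject -> session -> recording -> analysis.
pipeline = {
    "subject": ["session"],
    "session": ["recording"],
    "recording": ["analysis"],
}
lineage = collect_lineage(pipeline, "session")
# Anchoring at 'session' pulls in its ancestor and both descendants.
```

A real implementation would read this graph from the pipeline's foreign-key metadata rather than a hand-written dict.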

dj.restore(backup_root_path, database_prefix, connection=None)

This would define a working directory and a namespace (i.e. database_prefix) under which to 'load' all of the backup data. Specifying connection would set the target server but would default to dj.conn() if set to None.
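The namespace remapping implied by database_prefix could look like the sketch below. The name format and both helpers are assumptions for illustration; they are not existing DataJoint functions.

```python
# Hypothetical sketch of how the proposed dj.restore might remap
# backed-up schema names under the target database_prefix before
# creating schemas on the target connection.

def retarget_schema(original_schema, database_prefix):
    """Map an origin schema name into the restore namespace."""
    # e.g. 'lab_ephys' restored under 'restore_' -> 'restore_lab_ephys'
    return database_prefix + original_schema

def restore_plan(backup_schemas, database_prefix):
    """Pair each backed-up schema with its target schema name."""
    return {s: retarget_schema(s, database_prefix) for s in backup_schemas}

plan = restore_plan(["lab_ephys", "lab_behavior"], "restore_")
```

The actual routine would then declare each target schema on the chosen connection and ingest the decompressed records into it.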

These two routines also provide the mechanism for exporting/publishing data from any given DataJoint pipeline.

Requirements

  • Create a compressed representation of a DataJoint pipeline that can be restricted to a particular subset at the origin
  • The saved data must be self-describing and accessible by standard tools
  • Load data into a target database server under a specific schema prefix
  • Loading must work if the data is already partially loaded, allowing for simple synchronization of new data.
  • Maintain performance comparable to, or better than, 70% of mysqldump's runtime.
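The partial-load requirement above suggests an ingest step keyed on primary keys: only rows absent from the target are inserted, so repeated restores act as synchronization. The sketch below illustrates the idea with plain dicts; within DataJoint itself the same effect could plausibly be achieved with insert(..., skip_duplicates=True).

```python
# Stand-alone sketch of idempotent loading: insert only the backup
# rows whose primary key is not already present in the target.

def sync_rows(target, backup_rows, primary_key):
    """Append missing backup rows to target; return how many were added."""
    existing = {tuple(row[k] for k in primary_key) for row in target}
    new_rows = [r for r in backup_rows
                if tuple(r[k] for k in primary_key) not in existing]
    target.extend(new_rows)
    return len(new_rows)

# Target already holds session 1; the backup holds sessions 1 and 2.
target = [{"session_id": 1, "note": "a"}]
backup = [{"session_id": 1, "note": "a"}, {"session_id": 2, "note": "b"}]
added = sync_rows(target, backup, ["session_id"])
# Only session 2 is inserted; rerunning sync_rows adds nothing.
```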

Justification

  • Exposes functionality to the typical user looking to 'copy' a pipeline as a local, workable version.
  • Gives DataJoint admin-level functionality: a means to automate backups, define disaster-recovery processes, etc.
  • Provides an additional method for sharing data outside the data pipeline.

Alternative Considerations

The current workarounds involve manual routines by the user or server-side support (mysqldump, volume-based backups). Both present significant challenges for the typical user.

Additional Research and Context

  • Reference for NPZ files.
  • Reference for Parquet files.

Metadata

Labels

enhancement (Indicates new improvements)
