Usability: Allow exporting of (parts of) the provenance graph with just metadata #18

sphuber · 2023-03-31T10:19:12Z

Motivation

Currently, when exporting (parts of) the provenance graph, all data contained within the nodes is exported. However, there are situations where one may want to include only the metadata of certain nodes. Metadata here refers to the minimal representation of the node within the graph, i.e., the UUID, pk, owner, ctime, mtime, label, description and its links to other nodes.

Some examples where this may come in handy:

- Certain nodes may contain sensitive or proprietary data and cannot be shared.
  - A concrete example of this is the MC3D project. Some of the starting structures were defined by CIF files provided by the Pauling file. The license allowed the relaxed derivative structures to be published, but not the initial structures. This essentially means that the provenance of the relaxed structure from the initial structure cannot be included, because besides the input structure itself, even the calculations contain the structure definition in the input and output files. Manually "cleaning" the nodes before exporting them is not feasible, so the result is that the provenance has to effectively remain private, defeating a large part of the purpose of AiiDA.
  - Another well-known case is that of the aiida-vasp plugin. VASP is a proprietary code, and the pseudopotentials that come with the license cannot be shared either. Since pseudopotentials are inputs to the calculations, this would essentially prohibit the calculations from being shared. The aiida-vasp plugin came up with a custom solution that uses two Data plugins: the PotcarData and the PotcarFileData. The former contains just enough metadata to identify the pseudopotential and is linked to the latter, which contains the actual proprietary pseudopotential content. The PotcarData is used as the input to calculations and is therefore included in exported provenance graphs, thereby preventing proprietary data from leaking.
- Provenance graphs can become large, and the amount of data attached to them even more so. This makes sharing provenance graphs costly. For certain use cases, it may be useful to share just the topology of the graph, without all the data attached to it. This may help applications visualize the graph without having to fetch all the data. Once the actual data is required, it can be retrieved by using the metadata of the nodes in the graph to identify them.

Desired Outcome

It should be possible to export merely the network of (part of) the provenance graph with just the metadata of the nodes. At a minimum, this should be controlled with a boolean switch. In a more advanced case, one could think of a solution that allows defining, on a per-node basis, whether data should be included with the metadata.
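To make the two levels of control concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and for illustration only: `Node`, `export_nodes`, `include_data` and `include_data_for` are not part of aiida-core, and only a subset of the metadata fields listed above is modelled.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


# Hypothetical stand-in for a node; NOT the aiida-core ORM class.
@dataclass
class Node:
    uuid: str
    label: str = ""
    description: str = ""
    links: list = field(default_factory=list)  # UUIDs of linked nodes
    data: dict = field(default_factory=dict)   # the actual content of the node


def export_nodes(
    nodes,
    include_data: bool = True,
    include_data_for: Optional[Callable[[Node], bool]] = None,
):
    """Build export records; the metadata is always included.

    ``include_data`` is the global on/off switch. If ``include_data_for`` is
    given, it decides per node and takes precedence over the global switch.
    """
    records = []
    for node in nodes:
        record = {
            "uuid": node.uuid,
            "label": node.label,
            "description": node.description,
            "links": list(node.links),
        }
        if include_data_for is not None:
            keep = include_data_for(node)
        else:
            keep = include_data
        if keep:
            record["data"] = dict(node.data)
        records.append(record)
    return records
```

With such an interface, `export_nodes(nodes, include_data=False)` would export only the topology, while `export_nodes(nodes, include_data_for=lambda n: n.label != "initial")` would withhold the data of selected nodes only.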

Impact

There are already multiple known use-cases, as described in the motivation, that either had to develop costly custom solutions, or simply could not make entire parts of the provenance graph public, which is a detriment to the reproducibility of work performed using AiiDA. Providing a generic solution for this in aiida-core would reduce development and maintenance costs for plugin packages, and increase the reproducibility of science by making it easier to share data.

Complexity

A solution that implements the minimal required functionality of providing a switch during export could perhaps be implemented by simply adding an argument to the function that creates an export archive. This would probably not be complicated to implement, although it remains to be seen how the import code would deal with this. If the archive cannot be imported, it is of little use. However, if the "incomplete" nodes are imported, all other Python API code can suddenly no longer rely on certain data being present. The partial nodes should somehow be clearly marked as such, and the API should have safeguards in the implementation to warn when data is being accessed that is not there.
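As an illustration of what "clearly marked" plus a safeguard could look like, the sketch below uses only the standard library. The names `ImportedNode`, `is_partial`, `get_data` and `MissingDataWarning` are invented for this example and do not exist in aiida-core; whether a warning or an exception is the right behaviour is left open here.

```python
import warnings


class MissingDataWarning(UserWarning):
    """Hypothetical warning: data was requested from a metadata-only node."""


class ImportedNode:
    """Hypothetical node wrapper that knows whether its data was imported."""

    def __init__(self, uuid, data=None):
        self.uuid = uuid
        self._data = data  # None means only the metadata was imported

    @property
    def is_partial(self):
        # Explicit marker that calling code can check before touching the data.
        return self._data is None

    def get_data(self):
        # Safeguard: warn instead of failing silently or with an obscure error.
        if self.is_partial:
            warnings.warn(
                f"node {self.uuid} was imported without its data",
                MissingDataWarning,
                stacklevel=2,
            )
            return None
        return self._data
```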

So although a simple solution may seem possible at first glance, it is more likely that there will be complex knock-on effects further down the road. Combined with the very limited flexibility of the minimal feature set of an on-off switch, it might be necessary to go for a more thorough redesign.

If we only take nodes for now (to simplify the discussion), and ignore other ORM entities such as Computers, Users, etc., a potential solution might be to create two node tables. The first table (let's call it DbNodeRef) would merely store the node's metadata and its position within the entire provenance graph. The second table (let's say DbNode) would contain the actual data of the node. There would be a many-to-one mapping of rows in DbNodeRef to DbNode. When exporting, a user could select for each node whether to export simply the DbNodeRef (containing just the metadata) or to include the linked DbNode entry (thereby including all the data). Importing of partial graphs would now be easy as it would simply import the DbNodeRef entries and only include and link the DbNode entries for those nodes that had one exported. By building this distinction of the node metadata and data into the database level, it would be easy to detect which nodes are complete and which are merely the metadata reference.
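The two-table idea can be sketched with an in-memory SQLite database. This is only an illustration under assumptions: the table and column names (`db_node_ref`, `db_node`, `node_id`, `attributes`) and the helper `partial_uuids` are made up for the example, and the real schema would carry all metadata fields and the links between nodes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for DbNode: the actual data of a node.
    CREATE TABLE db_node (
        id INTEGER PRIMARY KEY,
        attributes TEXT NOT NULL
    );
    -- Stand-in for DbNodeRef: metadata only, many-to-one onto db_node.
    CREATE TABLE db_node_ref (
        id INTEGER PRIMARY KEY,
        uuid TEXT NOT NULL UNIQUE,
        label TEXT NOT NULL DEFAULT '',
        node_id INTEGER REFERENCES db_node(id)  -- NULL: data was not imported
    );
    """
)

# One node imported with its data, one imported as a metadata-only reference.
conn.execute(
    "INSERT INTO db_node (id, attributes) VALUES (?, ?)",
    (1, '{"energy": -1.0}'),
)
conn.executemany(
    "INSERT INTO db_node_ref (uuid, label, node_id) VALUES (?, ?, ?)",
    [
        ("uuid-relaxed", "relaxed structure", 1),
        ("uuid-initial", "initial structure", None),
    ],
)


def partial_uuids(connection):
    """Return the UUIDs of nodes for which only the metadata is present."""
    rows = connection.execute(
        "SELECT uuid FROM db_node_ref WHERE node_id IS NULL ORDER BY uuid"
    )
    return [row[0] for row in rows]
```

Because completeness is encoded as a nullable reference, detecting partial nodes becomes a single query rather than a per-node inspection.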

As a beneficial side effect, it would also solve the problem of data duplication. Currently, if you store a node twice, the data will be stored twice as well. With this new concept, there could be two rows in DbNodeRef that each point to a single row in DbNode. Some data de-duplication is already implemented for the file repository of the default psql_dos storage backend, but this does not apply to the database or to other storage backends.
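A small sketch of how two references could share one data row is given below. It assumes that identical data is detected by hashing its content, which is one possible approach and not something this proposal specifies; `DedupStore`, `data_rows` and `ref_rows` are hypothetical names.

```python
import hashlib
import json


class DedupStore:
    """Hypothetical store in which several references share one data row."""

    def __init__(self):
        self.data_rows = {}  # content hash -> data (stand-in for DbNode)
        self.ref_rows = {}   # node UUID -> content hash (stand-in for DbNodeRef)

    def store(self, uuid, data):
        # Assumption: equal JSON-serialisable content means equal data.
        serialized = json.dumps(data, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(serialized).hexdigest()
        # The data is written only the first time this content is seen.
        self.data_rows.setdefault(key, data)
        self.ref_rows[uuid] = key
        return key
```

Storing the same content under two different UUIDs then yields two reference rows but only a single data row.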

Progress

This idea is currently completely hypothetical and no concrete steps in fleshing out the design or implementing it have been undertaken.

sphuber added the roadmap/proposed A roadmap item that has been proposed but not yet processed label Mar 31, 2023

sphuber mentioned this issue Mar 31, 2023

Fully migrate old roadmap #6
