Motivation
Currently, when exporting (parts of) the provenance graph, all data that is contained within the nodes is exported. However, there are situations where one may want to include only the metadata of certain nodes. Metadata here refers to the minimal representation of the node within the graph, i.e., the UUID, pk, owner, ctime, mtime, label, description and its links to other nodes.
Some examples where this may come in handy:
Certain nodes may contain sensitive or proprietary data and cannot be shared.
A concrete example of this is the MC3D project. Some of the starting structures were defined by CIF files provided by the Pauling File. The license allowed the relaxed derivative structures to be published, but not the initial structures. This essentially means that the provenance of the relaxed structure from the initial structure cannot be included, because besides the input structure itself, even the calculations contain the structure definition in their input and output files. Manually "cleaning" the nodes before exporting them is not feasible, so the provenance effectively has to remain private, defeating a large part of the purpose of AiiDA.
Another well-known case is that of the aiida-vasp plugin. VASP is a proprietary code, and the pseudopotentials that come with the license cannot be shared either. Since pseudopotentials are inputs to the calculations, this would essentially prohibit the calculations from being shared. The aiida-vasp plugin came up with a custom solution that uses two Data plugins: PotcarData and PotcarFileData. The former contains just enough metadata to identify the pseudopotential and is linked to the latter, which contains the actual proprietary pseudopotential content. The PotcarData is used as the input to calculations and is therefore included in exported provenance graphs, while the proprietary content in PotcarFileData is not, thereby preventing proprietary data from leaking.
Provenance graphs can become large, and the amount of data attached to them even more so. This makes sharing provenance graphs costly. For certain use cases, it may be useful to share just the topology of the graph, without all the data attached to it. This may help applications visualize the graph without having to fetch all the data. Once the actual data is required, it can then be retrieved by using the metadata of the nodes in the graph to identify them.
Desired Outcome
It should be possible to export merely the network of (part of) the provenance graph with just the metadata of the nodes. At a minimum, this should be controlled with a boolean switch. In a more advanced case, one could think of a solution that lets the user define, on a per-node basis, whether data should be included with the metadata.
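To make the two levels of control concrete, the following is a minimal sketch, not an existing aiida-core interface. The `Node` dataclass, the `export_nodes` function, and its `metadata_only` and `include_data` parameters are all hypothetical names introduced for illustration. The boolean switch covers the minimal case, and an optional per-node predicate covers the more advanced one.

```python
from dataclasses import dataclass, field

# Fields that make up the "metadata" of a node as described in the motivation.
METADATA_FIELDS = ("uuid", "pk", "owner", "ctime", "mtime", "label", "description")


@dataclass
class Node:
    """Toy stand-in for a provenance node (hypothetical, not the AiiDA ORM)."""

    uuid: str
    pk: int
    owner: str
    ctime: str
    mtime: str
    label: str = ""
    description: str = ""
    attributes: dict = field(default_factory=dict)  # the actual node data
    links: list = field(default_factory=list)  # (link_label, target_uuid) pairs


def export_nodes(nodes, metadata_only=False, include_data=None):
    """Serialize nodes to plain dictionaries.

    ``metadata_only`` is the global boolean switch. If the hypothetical
    ``include_data`` callable is given, it decides per node and takes
    precedence over the switch.
    """
    exported = []
    for node in nodes:
        entry = {name: getattr(node, name) for name in METADATA_FIELDS}
        entry["links"] = list(node.links)
        if include_data is not None:
            with_data = include_data(node)
        else:
            with_data = not metadata_only
        # Mark partial nodes explicitly so that an importer can detect them.
        entry["partial"] = not with_data
        if with_data:
            entry["attributes"] = dict(node.attributes)
        exported.append(entry)
    return exported
```

For the MC3D-like case, one could then call `export_nodes(nodes, include_data=lambda n: n.label != "initial")` to keep the topology intact while withholding the data of the restricted nodes.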
Impact
There are already multiple known use cases, as described in the motivation, where users either had to develop costly custom solutions or simply could not make entire parts of the provenance graph public, to the detriment of the reproducibility of work performed using AiiDA. Providing a generic solution for this in aiida-core would reduce development and maintenance costs for plugin packages, and increase the reproducibility of science by making it easier to share data.
Complexity
A solution that implements the minimal required functionality of providing a switch during export could perhaps be implemented by simply adding an argument to the function that creates an export archive. This would probably not be complicated to implement, although it remains to be seen how the import code would deal with this. If it cannot be imported, the export archive is of little use. However, if the "incomplete" nodes are imported, all other Python API code can suddenly no longer rely on certain data being present. The partial nodes should somehow be clearly marked as such, and the API should have safeguards in the implementation to warn when data is being accessed that is not there.
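As an illustration of what such a safeguard could look like, here is a minimal sketch. The `PartialAwareNode` class, its `is_partial` property, and the `MissingNodeDataWarning` category are hypothetical and not part of aiida-core. The choice to warn and return a default, rather than raise, is also just one possible policy.

```python
import warnings


class MissingNodeDataWarning(UserWarning):
    """Emitted when data is requested from a metadata-only node (hypothetical)."""


class PartialAwareNode:
    """Toy node that knows whether its data was imported (hypothetical)."""

    def __init__(self, uuid, label="", attributes=None):
        self.uuid = uuid
        self.label = label
        # ``None`` means the node was imported with metadata only.
        self._attributes = attributes

    @property
    def is_partial(self):
        """Return True if only the metadata of this node is available."""
        return self._attributes is None

    def get_attribute(self, key, default=None):
        """Return an attribute, warning if the node carries no data at all."""
        if self.is_partial:
            warnings.warn(
                f"node {self.uuid} was imported without data; "
                f"returning the default for {key!r}",
                MissingNodeDataWarning,
                stacklevel=2,
            )
            return default
        return self._attributes.get(key, default)
```

Callers that cannot tolerate missing data could check `is_partial` up front, or turn the warning into an error with `warnings.simplefilter("error", MissingNodeDataWarning)`.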
So although a simple solution may seem possible at first glance, it is more likely that there will be complex knock-on effects further down the road. Combined with the very limited flexibility of the minimal feature set of an on-off switch, it might be necessary to go for a more thorough redesign.
If we only take nodes for now (to simplify the discussion), and ignore other ORM entities such as Computers, Users, etc., a potential solution might be to create two node tables. The first table (let's call it DbNodeRef) would merely store the node's metadata and its position within the entire provenance graph. The second table (let's say DbNode) would contain the actual data of the node. There would be a many-to-one mapping of rows in DbNodeRef to DbNode. When exporting, a user could select for each node whether to export simply the DbNodeRef (containing just the metadata) or to include the linked DbNode entry (thereby including all the data). Importing of partial graphs would now be easy as it would simply import the DbNodeRef entries and only include and link the DbNode entries for those nodes that had one exported. By building this distinction of the node metadata and data into the database level, it would be easy to detect which nodes are complete and which are merely the metadata reference.
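The two-table idea can be sketched with the standard-library `sqlite3` module. This is only an illustration under stated assumptions: the table and column names (`db_node`, `db_node_ref`, `node_id`) are hypothetical, the real schema would carry many more columns, and a nullable foreign key is just one way to mark a metadata-only reference.

```python
import sqlite3

# Hypothetical schema: db_node_ref holds metadata, db_node holds the data.
SCHEMA = """
CREATE TABLE db_node (
    id INTEGER PRIMARY KEY,
    attributes TEXT NOT NULL
);
CREATE TABLE db_node_ref (
    id INTEGER PRIMARY KEY,
    uuid TEXT NOT NULL UNIQUE,
    label TEXT NOT NULL DEFAULT '',
    node_id INTEGER REFERENCES db_node(id)  -- NULL marks a metadata-only reference
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A complete node: a reference row linked to a data row.
conn.execute("INSERT INTO db_node (id, attributes) VALUES (1, ?)", ('{"cell": [1, 2, 3]}',))
conn.execute("INSERT INTO db_node_ref (uuid, label, node_id) VALUES ('u-relaxed', 'relaxed', 1)")
# A partial node: only the reference row was imported.
conn.execute("INSERT INTO db_node_ref (uuid, label, node_id) VALUES ('u-initial', 'initial', NULL)")


def partial_uuids(connection):
    """Return the UUIDs of nodes for which only the metadata is present."""
    rows = connection.execute(
        "SELECT uuid FROM db_node_ref WHERE node_id IS NULL ORDER BY uuid"
    )
    return [row[0] for row in rows]


def complete_uuids(connection):
    """Return the UUIDs of nodes whose data row is present."""
    rows = connection.execute(
        "SELECT r.uuid FROM db_node_ref AS r "
        "JOIN db_node AS n ON n.id = r.node_id ORDER BY r.uuid"
    )
    return [row[0] for row in rows]
```

With this layout, detecting partial nodes is a single query on the reference table, which is the property the paragraph above relies on.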
As a beneficial side effect, it would also solve the problem of data duplication. Currently, if you store a node twice, the data is stored twice as well. With this new concept, there could be two rows in DbNodeRef that each point to the same single row in DbNode. Some data deduplication is already implemented for the file repository of the default psql_dos storage backend, but this does not apply to the database or to other storage backends.
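The deduplication effect can be illustrated with a small in-memory sketch. It assumes that identical data is detected through a content hash, which the proposal does not specify; the `NodeStore` class and its attribute names are hypothetical.

```python
import hashlib
import json


class NodeStore:
    """Toy store separating references from data (hypothetical)."""

    def __init__(self):
        self.data = {}  # content hash -> attributes, plays the role of DbNode
        self.refs = {}  # uuid -> content hash, plays the role of DbNodeRef

    def store(self, uuid, attributes):
        """Store a node; identical attributes share a single data entry."""
        # Assumption: a SHA-256 hash of the canonical JSON identifies the content.
        payload = json.dumps(attributes, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        self.data.setdefault(key, attributes)
        self.refs[uuid] = key
        return key
```

Storing the same attributes under two different UUIDs then yields two reference entries that share one data entry.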
Progress
This idea is currently completely hypothetical, and no concrete steps toward fleshing out the design or implementing it have been taken.