Usability: Provide generic mechanism to serialize and deserialize data from and to AiiDA data #17
Comments
Can you elaborate on why that is a problem for the (de)serialization of the It's not like Data ends up directly storing pymatgen.core.Structure / ase.Atoms instances, so as long as our own Data instances are (de)serializable, I think it's fine to have ways of constructing it from non-serializable data types.
In practical terms, I think the part below is what has posed difficulty (in particular in the implementation of
Supporting a set of JSON fields per entity is straightforward in most tools (pydantic for REST API / graphene for GraphQL), but downloading and uploading files typically requires custom solutions. I'm not sure dumping the blobs from the file repository into a JSON response is really the way to go here... if we really believe that is the case, does this mean we should get rid of the distinction between database and file repository altogether? I could imagine a JSON serialization format that includes pointers to binary blobs, and just have a generic "get blob" interface. It would be great to spend some thought on this question.
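To make the "pointers to binary blobs" idea concrete, here is a minimal sketch. Everything in it is hypothetical and not part of aiida-core or aiida-restapi: the `BlobStore` class, the `put_blob` / `get_blob` methods, the `$blob` key and the `attributes` / `files` layout are assumptions chosen purely for illustration, and the store is an in-memory stand-in for the file repository.

```python
import hashlib
import json


class BlobStore:
    """Hypothetical in-memory stand-in for a file repository."""

    def __init__(self):
        self._blobs = {}

    def put_blob(self, content: bytes) -> str:
        # Content-addressed key: the SHA-256 hex digest of the bytes.
        key = hashlib.sha256(content).hexdigest()
        self._blobs[key] = content
        return key

    def get_blob(self, key: str) -> bytes:
        # The generic "get blob" interface: resolve a pointer to its bytes.
        return self._blobs[key]


def serialize_entity(attributes: dict, files: dict, store: BlobStore) -> str:
    """Return a JSON document holding the attributes and one pointer per file."""
    pointers = {
        name: {"$blob": store.put_blob(content)}  # "$blob" is an assumed convention
        for name, content in files.items()
    }
    return json.dumps({"attributes": attributes, "files": pointers}, sort_keys=True)


def load_files(payload: str, store: BlobStore) -> dict:
    """Resolve every blob pointer in a serialized document back to bytes."""
    document = json.loads(payload)
    return {name: store.get_blob(ptr["$blob"]) for name, ptr in document["files"].items()}


store = BlobStore()
payload = serialize_entity({"energy": -1.5}, {"out.bin": b"\x00\x01raw"}, store)
restored = load_files(payload, store)
```

With this layout the JSON response stays small and purely textual, while binary content is fetched separately and only when a client asks for it.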
It is not a problem per se, as long as it is just one of the ways of constructing an instance. The current problem is that it is the only way of doing things. On top of that, there is no way to introspect what serialized data is stored and how (since there is no schema), so third-party applications have to hard-code mappings for data plugins. What we want is a design where, for any entity, the data schema can be discovered dynamically such that data can be serialized and deserialized. This is definitely possible in principle, just not in a backwards-compatible way, for, at the very least, the
Honestly, the exact specifics of the serialization format are not even that important, as long as there is one. I am not necessarily advocating for JSON with integrated blobs, just saying that these are the principal data types, so anything that could support both could be a potential candidate.
Even if we have a shared serialization format to export data out of AiiDA, that does not mean that the exact same format has to be used in the storage. Of course, if for efficiency reasons one would want that, it is already possible to implement a
Motivation
This user story is slightly more technical than it should be, but there is no other real way of phrasing it. Essentially it boils down to the following:
One of AiiDA's main functions for a user is to store large quantities of data. While the Python API provides many tools to interact with and manipulate this data, sooner or later the data will have to leave AiiDA. Conversely, to start working with AiiDA, data will have to be ingested. In other words, data stored in AiiDA will have to be serialized into a certain format when leaving its database, and data has to be deserialized from a certain format when it is ingested.
Currently, there are already two major tools that implement such (de)serialization:
aiida-core and the aiida-restapi package)
The REST API uses JSON to (de)serialize data, but it implements custom translators to do so. This is the core problem: every tool currently has to implement its own code to (de)serialize data since the Python ORM cannot be used. Moreover, there is no single mechanism to determine the "schema" of a piece of data, so it has to be hardcoded.
Desired Outcome
Ideally, AiiDA would provide a generic mechanism to serialize and deserialize any data that can be stored within its database. This would essentially require each ORM type to define a schema of its data structure that can be requested by a client of the API. This would allow external applications to write utilities to reliably extract data from AiiDA or store data within it. The key here is that it should not be necessary to write custom serializers for plugins, but that they are automatically supported through the general formalism.
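As a rough illustration of what "each ORM type defines a schema that a client can request" could look like, here is a minimal sketch using only the Python standard library. None of these names exist in aiida-core: `SchemaData`, `ToyStructure`, and the `schema` / `serialize` / `deserialize` methods are hypothetical, and the schema output is only loosely modelled on JSON Schema.

```python
import json
from dataclasses import asdict, dataclass, fields

# Mapping from Python field types to JSON-Schema-like type names.
_JSON_TYPES = {
    bool: "boolean",
    int: "integer",
    float: "number",
    str: "string",
    list: "array",
    dict: "object",
}


class SchemaData:
    """Hypothetical base class (not part of aiida-core).

    Subclasses declare their fields statically as dataclass fields, so the
    schema can be discovered without any subclass-specific code.
    """

    @classmethod
    def schema(cls) -> dict:
        return {
            "title": cls.__name__,
            "type": "object",
            "properties": {f.name: {"type": _JSON_TYPES[f.type]} for f in fields(cls)},
        }

    def serialize(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def deserialize(cls, payload: str):
        raw = json.loads(payload)
        expected = {f.name for f in fields(cls)}
        if set(raw) != expected:
            raise ValueError(f"expected keys {sorted(expected)}, got {sorted(raw)}")
        # Construction needs only plain JSON types, no third-party Python objects.
        return cls(**raw)


@dataclass
class ToyStructure(SchemaData):
    """Stand-in for a plugin-defined data type."""

    cell: list
    symbols: list
    positions: list
    label: str = ""


original = ToyStructure(
    cell=[[5.4, 0.0, 0.0], [0.0, 5.4, 0.0], [0.0, 0.0, 5.4]],
    symbols=["Si"],
    positions=[[0.0, 0.0, 0.0]],
)
roundtrip = ToyStructure.deserialize(original.serialize())
```

A client that only knows the generic `schema()` / `deserialize()` pair could handle `ToyStructure`, or any other subclass, without a custom translator, which is the property the issue asks for.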
Impact
A successful solution will touch many other use cases, such as the already mentioned REST and web APIs (for example, see #16).
Complexity
In principle, all data in AiiDA is stored either as JSON-serializable data (in the PostgreSQL database) or as binary blobs (in the file repository). The simplest approach would then be a JSON-extended format that includes support for binary blobs. But the exact serialization format is not the real problem; other solutions could be used. The main question is how to have all data in AiiDA define a schema. We could do this for the ORM entities that are shipped with aiida-core, but the tricky part is that this should also work for plugins, such as Data subclasses. A solution should be generic and work regardless of any custom plugins that are installed.
The real difficulty is that, as currently implemented, the interface of the Data class, and especially the way its instances are constructed, allows the use of arbitrary types. For example, StructureData allows construction from the pymatgen.core.Structure or ase.Atoms types. These are typically not generically serializable. It might be necessary to change the interface of Data to force it to statically declare its data schema and to allow construction of an instance from serialized data without requiring Python types. Unfortunately, this would almost certainly require backwards-incompatible changes to the Data interface.
Background
This issue has already been discussed in aiida-core; see this issue where it is being tracked. No concrete advances have been made yet.
Progress
So far, no concrete progress has been made in addressing this problem.