-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the JsonableData
data plugin
#5017
Add the JsonableData
data plugin
#5017
Conversation
aiida/orm/nodes/data/msonable.py
Outdated
|
||
""" | ||
|
||
def __init__(self, obj, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add type checking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added type checking, but since we are very flexible, all we can say is typing.Any
pretty much. Or is there some better typing we can have still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the best way is to define a Protocol: https://docs.python.org/3/library/typing.html#typing.Protocol, although you need the typing_extensions backport package for python < 3.8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhubonan it would be great if you could give this a spin and see whether this is useful for your use case. |
Thanks for this. I will have a go! |
I just had a try and it works great for me. It can be very useful for storing the provenance of, say a phase diagram construction performed with pymatgen. At the moment I do this by manually converting AiiDA's records into With this new capability, the pymatgen's entries (which has no provenance at all, so I have to manually add in the UUID....) can be stored and linked to AiiDA's workflows with the rovenance), and each phase diagram will also automatically have the full provenance record! One thing that I realised is that However, I think it is too code specific and not really a standard. I am happy to just having the |
I have found a problem with further testing. Not all MSONable objects can be serialised with standard JSON. For example, some pymatgen objects contain NumPy array and NaN values, the latter cannot be stored in PostgreSQL directly because JSON dost not support NaN. I can see that they have implemented special decoders/encoders for these cases to get around it: Example content with
|
Oh actually it is the python's JSON encoder/decoder that does the work of sorting out NaN and infinity: https://docs.python.org/3/library/json.html#json.JSONDecoder |
Hmm, for numpy arrays, what we could think of doing is after calling
Regarding the NaN values, we indeed don't support this because PostgreSQL doesn't support it in the JSONB fields: https://www.postgresql.org/docs/current/datatype-json.html#JSON-TYPE-MAPPING-TABLE |
https://guide.materialsvirtuallab.org/monty/_modules/monty/json.html#MontyEncoder We should able to add this to a pair of encoding and decoding functions that process the https://github.com/python/cpython/blob/3.9/Lib/json/encoder.py#L204 |
Sorry, I think I have mixed some things up. Just to clarify, the support of NumPy array is not something we need to do on our side. Individual classes should implement If we really want this for any class, we could add this functionality by doing the same round trip thing or calling the So the only thing we need to add here is for processing the nan and inf values. I have raised a PR to the branch of this PR @sphuber. |
So the |
No, what I meant is that in my case they just fail because of things like NaN or -inf being not JSON serializable. The numpy issue is a similar but different problem. Like you said the For example:
Because we don't store the content in the database as string, but rather passing the JSON directly to PostgreSQL, hence we don't go through the encoder/decoder stages, but they are needed for many of the classes to work. As you can see, there is an asymmetry - To have the same level of support, we need to make sure when serializing we also go through the encoder. This can be done by processing objects using the In addition, similar pre-procssing and post-processing steps are needed for the |
Two notes that came to mind whilst discussing with @ramirezfranciscof:
|
I think that the MSONable does try to record the module version (not enforced) and when deserialising it is not checked.
I agree. There is no need to check if an object is actually a subclass of |
Note that the whole idea of this plugin was to make |
I probably got you confused... from the specification of What I meant is that we do not have to make We can remove the step using the |
I am afraid I am still confused 😅
Exactly, but the problem is that some (if not many or all) implementations do not respect this requirement at all.
I agree, but apparently some classes don't do this. The whole reason we went in this direction is because you reported that you found that certain I am fine with adding a more generic class, that is not necessarily linked to |
I was wrong about this - most of the |
@sphuber I have made some suggestions on the changes here: sphuber#13. Here is a ad-hoc test script I use:
which is to pull some data entry from the materials project, create a phase diagram and store it into AiiDA. The |
385b4a4
to
d8f88f2
Compare
Codecov Report
@@ Coverage Diff @@
## develop #5017 +/- ##
===========================================
- Coverage 80.39% 80.25% -0.13%
===========================================
Files 529 516 -13
Lines 36882 36814 -68
===========================================
- Hits 29649 29543 -106
- Misses 7233 7271 +38
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Decoupling from the MSONable
class is sensible as the to_dict
, from_dict
interface is a general one - MSONable
simply includes convenient default implementations. The issue with common but non-standard JSON objects such as -inf
, inf
and nan
are also addressed.
d8f88f2
to
8104943
Compare
I think we have converged on the best design. I have updated the PR to reflect this and have renamed all classes to properly reflect the fact that we support anything that is JSON-serializable. In that sense it is no longer directly tied to @chrisjsewell I also addressed the remarks from your initial PR so this PR is now ready for final review. |
MsonableData
data pluginJsonableData
data plugin
b5a38cb
to
6513a79
Compare
@chrisjsewell typing is updated. Ready for final review |
@chrisjsewell I will be merging this tomorrow. Let me know if you still want to review it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cheers, just a minor nitpick
|
||
""" | ||
try: | ||
return self._obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, wait when would this raise an AttributeError
? do we not want to always reload the object, in case the attributes have changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeh the attribute error would be for if its loaded I guess, but yeh what if you did:
data = JsonableData(obj)
data.set_attrbiute('x', 1)
data.obj
would the user expect the obj to be updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is only really a problem though for unstored nodes and I doubt that users will be manually changing the attributes of the serialized object. Since the whole point of this class is to allow simple wrapping of serializable instances, there seems little point to be manually changing the attributes after construction but before storing. If we do want to support this as a use case, then we should indeed always fully reconstruct the obj from the database. This will introduce a significant performance hit since it needs to hit the db and deserialize the entire object. I would argue that this is not worth it, since these nodes will used most often to retrieve the obj
from a stored node, rather than mutating unstored versions of them.
TL;DR: I would keep the current implementation and at most just add a warning in the obj
property docstring that the thing is cached and so messing with attributes manually can break things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh that sounds fine to me 👍
This data plugin is designed to make it easy to wrap instances of arbitrary Python classes and store them as a node in the database. The only requirement is that the class implements the `as_dict` method, which should return a dictionary that represents the instance which is JSON-serializable. The latter means that it can be serialized by the `JsonEncoder` of the `json` built-in module. If the class also implements a `from_dict` method, that can consume the dictionary that is returned by `as_dict`, to reconstruct an instance, the original instance that is wrapped by the node (loaded from the database), can be recovered through the `obj` property.
6513a79
to
9537ef4
Compare
Fixes #4975
This data plugin is designed to make it easy to wrap instances of
arbitrary Python classes and store them as a node in the database. The
only requirement is that the class implements the
as_dict
method,which should return a dictionary that represents the instance which is
JSON-serializable. The latter means that it can be serialized by the
JsonEncoder
of thejson
built-in module.If the class also implements a
from_dict
method, that can consume thedictionary that is returned by
as_dict
, to reconstruct an instance,the original instance that is wrapped by the node (loaded from the
database), can be recovered through the
obj
property.TheMSONable
class, provided by themonty
library, is designed tomake arbitrary classes easily JSON-serializable such that they can be
serialized and, for example, stored in a database. The class is used
extensively in
pymatgen
and many of their classes are MSONable. Sincethe
pymatgen
classes are used often in the current AiiDA community, itwould be nice if these objects can be easily stored in the provenance
graph.
The newMsonableData
data plugin, wraps any instance of aMSONable
class. When constructed, it calls
as_dict
which by definition returnsa JSON-serialized version of the object, which we can therefore store in
the nodes attributes and allow it to be stored. The
obj
property willreturn the original MSONable instance as it deserializes it from the
serialized version stored in the attributes.