-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve parsing capabilities of CifData class #1257
Improve parsing capabilities of CifData class #1257
Conversation
2707090
to
4ee926b
Compare
The CifParser of pymatgen allows to set the parameter site_tolerance which will be used to determine whether two sites overlap and if created by symmetry they should be merged as one. By default this value is 1E-4 in v4.5.3, which is rather tight and often leads to a created structure with overlapping atoms. To prevent this one should be able to specify this parameter, however, unfortunately it is not supported by the other library 'ase' that AiiDA supports internally for structure generation from CifData nodes.
The CifParser of pymatgen will check the atomic occupations of the parsed cif and if they exceed unity, a warning will be emitted and the parsing of the structure fails. It is possible to give a certain tolerance to allow invalid occupations through the occupation_tolerance of the CifParser constructor. Any occupations between one and this tolerance will be rounded to one. Pymatgen does not distinguish between various parsing errors, any failure just raises a ValueError. However, we would like to distinguish between well defined errors such as the incorrect occupations case. As a solution, if the parsing fails, we attempt a second time, this time setting the occupation tolerance to an unrealistic high value. If the parsing now succeeds, the original failure was due to the occupations being exceeded and we raise a specific error InvalidOccupationsError. If the second parse attempt also failed, there was some unknown parsing error and we simply raise a ValueError.
Cif files, for example from COD, can not define any atomic sites or have atomic species that are not elemental, such as Deuterium, Tritium or even whole water molecules. Cif files without atomic sites or with these non elemental species cannot be converted to AiiDA StructureData objects. These new CifData properties make it easy to check whether the cif file has these issues, before we attempt to parse a structure from it using either ASE or pymatgen.
4ee926b
to
098d500
Compare
aiida/orm/data/cif.py
Outdated
@@ -93,7 +103,7 @@ def symop_string_from_symop_matrix_tr(matrix, tr=(0, 0, 0), eps=0): | |||
|
|||
|
|||
@optional_inline | |||
def _get_aiida_structure_ase_inline(cif, parameters): | |||
def _get_aiida_structure_ase_inline(cif, parameters=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defaults with non database objects such as None (here, parameters=None) should be avoided in optional_inline calculations...
aiida/orm/data/cif.py
Outdated
return {'structure': StructureData(ase=cif.get_ase(**kwargs))} | ||
|
||
|
||
@optional_inline | ||
def _get_aiida_structure_pymatgen_inline(cif=None, parameters=None): | ||
def _get_aiida_structure_pymatgen_inline(cif, parameters=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above (on parameters=None)
aiida/orm/data/cif.py
Outdated
raise ValueError('pymatgen failed to provide a structure from the cif file') | ||
else: | ||
# If it now succeeds, non-unity occupancies were the culprit | ||
raise InvalidOccupationsError('detected atomic sites with an occupation number exceeding the tolerance') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a more explicit error message like
"detected atomic sites with a total occupation number higher than 1"
aiida/orm/data/cif.py
Outdated
""" | ||
from aiida.common.constants import elements | ||
|
||
known_species = [element['symbol'] for index, element in elements.items()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
known_species = [element['symbol'] for element in elements.values()]
I know, I'm picky today...
aiida/orm/data/cif.py
Outdated
known_species = [element['symbol'] for index, element in elements.items()] | ||
|
||
for formula in self.get_formulae(): | ||
species = parse_formula(formula).keys() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that
parse_formula('D1 H Li3 Te2.5 H3')
(H is repeated)
returns
{'D': 1, 'H': 3, 'Li': 3, 'Te': 2.5}
which is ok here (we don't care about how many H atoms there are), but maybe not in general.
These converter functions are optional inlines, which means that if they are called with `store=True`, AiiDA will try to store the input arguments, but if the `parameters` were not specified, it will try to store the `None` default, which of course fails. The solution is to use `**kwargs`. With this a user is not required to specify an empty dictionary or `ParameterData` node, and the function can be used both as inline and as regular function.
Tests have shown that `pymatgen` is a lot more robust in parsing cif files and complains about files that have nonsensical data such as atomic occupations that exceed unity. We therefore swap the default converter engine from `ase` to `pymatgen`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beautiful.
Fixes #1256
This PR addresses a couple of points and improvements to the parsing of
CifData
filesinto
StructureData
objects. It implements two new properties forCifData
:has_atomic_sites
has_unknown_species
The former will check if there any atomic sites defined at all and the latter will check whether
any species in the formulae are unknown, which is to say, they are not elements listed in the
aiida.common.constants.elements
dictionary. In both cases, it will be impossible to createa
StructureData
node out of thoseCifData
objects and so one does not even have to tryto parse them with either
ase
orpymatgen
.Additionally, we improve the automatic structure converter
_get_aiida_structure_pymatgen_inline
that uses
pymatgen
to parse the cif content to return apymatgen
structure object, which isthen used to create a
StructureData
node. There are two improvements here:The
CifParser
ofpymatgen
allows to set the parametersite_tolerance
which will be usedto determine whether two sites overlap and if created by symmetry they should be merged as one.
By default this value is
1E-4
inv4.5.3
, which is rather tight and often leads to a created structurewith overlapping atoms. To prevent this one should be able to specify this parameter, however,
unfortunately it is not supported by the other library
ase
that AiiDA supports internally for structuregeneration from
CifData
nodes.The
CifParser
ofpymatgen
will check the atomic occupations of the parsed cif and if theyexceed unity, a warning will be emitted and the parsing of the structure fails. It is possible to give
a certain tolerance to allow invalid occupations through the
occupation_tolerance
of theCifParser
constructor. Any occupations between one and this tolerance will be rounded to one.
Pymatgen
doesnot distinguish between various parsing errors, any failure just raises a
ValueError
. However, wewould like to distinguish between well defined errors such as the incorrect occupations case.
As a solution, if the parsing fails, we attempt a second time, this time setting the occupation tolerance
to an unrealistic high value. If the parsing now succeeds, the original failure was due to the occupations
being exceeded and we raise a specific error
InvalidOccupationsError
. If the second parse attemptalso failed, there was some unknown parsing error and we simply raise a
ValueError
.