Improve parsing capabilities of CifData class #1257

sphuber · 2018-03-10T09:58:57Z

This PR makes several improvements to the parsing of CifData files
into StructureData objects. It implements two new properties for CifData:

has_atomic_sites
has_unknown_species

The former checks whether any atomic sites are defined at all, and the latter checks whether
any species in the formulae are unknown, which is to say, they are not elements listed in the
aiida.common.constants.elements dictionary. In both cases, it is impossible to create
a StructureData node out of those CifData objects, so one does not even have to try
to parse them with either ase or pymatgen.
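The unknown-species check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ELEMENTS` is a small stand-in for the aiida.common.constants.elements dictionary, and `species_in_formula` is a hypothetical helper that takes the place of AiiDA's formula parsing.

```python
import re

# Stand-in for aiida.common.constants.elements (assumption: keyed by atomic
# number, each value a dict with a 'symbol' entry). Only a tiny subset here.
ELEMENTS = {
    1: {'symbol': 'H'},
    3: {'symbol': 'Li'},
    8: {'symbol': 'O'},
    52: {'symbol': 'Te'},
}


def species_in_formula(formula):
    """Hypothetical helper: return the set of species symbols in e.g. 'H2 O'."""
    return set(re.findall(r'[A-Z][a-z]*', formula))


def has_unknown_species(formulae, elements=ELEMENTS):
    """Return True if any formula contains a symbol that is not a known element."""
    known = {element['symbol'] for element in elements.values()}
    return any(species_in_formula(formula) - known for formula in formulae)


print(has_unknown_species(['H2 O']))  # water: all species known
print(has_unknown_species(['D2 O']))  # heavy water: 'D' is not an element symbol
```

A caller would check this (and the atomic-sites property) first and skip the ase or pymatgen conversion entirely when it returns True.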

Additionally, we improve the automatic structure converter _get_aiida_structure_pymatgen_inline
that uses pymatgen to parse the cif content to return a pymatgen structure object, which is
then used to create a StructureData node. There are two improvements here:

The CifParser of pymatgen accepts the parameter site_tolerance, which is used to determine
whether two sites overlap and, if they were created by symmetry, whether they should be merged into one.
By default this value is 1E-4 in v4.5.3, which is rather tight and often leads to a structure
with overlapping atoms. To prevent this, one should be able to specify this parameter. Unfortunately,
it is not supported by ase, the other library that AiiDA supports internally for structure
generation from CifData nodes.

The CifParser of pymatgen checks the atomic occupations of the parsed cif and, if they
exceed unity, emits a warning and the parsing of the structure fails. It is possible to let
invalid occupations through by giving a tolerance via the occupancy_tolerance argument of the CifParser
constructor. Any occupations between one and this tolerance will be rounded to one. Pymatgen does
not distinguish between various parsing errors; any failure just raises a ValueError. However, we
would like to distinguish well-defined errors such as the incorrect occupations case.
As a solution, if the parsing fails, we attempt it a second time, this time setting the occupancy tolerance
to an unrealistically high value. If the parsing now succeeds, the original failure was due to the occupations
being exceeded and we raise a specific error, InvalidOccupationsError. If the second parse attempt
also fails, there was some unknown parsing error and we simply raise a ValueError.
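The two-pass strategy just described can be sketched as below. This is only an outline under stated assumptions: `parse` is a hypothetical callable standing in for "build a pymatgen CifParser with the given tolerance and extract a structure", and the keyword name `occupancy_tolerance` as well as the fallback value are assumptions made for this example, not taken from the diff.

```python
class InvalidOccupationsError(Exception):
    """Raised when parsing fails only because site occupations exceed unity."""


def parse_with_occupancy_check(parse, occupancy_tolerance=1.0, fallback_tolerance=1e10):
    """Parse once normally; on failure, retry with a huge tolerance to classify the error.

    `parse` is a hypothetical callable that accepts `occupancy_tolerance`,
    returns a structure, and raises ValueError on any parsing failure.
    """
    try:
        return parse(occupancy_tolerance=occupancy_tolerance)
    except ValueError:
        pass  # fall through to the diagnostic second attempt

    try:
        parse(occupancy_tolerance=fallback_tolerance)
    except ValueError:
        # Still failing: the problem is something other than the occupations
        raise ValueError('pymatgen failed to provide a structure from the cif file')

    # If it now succeeds, non-unity occupancies were the culprit
    raise InvalidOccupationsError('detected atomic sites with an occupation number exceeding the tolerance')
```

Note that the structure from the second attempt is deliberately discarded: it is only used to diagnose why the first attempt failed.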

Cif files, for example from COD, may define no atomic sites at all, or may contain atomic species that are not elemental, such as Deuterium, Tritium or even whole water molecules. Cif files without atomic sites or with these non-elemental species cannot be converted to AiiDA StructureData objects. These new CifData properties make it easy to check whether the cif file has these issues before we attempt to parse a structure from it using either ASE or pymatgen.

nmounet · 2018-03-12T16:39:01Z

aiida/orm/data/cif.py

@@ -93,7 +103,7 @@ def symop_string_from_symop_matrix_tr(matrix, tr=(0, 0, 0), eps=0):


 @optional_inline
-def _get_aiida_structure_ase_inline(cif, parameters):
+def _get_aiida_structure_ase_inline(cif, parameters=None):


defaults with non database objects such as None (here, parameters=None) should be avoided in optional_inline calculations...

nmounet · 2018-03-12T16:41:41Z

aiida/orm/data/cif.py

    return {'structure': StructureData(ase=cif.get_ase(**kwargs))}


 @optional_inline
-def _get_aiida_structure_pymatgen_inline(cif=None, parameters=None):
+def _get_aiida_structure_pymatgen_inline(cif, parameters=None):


same comment as above (on parameters=None)

nmounet · 2018-03-12T16:51:42Z

aiida/orm/data/cif.py

+            raise ValueError('pymatgen failed to provide a structure from the cif file')
+        else:
+            # If it now succeeds, non-unity occupancies were the culprit
+            raise InvalidOccupationsError('detected atomic sites with an occupation number exceeding the tolerance')


maybe a more explicit error message like
"detected atomic sites with a total occupation number higher than 1"

nmounet · 2018-03-12T17:05:31Z

aiida/orm/data/cif.py

+        """
+        from aiida.common.constants import elements
+
+        known_species = [element['symbol'] for index, element in elements.items()]


known_species = [element['symbol'] for element in elements.values()]

I know, I'm picky today...

nmounet · 2018-03-12T17:10:05Z

aiida/orm/data/cif.py

+        known_species = [element['symbol'] for index, element in elements.items()]
+
+        for formula in self.get_formulae():
+            species = parse_formula(formula).keys()


note that
parse_formula('D1 H Li3 Te2.5 H3')
(H is repeated)
returns
{'D': 1, 'H': 3, 'Li': 3, 'Te': 2.5}
which is ok here (we don't care about how many H atoms there are), but maybe not in general.
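For contexts where the counts do matter, a summing variant could look like the sketch below. `parse_formula_summed` is a hypothetical function written for this illustration; it is not AiiDA's `parse_formula` and only handles simple space-separated tokens such as `Te2.5`.

```python
import re
from collections import defaultdict

# One token is an element-like symbol followed by an optional (possibly decimal) count
TOKEN = re.compile(r'^([A-Z][a-z]*)(\d+(?:\.\d+)?)?$')


def parse_formula_summed(formula):
    """Hypothetical variant that accumulates counts of repeated species."""
    counts = defaultdict(float)
    for token in formula.split():
        match = TOKEN.match(token)
        if match is None:
            raise ValueError('cannot parse formula token: {}'.format(token))
        symbol, number = match.groups()
        # A missing count means one atom; repeated symbols are added up
        counts[symbol] += float(number) if number else 1.0
    return dict(counts)


print(parse_formula_summed('D1 H Li3 Te2.5 H3'))
# H appears twice (1 + 3), so this variant reports H: 4.0 instead of H: 3
```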

These converter functions are optional inlines, which means that if they are called with `store=True`, AiiDA will try to store the input arguments. If `parameters` was not specified, it will try to store the `None` default, which of course fails. The solution is to use `**kwargs`. With this, a user is not required to specify an empty dictionary or `ParameterData` node, and the function can be used both as an inline and as a regular function.
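The effect of the `**kwargs` signature can be illustrated with a plain-Python sketch. `convert_inline` is a hypothetical stand-in for the converter functions; the `optional_inline` decorator and the actual storing step are left out, and the returned `inputs` dictionary only mimics "what AiiDA would have to store".

```python
def convert_inline(cif, **kwargs):
    """Hypothetical converter: `parameters` is optional and has no None default."""
    # Only arguments the caller actually passed count as inputs to be stored
    inputs = {'cif': cif}
    if 'parameters' in kwargs:
        inputs['parameters'] = kwargs['parameters']

    # Inside the function, fall back to an empty set of options
    options = dict(kwargs.get('parameters', {}))
    return inputs, options


# Called without parameters: there is no None placeholder among the inputs
print(convert_inline('cif-node'))

# Called with parameters: they appear as an input and as parser options
print(convert_inline('cif-node', parameters={'site_tolerance': 1e-3}))
```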

Tests have shown that `pymatgen` is a lot more robust in parsing cif files and complains about files that have nonsensical data such as atomic occupations that exceed unity. We therefore swap the default converter engine from `ase` to `pymatgen`.

nmounet

Beautiful.

sphuber requested a review from nmounet March 10, 2018 09:58

sphuber force-pushed the fix_1256_cif_structure_site_tolerance branch 2 times, most recently from 2707090 to 4ee926b Compare March 12, 2018 16:19

sphuber changed the title ~~Add support for site_tolerance in CifData._get_aiida_structure~~ Improve parsing capabilities of CifData class Mar 12, 2018

sphuber added 3 commits March 12, 2018 17:28

sphuber force-pushed the fix_1256_cif_structure_site_tolerance branch from 4ee926b to 098d500 Compare March 12, 2018 16:28

nmounet reviewed Mar 12, 2018

sphuber added 2 commits March 12, 2018 18:56

nmounet approved these changes Mar 12, 2018

nmounet merged commit b8b8c61 into aiidateam:workflows Mar 12, 2018

sphuber deleted the fix_1256_cif_structure_site_tolerance branch March 12, 2018 18:49

sphuber mentioned this pull request Mar 13, 2018

Add support for site_tolerance in CifData pymatgen converter #1256

Closed
