Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parsing capabilities of CifData class #1257

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Mar 10, 2018

Fixes #1256

This PR addresses a couple of points and improvements to the parsing of CifData files
into StructureData objects. It implements two new properties for CifData:

  • has_atomic_sites
  • has_unknown_species

The former will check if there any atomic sites defined at all and the latter will check whether
any species in the formulae are unknown, which is to say, they are not elements listed in the
aiida.common.constants.elements dictionary. In both cases, it will be impossible to create
a StructureData node out of those CifData objects and so one does not even have to try
to parse them with either ase or pymatgen.

Additionally, we improve the automatic structure converter _get_aiida_structure_pymatgen_inline
that uses pymatgen to parse the cif content to return a pymatgen structure object, which is
then used to create a StructureData node. There are two improvements here:

  • The CifParser of pymatgen allows to set the parameter site_tolerance which will be used
    to determine whether two sites overlap and if created by symmetry they should be merged as one.
    By default this value is 1E-4 in v4.5.3, which is rather tight and often leads to a created structure
    with overlapping atoms. To prevent this one should be able to specify this parameter, however,
    unfortunately it is not supported by the other library ase that AiiDA supports internally for structure
    generation from CifData nodes.

  • The CifParser of pymatgen will check the atomic occupations of the parsed cif and if they
    exceed unity, a warning will be emitted and the parsing of the structure fails. It is possible to give
    a certain tolerance to allow invalid occupations through the occupation_tolerance of the CifParser
    constructor. Any occupations between one and this tolerance will be rounded to one. Pymatgen does
    not distinguish between various parsing errors, any failure just raises a ValueError. However, we
    would like to distinguish between well defined errors such as the incorrect occupations case.
    As a solution, if the parsing fails, we attempt a second time, this time setting the occupation tolerance
    to an unrealistic high value. If the parsing now succeeds, the original failure was due to the occupations
    being exceeded and we raise a specific error InvalidOccupationsError. If the second parse attempt
    also failed, there was some unknown parsing error and we simply raise a ValueError.

@sphuber sphuber requested a review from nmounet March 10, 2018 09:58
@sphuber sphuber force-pushed the fix_1256_cif_structure_site_tolerance branch 2 times, most recently from 2707090 to 4ee926b Compare March 12, 2018 16:19
@sphuber sphuber changed the title Add support for site_tolerance in CifData._get_aiida_structure Improve parsing capabilities of CifData class Mar 12, 2018
sphuber added 3 commits March 12, 2018 17:28
The CifParser of pymatgen allows to set the parameter site_tolerance
which will be used to determine whether two sites overlap and if
created by symmetry they should be merged as one. By default this
value is 1E-4 in v4.5.3, which is rather tight and often leads to
a created structure with overlapping atoms. To prevent this one
should be able to specify this parameter, however, unfortunately
it is not supported by the other library 'ase' that AiiDA supports
internally for structure generation from CifData nodes.
The CifParser of pymatgen will check the atomic occupations of the
parsed cif and if they exceed unity, a warning will be emitted and
the parsing of the structure fails. It is possible to give a certain
tolerance to allow invalid occupations through the occupation_tolerance
of the CifParser constructor. Any occupations between one and this
tolerance will be rounded to one.

Pymatgen does not distinguish between various parsing errors, any
failure just raises a ValueError. However, we would like to distinguish
between well defined errors such as the incorrect occupations case.
As a solution, if the parsing fails, we attempt a second time, this time
setting the occupation tolerance to an unrealistic high value. If the
parsing now succeeds, the original failure was due to the occupations
being exceeded and we raise a specific error InvalidOccupationsError.
If the second parse attempt also failed, there was some unknown parsing
error and we simply raise a ValueError.
Cif files, for example from COD, can not define any atomic sites or
have atomic species that are not elemental, such as Deuterium, Tritium
or even whole water molecules. Cif files without atomic sites or with
these non elemental species cannot be converted to AiiDA StructureData
objects. These new CifData properties make it easy to check whether
the cif file has these issues, before we attempt to parse a structure
from it using either ASE or pymatgen.
@sphuber sphuber force-pushed the fix_1256_cif_structure_site_tolerance branch from 4ee926b to 098d500 Compare March 12, 2018 16:28
@@ -93,7 +103,7 @@ def symop_string_from_symop_matrix_tr(matrix, tr=(0, 0, 0), eps=0):


@optional_inline
def _get_aiida_structure_ase_inline(cif, parameters):
def _get_aiida_structure_ase_inline(cif, parameters=None):
Copy link

@nmounet nmounet Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaults with non database objects such as None (here, parameters=None) should be avoided in optional_inline calculations...

return {'structure': StructureData(ase=cif.get_ase(**kwargs))}


@optional_inline
def _get_aiida_structure_pymatgen_inline(cif=None, parameters=None):
def _get_aiida_structure_pymatgen_inline(cif, parameters=None):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above (on parameters=None)

raise ValueError('pymatgen failed to provide a structure from the cif file')
else:
# If it now succeeds, non-unity occupancies were the culprit
raise InvalidOccupationsError('detected atomic sites with an occupation number exceeding the tolerance')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a more explicit error message like
"detected atomic sites with a total occupation number higher than 1"

"""
from aiida.common.constants import elements

known_species = [element['symbol'] for index, element in elements.items()]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

known_species = [element['symbol'] for element in elements.values()]

I know, I'm picky today...

known_species = [element['symbol'] for index, element in elements.items()]

for formula in self.get_formulae():
species = parse_formula(formula).keys()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that
parse_formula('D1 H Li3 Te2.5 H3')
(H is repeated)
returns
{'D': 1, 'H': 3, 'Li': 3, 'Te': 2.5}
which is ok here (we don't care about how many H atoms there are), but maybe not in general.

sphuber added 2 commits March 12, 2018 18:56
These converter functions are optional inlines, which means that
if they are called with `store=True`, AiiDA will try to store the
input arguments, but if the `parameters` were not specified, it
will try to store the `None` default, which of course fails.
The solution is to use `**kwargs`. With this a user is not required
to specify an empty dictionary or `ParameterData` node, and the
function can be used both as inline and as regular function.
Tests have shown that `pymatgen` is a lot more robust in parsing cif
files and complains about files that have nonsensical data such as
atomic occupations that exceed unity. We therefore swap the default
converter engine from `ase` to `pymatgen`.
Copy link

@nmounet nmounet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful.

@nmounet nmounet merged commit b8b8c61 into aiidateam:workflows Mar 12, 2018
@sphuber sphuber deleted the fix_1256_cif_structure_site_tolerance branch March 12, 2018 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants