Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chemical identity comparison for Molecules and Species #1329

Merged
merged 20 commits into from
May 22, 2019

Conversation

mliu49
Copy link
Contributor

@mliu49 mliu49 commented Mar 22, 2018

Improvements:

  • Molecules can be instantiated with SMILES or InChI
  • New read-only properties for Molecules and Species (fingerprint, InChI, SMILES, multiplicity)
  • New is_same method checks "chemical identity" via InChI and multiplicity (better name suggestions welcome)
  • Use is_same in checkForExistingSpecies() instead of isomorphism. Testing has shown that this is much faster and also more accurate.

April 2019 Revision

I have finally refactored and updated this branch. There are a number of changes from the prior version, but I decided not to make a new PR.

I decided to abandon the idea of creating a new is_same method, since it seemed to add unnecessary complexity, especially since people are already familiar with using isIsomorphic and isIdentical.

Major changes

  • Introduce strict argument to almost all isomorphism related methods. The default is strict=True which is the same as current behavior. Specifying strict=False ignores electrons when comparing atoms and bonds, meaning bond orders, radicals, and lone pairs are not considered at all. The result is a resonance structure independent comparison.
  • Use strict=False comparisons for:
    • Checking if species exist
    • Reaction degeneracy determination
    • Product checking during reaction generation
  • This allows us to do these tasks without extra resonance structure generation. While isomorphism with strict=False is slower than strict isomorphism (due to more potential matches), it is more than compensated for by eliminating resonance structure generation.
  • This also eliminates the need for keeping non-representative resonance structures in the molecule list for Species. Most of that code was removed from checkForExistingSpecies, though non-reactive Molecules can still be created.

Quality of life changes

Bugfixes

  • In ensure_independent_atom_ids, make sure we select a reactive molecule when generating resonance structures
  • In ensure_independent_atom_ids, pass keep_isomorphic=True to ensure_species for accurate degeneracy determination

@mliu49 mliu49 added the Status: Ready for Review PR is complete and ready to be reviewed label Mar 22, 2018
@codecov
Copy link

codecov bot commented Mar 22, 2018

Codecov Report

Merging #1329 into master will decrease coverage by 0.12%.
The diff coverage is 19.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1329      +/-   ##
==========================================
- Coverage   41.67%   41.54%   -0.13%     
==========================================
  Files         176      176              
  Lines       29102    29103       +1     
  Branches     5988     5975      -13     
==========================================
- Hits        12127    12090      -37     
- Misses      16138    16181      +43     
+ Partials      837      832       -5
Impacted Files Coverage Δ
rmgpy/molecule/translator.py 0% <ø> (ø) ⬆️
rmgpy/molecule/molecule.py 0% <0%> (ø) ⬆️
rmgpy/reaction.py 0% <0%> (ø) ⬆️
rmgpy/molecule/group.py 0% <0%> (ø) ⬆️
rmgpy/species.py 0% <0%> (ø) ⬆️
rmgpy/data/kinetics/common.py 70.08% <100%> (ø) ⬆️
rmgpy/rmg/model.py 38.88% <100%> (-1.23%) ⬇️
rmgpy/data/kinetics/database.py 46.65% <33.33%> (ø) ⬆️
rmgpy/tools/isotopes.py 50.1% <66.66%> (ø) ⬆️
rmgpy/data/kinetics/family.py 52.66% <71.42%> (-0.32%) ⬇️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9a07f43...b3ff5c5. Read the comment docs.

@mliu49 mliu49 force-pushed the species_identity branch from 0e1c3c4 to c3e65d9 Compare April 10, 2018 18:54
@mliu49 mliu49 force-pushed the species_identity branch from c3e65d9 to d124e5c Compare May 9, 2018 20:14
@mliu49 mliu49 force-pushed the species_identity branch from d124e5c to 8cef586 Compare May 29, 2018 22:54
@mliu49 mliu49 force-pushed the species_identity branch 2 times, most recently from c46bd06 to f9d2f17 Compare June 19, 2018 02:44
Copy link
Contributor

@mjohnson541 mjohnson541 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this look pretty good except for the naming issues we already talked about, I just have a few questions and comments.

@@ -136,6 +136,10 @@ cdef class Molecule(Graph):
cdef str _inchi
cdef str _smiles

cpdef bint is_equal(self, Molecule other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this function come from?

if spec is not None and spec.is_same(molecule):
self.speciesCache.pop(i)
self.speciesCache.insert(0, spec)
# Check whether the resonance form is already included in the species
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't all of the resonance forms we care about for the species already be generated? Can we avoid the isomorphism check at the end?

@@ -412,24 +412,30 @@ def matchesSpecies(self, reactants, products=None):
else:
return False

def isIsomorphic(self, other, eitherDirection=True, checkIdentical = False,
checkOnlyLabel = False, checkTemplateRxnProducts=False):
def isIsomorphic(self, other, eitherDirection=True, level=1, checkTemplateRxnProducts=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really should rename this to something more general...like is_same, but I'm not sure how much of a pain that would be.

r1 = self.makeReaction('A=B')
self.assertTrue(r1.isIsomorphic(self.makeReaction('a=B')))
self.assertTrue(r1.isIsomorphic(self.makeReaction('b=A')))
self.assertFalse(r1.isIsomorphic(self.makeReaction('B=a'),eitherDirection=False))
self.assertFalse(r1.isIsomorphic(self.makeReaction('B=a'), eitherDirection=False))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These whitespace changes should be in a separate commit

@@ -394,18 +394,20 @@ def matchesSpecies(self, reactants, products=None):
Compares the provided reactants and products against the reactants
and products of this reaction. Both directions are checked.

Uses level 2 of `isomorphis_species_lists`, which compares InChI.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

del mol_elementcount[key]
# If we didn't find anything, that would mean the result is wrong
conditions.append(found)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the check below (inchi_elementcount == mol_elementcount) works because util.retrieveElementCount(identifier) doesn't account for isotopes and the above loop moves isotopes to there corresponding element?

@@ -1841,17 +1841,13 @@ def generate_products_and_reactions(order):

products0 = reaction.products[:] if forward else reaction.reactants[:]

# For aromatics, generate aromatic resonance structures to accurately identify isomorphic species
if prod_resonance:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can we get rid of this stuff?

@mliu49 mliu49 added Status: Under Revision Currently under revision following reviewer comments and removed Status: Ready for Review PR is complete and ready to be reviewed labels Jun 28, 2018
@mliu49
Copy link
Contributor Author

mliu49 commented Aug 22, 2018

This PR is out of date now. The commits related to InChI generation for isotopes has been merged via #1453. The species comparison changes will be refactored and a new PR will be opened once that's ready.

@mliu49 mliu49 force-pushed the species_identity branch from f9d2f17 to 694c60e Compare April 4, 2019 19:53
@mliu49 mliu49 dismissed mjohnson541’s stale review April 4, 2019 19:55

Major changes have been made to this PR

@mliu49 mliu49 force-pushed the species_identity branch from 694c60e to fc815fc Compare April 4, 2019 22:33
@mliu49 mliu49 added Complexity: Medium Module: Molecule Status: Ready for Review PR is complete and ready to be reviewed Topic: Performance and removed Status: Under Revision Currently under revision following reviewer comments labels Apr 4, 2019
@mliu49
Copy link
Contributor Author

mliu49 commented Apr 4, 2019

Please see the updated initial comment for summary and explanation of the new changes.

Performance comparison with minimal example:

  • master: 430 seconds
  • species_identity: 370 seconds

This is a 14% reduction in runtime. I expect more significant differences for models with larger species and/or more resonance structures.

The model core is identical on these two branches. The edge on this branch has 7 fewer species, most likely due to species being identified as duplicates using the non-strict comparison. I think this is a positive change though, since this is a more robust method of identifying whether species are truly the same (as compared to generating resonance structures).

RMG-tests results:

test time benchmark time new memory benchmark memory new
aromatics 00:00:01:52 00:00:01:26 599.58 579.04
nitrogen 00:00:02:03 00:00:01:52 684.27 667.16
sulfur 00:00:01:41 00:00:01:24 675.57 657.08
oxidation 00:00:02:43 00:00:02:18 559.37 549.97
liquid oxidation 00:00:03:58 00:00:03:34 715.51 692.46
eg1 00:00:01:33 00:00:01:21 684.74 659.29

@mliu49 mliu49 force-pushed the species_identity branch 2 times, most recently from 01aaf8f to e0ce75d Compare April 5, 2019 21:35
@mliu49 mliu49 force-pushed the species_identity branch from e0ce75d to 719f66b Compare April 26, 2019 19:47
@mliu49 mliu49 mentioned this pull request Apr 29, 2019
5 tasks
@mliu49 mliu49 force-pushed the species_identity branch from 719f66b to f549004 Compare May 7, 2019 18:11
mliu49 added 3 commits May 17, 2019 18:17
More pythonic and prevents extra getter/setter functions from
cluttering the namespace.
Allow instantiation using either SMILES or InChI
Change SMILES and InChI to all lowercase for easier typing
Make smiles and inchi read only properties that generate and
save the respective identifiers to private attributes
mliu49 added 17 commits May 17, 2019 18:17
For strict=False, ignores electrons (i.e. bond order, radicals, lone pairs, etc.)
Only implemented for Molecule isomorphism, exceptions raised in other cases
For atoms, strict argument is added to the `equivalent` method
For bonds, strict argument is handled within vf2
Because its functionality is replaced by the strict argument
Also add strict argument which is passed to isIsomorphic
Since this option allows us to compare resonance structures directly,
we don't need to worry about nonreactive resonance structures or
aromatic resonance structures.

Also, the return value of checkForExistingSpecies is reduced to
a single value, since returning whether or not the species
was found is redundant with returning the species itself.
No need to ensure species - the reactions will always contain
Molecules at this point, so either Molecule or Species objects
are acceptable for the product list to compare with

Aromatic resonance structure generation can be omitted by using
the strict=False option, which neglects electrons during the
isomorphism comparison
Remove checks that the product list is converted in place to
Species objects, since that step was removed
Similar to strict option for isomorphism, setting strict=False
ignores electrons and bond orders for the comparison, which
is useful for comparing different resonance structures
Of Molecule, Species, and Reaction, which is passed to Graph.isMappingValid
Instead, use the strict=False option to ignore resonance issues

Change ensure_species to not generate resonance structures by default

Add call to ensure_independent_atom_ids in addReverseAttribute
because the changes to resonance structure generation means that
products may not necessarily have proper atom IDs
In certain situations, we would call ensure_species with
keep_isomorphic=False, which would ultimately result in wrong
degeneracy values.

This fixes that oversight and makes sure that keep_isomorphic=True
when generating reactions.

A unit test is also added.
@mliu49 mliu49 force-pushed the species_identity branch from f549004 to b3ff5c5 Compare May 17, 2019 22:17
Copy link
Contributor

@mjohnson541 mjohnson541 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, I just have a few noted questions.

"""InChI string representation of this species. Read-only."""
if self._inchi is None:
if self.molecule:
self._inchi = self.molecule[0].InChI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is self.molecule[0].InChI always not None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Molecule.InChI is a property which will set the InChI if it's not already there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@@ -1321,28 +1326,33 @@ def get_mean_sigma_and_epsilon(self, reverse=False):
mean_epsilons = reduce((lambda x, y: x * y), epsilons) ** (1 / len(epsilons))
return mean_sigmas, mean_epsilons

def isomorphic_species_lists(list1, list2, check_identical=False, only_check_label=False, generateInitialMap=False):

def same_species_lists(list1, list2, check_identical=False, only_check_label=False, generate_initial_map=False, strict=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're renaming this already can we either generalize it so it works for N species or rename it in a less general way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean since it can only compare list of up to 3 species? Do you any name suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isomorphic_species_trios?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isomorphic_species_lists_le3, isomorphic_species_lists_lt4?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's not worth worrying about the confusion that much?

@mjohnson541 mjohnson541 merged commit cb77e28 into master May 22, 2019
@mjohnson541 mjohnson541 deleted the species_identity branch May 22, 2019 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants