Automatic Tree Generation Part 3 #1461

mjohnson541 · 2018-08-31T19:36:49Z

This pull request adds functionality to automatically generate trees from any family provided the top nodes are not logic nodes. It adds the functionality to automatically generate rules using the set of training reactions that match each node weighted based on the uncertainties of the reactions. It also adds functionality for testing these trees and comparing them with the original RMG trees. Bugfixes are also a very substantial portion of this PR.

The algorithm has undergone significant restructuring since part 2, however, the general principles remain the same. New nodes are generated by determining which of a generated set of potential extension nodes minimize the uncertainty in estimation using the mean after the split. After the tree is generated they are regularized. By default this means that all nodes are made as specific as the reactions they match. Then all of the reactions matching each node are fit to a modified Arrhenius expression that uses a Blower-Masel interpolant for the Ea.

This does not yet provide the functionality to use the tree in RMG yet or uncertainty analysis, which will be included in atg4.

So far all of the new trees I've tested have been drastic improvements using the leave-one-out test at 2 sigma errors Substitution_O (N=127) goes from 16.7 kcal/mol to 10.7 kcal/mol (about a factor of 20 improvement) and ketoenol (N=8) goes from 28.9 kcal/mol to 15.0 kcal/mol (about a factor of 1000 improvement) (all at T=1000K).

codecov · 2018-08-31T20:13:54Z

Codecov Report

Merging #1461 into master will decrease coverage by 0.18%.
The diff coverage is 45.08%.

@@            Coverage Diff             @@
##           master    #1461      +/-   ##
==========================================
- Coverage   41.85%   41.67%   -0.19%     
==========================================
  Files         165      166       +1     
  Lines       28014    28452     +438     
  Branches     5713     5855     +142     
==========================================
+ Hits        11726    11858     +132     
- Misses      15500    15773     +273     
- Partials      788      821      +33

Impacted Files	Coverage Δ
rmgpy/reaction.py	`0% <0%> (ø)`	⬆️
rmgpy/molecule/molecule.py	`0% <0%> (ø)`	⬆️
rmgpy/molecule/group.py	`0% <0%> (ø)`	⬆️
rmgpy/species.py	`0% <0%> (ø)`	⬆️
rmgpy/molecule/element.py	`0% <0%> (ø)`	⬆️
rmgpy/kinetics/uncertainies.py	`100% <100%> (ø)`
rmgpy/data/kinetics/common.py	`70.08% <100%> (+1.84%)`	⬆️
rmgpy/data/kinetics/family.py	`55.33% <53.67%> (-3.23%)`	⬇️
rmgpy/data/kinetics/groups.py	`15.15% <58.06%> (-4.85%)`	⬇️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5ab75c...23a8d44. Read the comment docs.

mjohnson541 · 2019-01-20T23:49:37Z

This passes everything locally so this should be ready for full review. None of these changes should impact how RMG currently functions since I haven't added any of the new trees to the database yet.

mjohnson541 · 2019-01-21T04:45:44Z

Pretty certain the RMG-tests failure is a travis issue, the tests pass and give identical models locally, and the child branch atg4 passes RMG-tests, for some reason on travis the liquid-oxidation example is timing out in all the restarts I've tried right now.

mliu49

I've gone through and made an initial round of comments. What do you suggest paying more attention to in terms of reviewing or testing? Is it necessary for me to understand all of the details of the algorithm?

Also, Travis is failing because the test job for liquid oxidation is timing out. The time limit is set at 4m30s. The job normally takes 3m50s on Travis, so I'm not sure why it might be taking longer.

mliu49 · 2018-10-12T18:52:06Z

rmgpy/kinetics/arrhenius.pyx

@@ -28,11 +28,14 @@
 ###############################################################################

 import numpy
+np = numpy


I would prefer if you changed the existing code to use np instead of keeping both variables.

Actually, it seems that doing this any other way causes strange problems. If I just do import numpy as np it tells me to cimport instead. If I convert import numpy as np into cimport numpy as np I get an error when it tries to access np.linalg.

mliu49 · 2019-01-25T16:57:17Z

rmgpy/molecule/molecule.py

+                        return False
+                    else:
+                        initialMap[atom] = L[0]
+            if not self.isMappingValid(other,initialMap,equivalent=False):


Should equivalent be True for full isomorphism?

mliu49 · 2019-01-25T17:02:31Z

rmgpy/molecule/molecule.py

+                    if L == []:
+                        return False
+                    else:
+                        initialMap[atom] = L[0]


If you're taking the first result anyway, would it be better to short circuit?

for a in other.atoms: if a.label == atom.label: initialMap[atom] = a break else: return False

I did some googling and I don't think the speed difference for list comprehensions overcomes the speed up from only checking half of the list so I'll change it.

mliu49 · 2019-01-25T17:03:53Z

rmgpy/reaction.py

            return object1.isIsomorphic(object2)
+        else:
+            return object1.isIsomorphic(object2,generateInitialMap=True)


Why don't you just supply _generate_initial_map directly as the argument instead of adding another if statement?

mliu49 · 2019-01-25T17:04:23Z

rmgpy/molecule/element.py

+#Bond Dissociation Energies
+#Reference: Huheey, pps. A-21 to A-34; T.L. Cottrell, "The Strengths of Chemical Bonds," 2nd ed., Butterworths, London, 1958; B. deB. Darwent, "National Standard Reference Data Series," National Bureau of Standards, No. 31, Washington, DC, 1970; S.W. Benson, J. Chem. Educ., 42, 502 (1965).
+#(C,C,1.5) was taken from an unsourced table that had similar values to those used below, should be replaced if a sourced value becomes available
+elements = ['C','N','H','O','S','Cl','Si']


What is this elements list for? Perhaps use a more descriptive name?

mliu49 · 2019-01-29T22:43:40Z

rmgpy/data/kinetics/family.py

-                    self.extendRegularization(child,inds,regs,typ)
-            else:
-                raise ValueError('regularization type of {0} is unimplemented'.format(typ))
+        R = ['H','C','N','O','Si','S'] #set of possible R elements/atoms


What about Cl and I?

I'll add Cl, but we don't have BDE's for I currently. Do we have anything for I in the database?

mliu49 · 2019-01-29T22:44:26Z

rmgpy/data/kinetics/family.py

+        templateRxnMap = self.getReactionMatches(thermoDatabase=thermoDatabase,removeDegeneracy=True,getReverse=True)
+        self.makeBMRulesFromTemplateRxnMap(templateRxnMap)
+        self.checkTree()
+        return


This return seems unnecessary?

mliu49 · 2019-01-29T22:47:04Z

rmgpy/data/kinetics/family.py

+        else:
+            return rxns
+
+    def getReactionMatches(self,rxns=None,thermoDatabase=None,removeDegeneracy=False,estimateThermo=True,fixLabels=False,exactMatchesOnly=False,getReverse=False):


Could you add spaces after commas please? Also could be broken into two lines.

mliu49 · 2019-01-30T21:24:09Z

rmgpy/molecule/group.py

            else:
                old_atom_type_str = old_atom_type[0].label

-            grps.append((grp,grpc,basename+'_'+old_atom_type_str+'->'+item.label,'atomExt',(i,)))
+            grps.append((grp,grpc,basename+'_'+str(i+1)+old_atom_type_str+'->'+item.label,'atomExt',(i,)))


Would it make sense to use > instead of -> to save a character?

Honestly, in practice the names are so long I've tried to err on the side of making it easy to read individual sections of the name (having a bit more detail) rather than the thing as a whole.

mliu49 · 2019-01-30T21:25:18Z

rmgpy/molecule/group.py

+        else:
+            atom_type_str = atom_type[0].label
+
+        grps.append((grp,grpc,basename+'_'+str(i+1)+atom_type_str+'-inRing','ringExt',(i,)))


What about using r0 or r1 like in the adjlists instead of -inRing?

Honestly, in practice the names are so long I've tried to err on the side of making it easy to read individual sections of the name (having a bit more detail) rather than the thing as a whole. In particular -inRing is a less obvious extension so I thought I'd make it clearer in the naming, I'm not opposed to shortening it, but I feel like r0 and r1 is too short.

mjohnson541 · 2019-01-31T23:42:32Z

Unless otherwise noted I've made all the changes you requested. I spent a few hours trying to squash the changes properly, but there are so many merge conflicts (between 1-8 a commit) I got to the point where I couldn't reasonably believe I'd resolved them without making any mistakes so I'd prefer to leave the changes on the end, if you aren't opposed to that.

Honestly, one way of look at this code is that I run it and it gives me a tree and I can check the accuracy of that tree via cross validation, so there aren't many real dangers to the rest of RMG, really you need to mostly just yell at me about comments and clarity so that the poor soul who has to touch this next can work out what's going on. I don't know if its worth it for you to go so far as to understand all of it in detail, but perhaps just enough that if you had to rewrite it without the original code you'd know how to start and have some idea what problems can happen.

mjohnson541 · 2019-01-31T23:52:19Z

I'm not sure about the RMG-tests timing, nothing should be impacted and #1486 runs RMG-tests perfectly fine.

mjohnson541 · 2019-02-01T17:06:14Z

Unittests issues should be fixed now.

mliu49

I made a few more comments. I'm fine with merging this, though as I mentioned, I am concerned about merge conflicts which may arise with RMG-cat. Hopefully those can be resolved easily.

Question: I think we discussed this a long time ago, but what was your reason for not sub-classing KineticsFamily with a new class for automatically generated families? It seems like it would help improve code organization and clarity.

mliu49 · 2019-04-01T17:20:38Z

rmgpy/data/kinetics/common.py

@@ -402,3 +402,21 @@ def reduce_same_reactant_degeneracy(reaction, same_reactants=None):
                        'Degeneracy of reaction {} was decreased by 50% to {} since two of the reactants '
                        'are identical'.format(reaction, reaction.degeneracy)
                    )
+
+def getAllDescendants(entry):


Should this be a method of the Entry class?

added in atg4

mliu49 · 2019-04-01T17:50:24Z

rmgpy/kinetics/arrhenius.pyx

+            comment = self.comment,
+        )
+
+    def fitToReactions(self,rxns,w0=None,family=None,Ts=[300.0,500.0,600.0,700.0,800.0,900.0,1000.0,1100.0,1200.0,1500.0]):


It's inadvisable to provide a mutable type as a default argument. I would suggest defaulting to None and setting these temperatures in the body of the method.

mliu49 · 2019-04-01T17:53:51Z

rmgpy/kinetics/arrhenius.pyx

@@ -429,6 +432,223 @@ cdef class ArrheniusEP(KineticsModel):

 ################################################################################

+cdef class ArrheniusBM(KineticsModel):


Do we need support for reversing this type of kinetics? If so, you would need to update Reaction.generateReverseRateCoefficient.

No, it should be converted to an arrhenius for a corresponding reaction

mliu49 · 2019-04-01T17:56:43Z

rmgpy/kinetics/uncertainies.py

+                  0:(14.0,'kcal/mol'),
+                  11:(14.0,'kcal/mol'),
+                  }
+rank_accuracy_map = {i:Quantity(rank_accuracy_map[i]) for i in rank_accuracy_map.keys()}


Why not {key: Quantity(value) for key, value in rank_accuracy_map.iteritems()}?

mliu49 · 2019-04-01T18:00:25Z

rmgpy/molecule/element.py

+
+BDEs = {}
+for key in BDEDict.keys():
+    q = Quantity(BDEDict[key]).value_si


Minor suggestion:

for key, value in BDEDict.iteritems(): q = Quantity(value).value_si

mliu49 · 2019-04-01T21:09:18Z

rmgpy/data/kinetics/groups.py

-            # entry is a top-level node that should be matched
+        specialCases = ['peroxyl_disproportionation','bimolec_hydroperoxide_decomposition','r_recombination']
+        if len(forwardTemplate) == 1 and len(reaction.reactants) > len(forwardTemplate) and self.label.lower().split('/')[0] not in specialCases:
+            entry = forwardTemplate[0]


Could you add a comment about how this if case is for automatically generated trees with only one tree? It took me a while to figure that out... (unless that's wrong, in which case you should definitely add a comment).

added in atg4

mliu49 · 2019-04-01T21:10:48Z

rmgpy/data/kinetics/family.py

@@ -47,12 +48,14 @@
 from rmgpy import settings
 from rmgpy.reaction import Reaction
 from rmgpy.kinetics import Arrhenius
+from rmgpy.kinetics.arrhenius import ArrheniusBM


For consistency, could you add ArrheniusBM to kinetics.__init__ and combine this import with the prior one for Arrhenius?

added in atg4

mliu49 · 2019-04-01T21:14:09Z

rmgpy/data/kinetics/family.py

+                    entry.item = grp
+                    template_reactants.append(entry)
+            except AttributeError:
+                template_reactants = template.reactants


What's the AttributeError that you're expecting here?

when entry has no attribute item...I believe this may be related to logic nodes. This section gets completely changed in atg4 to make it compatiable with automatically generated trees

mliu49 · 2019-04-01T21:20:48Z

rmgpy/data/kinetics/family.py

+            index += 1
+
+
+


Could you remove extra empty lines? Methods should be separated by 1 empty line.

done in atg4

mliu49 · 2019-04-01T21:25:58Z

.conda/meta.yaml

@@ -69,6 +69,7 @@ requirements:
    - symmetry
    - xlwt
    - dde
+    - scikit-learn


Would you mind putting this in the right place alphabetically? Also dde while you're at it :)

…enius model as a function of the enthalpy of reaction instead of the alpha of ArrheniusEP a w0 that is the average of the bond dissociation energies of the bond formed and the bond broken is used equations are from Engineering Approximations for Activation Energies in Hydrogen Transfer Reactions (Blowers and Masel 2000)

A dictionary of bond dissociation energies has been added to element.py These are used by the getBDE method of Bond in molecule.py if additional atoms are added to RMG new values will need to be added to the BDE dictionary

…y reactions bond dissociation energies are estimated using the BDE dictionary

this prevents the atom order from being disturbed before regularization

fix extension generation and tree generation tests to match changes to algorithm

The original regularization algorithm took at top down recursive approach regularizations were identified by in the top nodes and then efficiently applied down their child nodes However, this type of regularization fails when there are two atoms only distinguishable by their index. Example: Suppose the bonds to these atoms are entirely unspecified and that reactions exist where both of them are single bonds and where one is single and one is double. During regularization dimension identification the single bond choice becomes a regularization dimension for both because setting either bond to a single bond does not impact which reactions are matched. The top down regularization algorithm sees these as regularization dimensions and tried to set all of the bond orders to single for both bonds causing it to no longer match the reactions that have the double bond there. The new algorithm regularizes from the bottom up. At nodes with no children all graphically indistinguishable nodes are not regularized while (for the case of simpleRegularization) all other nodes are fully regularized. For nodes with children (in the case of simpleRegularization) full regularization is attempted, but ignored if this causes the template to not match any of its children.

…ph isomorphic

mjohnson541 · 2019-04-09T22:49:45Z

Answer to question: The reason is that the technology makes it possible to have hybrid trees...you could have a tree that has a set of fixed nodes and then is automatically generated underneath that, which makes it important to not have them be different objects.

mjohnson541 · 2019-04-09T22:50:12Z

I think this should be ready now.

mjohnson541 · 2019-04-10T21:24:00Z

Ok, I think this is ready.

mliu49 · 2019-04-10T23:38:18Z

rmgpy/data/kinetics/family.py

@@ -1795,14 +1804,27 @@ def __generateReactions(self, reactants, products=None, forward=True, prod_reson
        else:
            template = self.reverseTemplate

+        if len(reactants) > len(template.reactants): #if the family has one template and is unimolecular split template into multiple reactants


Ok, so this has been bothering me, and I would additional confirmation that it won't have any adverse effects. The motivating concern was why the deepcopy issue only affected the liquid oxidation test. So I analyzed this section again, and have a few more thoughts:

The comment says if the family has one template and is unimolecular, which doesn't seem to match with the if condition. Should it be if the family has one template and is multimolecular? This is probably just a typo though, so we don't need to dwell on it.

So then when do we trigger this if clause? The obvious case is for the AG-trees where all families have a single template. However, it can also happen if we call __generateReactions in the forward direction with the products of a family, e.g. 2 products for the unimolecular HO2_Elimination_from_PeroxyRadical family or 3 products for the Peroxyl_Disproportionation family. Both of these families (and other trimolecular families) are liquid phase families and are only included in the liquid oxidation example. Of course, the same thing would happen when calling __generateReactions in the reverse direction with the reactants of a family, e.g. 2 reactants for R_Addition_MultipleBond, which leaves me wondering about why we didn't see more of performance impact on the other tests.

Thus, it seems like this might trigger in more situations than originally expected (at least for me, idk if you considered those other cases). However, I don't think any of the existing families have disconnected graphs, so split should return the original graph, and we end up with the same template entries.

Thinking about the deepcopy again, you're creating multiple entries and assigning the split groups to each entry. However, without the deepcopy, it seems like each entry would point to the same instance, so you're simply replacing the groups of all them each time. This doesn't have any apparent effects because there aren't any groups that can be split currently, but it could have effects for atg4.

Yes

Does forward=True reaction generation generate reverse reactions? My code is based on the assumption that I know if it's being reacted in the forward or reverse direction and use the forward/reverse templates as appropriate to make it so this should only happen to trees that have a single template that represents more than one reactant. My thought was that this wasn't the case and that one of the liquid families might be written more like an atg family?

Yes

Absolutely true, fortunately It appears I've already fixed this in atg4 in a way that doesn't require a deepcopy by restructuring what objects are used. see: e59fcb6

My thought is that it should be safe to use this branch now especially since this branch can't run atg trees in RMG jobs yet. I'd like to resolve 2 so that I can fix as needed on atg4, but I think the more fundamental issue is solved on atg4 already.

It should be fine for now, but it's actually not very smart to split the templates everytime the function is called. The templates should probably be generated in the init method of the family so they don't need to be generated each function call.

Since we try generating reactions in both directions for a given set of reacting species, it should be quite common to encounter a mismatch in the number of species and templates. I was also thinking that maybe one of the liquid families was written like an atg family, but I didn't find any.

…new trees

mliu49

Thanks for removing the white-space changes!

mjohnson541 force-pushed the atg3 branch from 01839d7 to 4bef6db Compare August 31, 2018 20:13

mjohnson541 self-assigned this Aug 31, 2018

mjohnson541 force-pushed the atg3 branch from 132e70f to 4107cbd Compare September 1, 2018 18:24

mjohnson541 added Type: Feature Topic: Kinetics Epic Complexity: Very High labels Sep 1, 2018

mjohnson541 force-pushed the atg3 branch from 4107cbd to cf7aed4 Compare September 1, 2018 23:23

mjohnson541 added the Status: Ready for Review PR is complete and ready to be reviewed label Sep 1, 2018

mjohnson541 requested a review from mliu49 September 1, 2018 23:59

mjohnson541 force-pushed the atg3 branch from cf7aed4 to e18247f Compare January 20, 2019 01:17

mjohnson541 force-pushed the atg3 branch from bc472dd to bff4553 Compare January 21, 2019 01:19

mliu49 reviewed Jan 30, 2019

View reviewed changes

mjohnson541 force-pushed the atg3 branch 3 times, most recently from 4184189 to d808a10 Compare January 31, 2019 23:25

mjohnson541 force-pushed the atg3 branch from d808a10 to 0960f0f Compare February 1, 2019 17:05

mliu49 reviewed Apr 1, 2019

View reviewed changes

mjohnson541 added 2 commits April 9, 2019 15:58

Group.getExtensions fix

39af25b

mjohnson541 force-pushed the atg3 branch from 0960f0f to b728bb4 Compare April 9, 2019 20:02

mjohnson541 added 3 commits April 9, 2019 16:56

Enable estimation of bond dissociation energies

abd2e6a

A dictionary of bond dissociation energies has been added to element.py These are used by the getBDE method of Bond in molecule.py if additional atoms are added to RMG new values will need to be added to the BDE dictionary

add method for calculating the w0 Blower Masel parameter for arbitrar…

8a59e88

…y reactions bond dissociation energies are estimated using the BDE dictionary

add ArrheniusBM declarations to the arrhenius.pxd file

09763fa

mjohnson541 added 6 commits April 9, 2019 16:59

put regularization in makeTree immediately after generateTree

4c83865

this prevents the atom order from being disturbed before regularization

improve regularization and empty regularization detection

b008118

adjust testing so that regularization testing works

16d1b9c

fix extension generation and tree generation tests to match changes to algorithm

Create checkTree function for testing that generated trees are subgra…

5c8f30a

…ph isomorphic

add scikit-learn to dependencies

4b3ac87

mjohnson541 force-pushed the atg3 branch from b728bb4 to 47a426b Compare April 9, 2019 21:05

mjohnson541 force-pushed the atg3 branch from 408ebda to 47a426b Compare April 10, 2019 16:49

mliu49 reviewed Apr 10, 2019

View reviewed changes

mjohnson541 added 13 commits April 11, 2019 12:52

ensure families with two identical trees are not processed like the …

70819b4

…new trees

remove duplicate import

8455028

add comment describing old and new tree processing differences

1bfad2c

add newline at end of data/kinetics/common.py

89db1bd

add Cl to possible atoms in tree generation

f13dea4

improve generateInitialMap use in isomorphism functions

43f2200

speed up initial map comparisons

50f6b9b

remove return from makeTree

f411cb8

add scikit-learn to .conda/meta.yaml

5726769

clean up tree processing decisions in data/kinetics/groups.py

681b554

clean up function def statements

a1cb9bc

fix reaction isomorphism test by adding generateInitialMap option

97a036f

remove unnecessary deepcopy

23a8d44

mjohnson541 force-pushed the atg3 branch from 14dc794 to 23a8d44 Compare April 11, 2019 17:23

mliu49 approved these changes Apr 11, 2019

View reviewed changes

mliu49 merged commit 0db89ac into master Apr 11, 2019

mliu49 deleted the atg3 branch April 11, 2019 19:06

mliu49 mentioned this pull request May 15, 2019

RMG v2.4.0 (?) Release Planning #1591

Closed

5 tasks

		@@ -429,6 +432,223 @@ cdef class ArrheniusEP(KineticsModel):

		################################################################################

		cdef class ArrheniusBM(KineticsModel):

Automatic Tree Generation Part 3 #1461

Automatic Tree Generation Part 3 #1461

Conversation

mjohnson541 commented Aug 31, 2018

codecov bot commented Aug 31, 2018 • edited Loading

Codecov Report

mjohnson541 commented Jan 20, 2019

mjohnson541 commented Jan 21, 2019 • edited Loading

mliu49 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mjohnson541 commented Jan 31, 2019

mjohnson541 commented Jan 31, 2019

mjohnson541 commented Feb 1, 2019

mliu49 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mjohnson541 commented Apr 9, 2019

mjohnson541 commented Apr 9, 2019

mjohnson541 commented Apr 10, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mliu49 left a comment

Choose a reason for hiding this comment

codecov bot commented Aug 31, 2018 •

edited

Loading

mjohnson541 commented Jan 21, 2019 •

edited

Loading