Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic Tree Generation Part 3 #1461

Merged
merged 96 commits into from
Apr 11, 2019
Merged

Automatic Tree Generation Part 3 #1461

merged 96 commits into from
Apr 11, 2019

Conversation

mjohnson541
Copy link
Contributor

This pull request adds functionality to automatically generate trees from any family provided the top nodes are not logic nodes. It adds the functionality to automatically generate rules using the set of training reactions that match each node weighted based on the uncertainties of the reactions. It also adds functionality for testing these trees and comparing them with the original RMG trees. Bugfixes are also a very substantial portion of this PR.

The algorithm has undergone significant restructuring since part 2, however, the general principles remain the same. New nodes are generated by determining which of a generated set of potential extension nodes minimize the uncertainty in estimation using the mean after the split. After the tree is generated they are regularized. By default this means that all nodes are made as specific as the reactions they match. Then all of the reactions matching each node are fit to a modified Arrhenius expression that uses a Blower-Masel interpolant for the Ea.

This does not yet provide the functionality to use the tree in RMG yet or uncertainty analysis, which will be included in atg4.

So far all of the new trees I've tested have been drastic improvements using the leave-one-out test at 2 sigma errors Substitution_O (N=127) goes from 16.7 kcal/mol to 10.7 kcal/mol (about a factor of 20 improvement) and ketoenol (N=8) goes from 28.9 kcal/mol to 15.0 kcal/mol (about a factor of 1000 improvement) (all at T=1000K).

@codecov
Copy link

codecov bot commented Aug 31, 2018

Codecov Report

Merging #1461 into master will decrease coverage by 0.18%.
The diff coverage is 45.08%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1461      +/-   ##
==========================================
- Coverage   41.85%   41.67%   -0.19%     
==========================================
  Files         165      166       +1     
  Lines       28014    28452     +438     
  Branches     5713     5855     +142     
==========================================
+ Hits        11726    11858     +132     
- Misses      15500    15773     +273     
- Partials      788      821      +33
Impacted Files Coverage Δ
rmgpy/reaction.py 0% <0%> (ø) ⬆️
rmgpy/molecule/molecule.py 0% <0%> (ø) ⬆️
rmgpy/molecule/group.py 0% <0%> (ø) ⬆️
rmgpy/species.py 0% <0%> (ø) ⬆️
rmgpy/molecule/element.py 0% <0%> (ø) ⬆️
rmgpy/kinetics/uncertainies.py 100% <100%> (ø)
rmgpy/data/kinetics/common.py 70.08% <100%> (+1.84%) ⬆️
rmgpy/data/kinetics/family.py 55.33% <53.67%> (-3.23%) ⬇️
rmgpy/data/kinetics/groups.py 15.15% <58.06%> (-4.85%) ⬇️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5ab75c...23a8d44. Read the comment docs.

@mjohnson541
Copy link
Contributor Author

This passes everything locally so this should be ready for full review. None of these changes should impact how RMG currently functions since I haven't added any of the new trees to the database yet.

@mjohnson541
Copy link
Contributor Author

mjohnson541 commented Jan 21, 2019

Pretty certain the RMG-tests failure is a travis issue, the tests pass and give identical models locally, and the child branch atg4 passes RMG-tests, for some reason on travis the liquid-oxidation example is timing out in all the restarts I've tried right now.

Copy link
Contributor

@mliu49 mliu49 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone through and made an initial round of comments. What do you suggest paying more attention to in terms of reviewing or testing? Is it necessary for me to understand all of the details of the algorithm?

Also, Travis is failing because the test job for liquid oxidation is timing out. The time limit is set at 4m30s. The job normally takes 3m50s on Travis, so I'm not sure why it might be taking longer.

@@ -28,11 +28,14 @@
###############################################################################

import numpy
np = numpy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if you changed the existing code to use np instead of keeping both variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it seems that doing this any other way causes strange problems. If I just do import numpy as np it tells me to cimport instead. If I convert import numpy as np into cimport numpy as np I get an error when it tries to access np.linalg.

return False
else:
initialMap[atom] = L[0]
if not self.isMappingValid(other,initialMap,equivalent=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should equivalent be True for full isomorphism?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

if L == []:
return False
else:
initialMap[atom] = L[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're taking the first result anyway, would it be better to short circuit?

for a in other.atoms:
    if a.label == atom.label:
        initialMap[atom] = a
        break
else:
    return False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some googling and I don't think the speed difference for list comprehensions overcomes the speed up from only checking half of the list so I'll change it.

return object1.isIsomorphic(object2)
else:
return object1.isIsomorphic(object2,generateInitialMap=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you just supply _generate_initial_map directly as the argument instead of adding another if statement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

#Bond Dissociation Energies
#Reference: Huheey, pps. A-21 to A-34; T.L. Cottrell, "The Strengths of Chemical Bonds," 2nd ed., Butterworths, London, 1958; B. deB. Darwent, "National Standard Reference Data Series," National Bureau of Standards, No. 31, Washington, DC, 1970; S.W. Benson, J. Chem. Educ., 42, 502 (1965).
#(C,C,1.5) was taken from an unsourced table that had similar values to those used below, should be replaced if a sourced value becomes available
elements = ['C','N','H','O','S','Cl','Si']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this elements list for? Perhaps use a more descriptive name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

self.extendRegularization(child,inds,regs,typ)
else:
raise ValueError('regularization type of {0} is unimplemented'.format(typ))
R = ['H','C','N','O','Si','S'] #set of possible R elements/atoms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about Cl and I?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add Cl, but we don't have BDE's for I currently. Do we have anything for I in the database?

templateRxnMap = self.getReactionMatches(thermoDatabase=thermoDatabase,removeDegeneracy=True,getReverse=True)
self.makeBMRulesFromTemplateRxnMap(templateRxnMap)
self.checkTree()
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This return seems unnecessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

else:
return rxns

def getReactionMatches(self,rxns=None,thermoDatabase=None,removeDegeneracy=False,estimateThermo=True,fixLabels=False,exactMatchesOnly=False,getReverse=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add spaces after commas please? Also could be broken into two lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

else:
old_atom_type_str = old_atom_type[0].label

grps.append((grp,grpc,basename+'_'+old_atom_type_str+'->'+item.label,'atomExt',(i,)))
grps.append((grp,grpc,basename+'_'+str(i+1)+old_atom_type_str+'->'+item.label,'atomExt',(i,)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use > instead of -> to save a character?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, in practice the names are so long I've tried to err on the side of making it easy to read individual sections of the name (having a bit more detail) rather than the thing as a whole.

else:
atom_type_str = atom_type[0].label

grps.append((grp,grpc,basename+'_'+str(i+1)+atom_type_str+'-inRing','ringExt',(i,)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using r0 or r1 like in the adjlists instead of -inRing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, in practice the names are so long I've tried to err on the side of making it easy to read individual sections of the name (having a bit more detail) rather than the thing as a whole. In particular -inRing is a less obvious extension so I thought I'd make it clearer in the naming, I'm not opposed to shortening it, but I feel like r0 and r1 is too short.

@mjohnson541 mjohnson541 force-pushed the atg3 branch 3 times, most recently from 4184189 to d808a10 Compare January 31, 2019 23:25
@mjohnson541
Copy link
Contributor Author

Unless otherwise noted I've made all the changes you requested. I spent a few hours trying to squash the changes properly, but there are so many merge conflicts (between 1-8 a commit) I got to the point where I couldn't reasonably believe I'd resolved them without making any mistakes so I'd prefer to leave the changes on the end, if you aren't opposed to that.

Honestly, one way of look at this code is that I run it and it gives me a tree and I can check the accuracy of that tree via cross validation, so there aren't many real dangers to the rest of RMG, really you need to mostly just yell at me about comments and clarity so that the poor soul who has to touch this next can work out what's going on. I don't know if its worth it for you to go so far as to understand all of it in detail, but perhaps just enough that if you had to rewrite it without the original code you'd know how to start and have some idea what problems can happen.

@mjohnson541
Copy link
Contributor Author

I'm not sure about the RMG-tests timing, nothing should be impacted and #1486 runs RMG-tests perfectly fine.

@mjohnson541
Copy link
Contributor Author

Unittests issues should be fixed now.

Copy link
Contributor

@mliu49 mliu49 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a few more comments. I'm fine with merging this, though as I mentioned, I am concerned about merge conflicts which may arise with RMG-cat. Hopefully those can be resolved easily.

Question: I think we discussed this a long time ago, but what was your reason for not sub-classing KineticsFamily with a new class for automatically generated families? It seems like it would help improve code organization and clarity.

@@ -402,3 +402,21 @@ def reduce_same_reactant_degeneracy(reaction, same_reactants=None):
'Degeneracy of reaction {} was decreased by 50% to {} since two of the reactants '
'are identical'.format(reaction, reaction.degeneracy)
)

def getAllDescendants(entry):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a method of the Entry class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in atg4

comment = self.comment,
)

def fitToReactions(self,rxns,w0=None,family=None,Ts=[300.0,500.0,600.0,700.0,800.0,900.0,1000.0,1100.0,1200.0,1500.0]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's inadvisable to provide a mutable type as a default argument. I would suggest defaulting to None and setting these temperatures in the body of the method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -429,6 +432,223 @@ cdef class ArrheniusEP(KineticsModel):

################################################################################

cdef class ArrheniusBM(KineticsModel):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need support for reversing this type of kinetics? If so, you would need to update Reaction.generateReverseRateCoefficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it should be converted to an arrhenius for a corresponding reaction

0:(14.0,'kcal/mol'),
11:(14.0,'kcal/mol'),
}
rank_accuracy_map = {i:Quantity(rank_accuracy_map[i]) for i in rank_accuracy_map.keys()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not {key: Quantity(value) for key, value in rank_accuracy_map.iteritems()}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


BDEs = {}
for key in BDEDict.keys():
q = Quantity(BDEDict[key]).value_si
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion:

for key, value in BDEDict.iteritems():
    q = Quantity(value).value_si

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

# entry is a top-level node that should be matched
specialCases = ['peroxyl_disproportionation','bimolec_hydroperoxide_decomposition','r_recombination']
if len(forwardTemplate) == 1 and len(reaction.reactants) > len(forwardTemplate) and self.label.lower().split('/')[0] not in specialCases:
entry = forwardTemplate[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment about how this if case is for automatically generated trees with only one tree? It took me a while to figure that out... (unless that's wrong, in which case you should definitely add a comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in atg4

@@ -47,12 +48,14 @@
from rmgpy import settings
from rmgpy.reaction import Reaction
from rmgpy.kinetics import Arrhenius
from rmgpy.kinetics.arrhenius import ArrheniusBM
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, could you add ArrheniusBM to kinetics.__init__ and combine this import with the prior one for Arrhenius?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in atg4

entry.item = grp
template_reactants.append(entry)
except AttributeError:
template_reactants = template.reactants
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the AttributeError that you're expecting here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when entry has no attribute item...I believe this may be related to logic nodes. This section gets completely changed in atg4 to make it compatiable with automatically generated trees

index += 1



Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove extra empty lines? Methods should be separated by 1 empty line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in atg4

.conda/meta.yaml Outdated
@@ -69,6 +69,7 @@ requirements:
- symmetry
- xlwt
- dde
- scikit-learn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind putting this in the right place alphabetically? Also dde while you're at it :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

…enius model as a function of the enthalpy of reaction

instead of the alpha of ArrheniusEP a w0 that is the average of the bond dissociation energies of the bond formed and the bond broken is used
equations are from Engineering Approximations for Activation Energies in Hydrogen Transfer Reactions (Blowers and Masel 2000)
A dictionary of bond dissociation energies has been added to element.py
These are used by the getBDE method of Bond in molecule.py

if additional atoms are added to RMG new values will need to be added to the BDE dictionary
…y reactions

bond dissociation energies are estimated using the BDE dictionary
this prevents the atom order from being disturbed before regularization
fix extension generation and tree generation tests to match changes to algorithm
The original regularization algorithm took at top down recursive approach regularizations were identified by in the top nodes and then efficiently
applied down their child nodes

However, this type of regularization fails when there are two atoms only distinguishable by their index.

Example:  Suppose the bonds to these atoms are entirely unspecified and that reactions exist where both of them are single bonds and where one is single and one is double.
During regularization dimension identification the single bond choice becomes a regularization dimension for both because setting either bond to a single bond does not impact which reactions are matched.  The top down regularization algorithm sees these as regularization dimensions and tried to set all of the bond orders to single for both bonds causing it to no longer match the reactions that have the double bond there.

The new algorithm regularizes from the bottom up.  At nodes with no children all graphically indistinguishable nodes are not regularized while (for the case of simpleRegularization) all other nodes are fully regularized.  For nodes with children (in the case of simpleRegularization) full regularization is attempted, but ignored if this causes the template to not match any of its children.
@mjohnson541
Copy link
Contributor Author

Answer to question: The reason is that the technology makes it possible to have hybrid trees...you could have a tree that has a set of fixed nodes and then is automatically generated underneath that, which makes it important to not have them be different objects.

@mjohnson541
Copy link
Contributor Author

I think this should be ready now.

@mjohnson541
Copy link
Contributor Author

Ok, I think this is ready.

@@ -1795,14 +1804,27 @@ def __generateReactions(self, reactants, products=None, forward=True, prod_reson
else:
template = self.reverseTemplate

if len(reactants) > len(template.reactants): #if the family has one template and is unimolecular split template into multiple reactants
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so this has been bothering me, and I would additional confirmation that it won't have any adverse effects. The motivating concern was why the deepcopy issue only affected the liquid oxidation test. So I analyzed this section again, and have a few more thoughts:

  1. The comment says if the family has one template and is unimolecular, which doesn't seem to match with the if condition. Should it be if the family has one template and is multimolecular? This is probably just a typo though, so we don't need to dwell on it.
  2. So then when do we trigger this if clause? The obvious case is for the AG-trees where all families have a single template. However, it can also happen if we call __generateReactions in the forward direction with the products of a family, e.g. 2 products for the unimolecular HO2_Elimination_from_PeroxyRadical family or 3 products for the Peroxyl_Disproportionation family. Both of these families (and other trimolecular families) are liquid phase families and are only included in the liquid oxidation example. Of course, the same thing would happen when calling __generateReactions in the reverse direction with the reactants of a family, e.g. 2 reactants for R_Addition_MultipleBond, which leaves me wondering about why we didn't see more of performance impact on the other tests.
  3. Thus, it seems like this might trigger in more situations than originally expected (at least for me, idk if you considered those other cases). However, I don't think any of the existing families have disconnected graphs, so split should return the original graph, and we end up with the same template entries.
  4. Thinking about the deepcopy again, you're creating multiple entries and assigning the split groups to each entry. However, without the deepcopy, it seems like each entry would point to the same instance, so you're simply replacing the groups of all them each time. This doesn't have any apparent effects because there aren't any groups that can be split currently, but it could have effects for atg4.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes
  2. Does forward=True reaction generation generate reverse reactions? My code is based on the assumption that I know if it's being reacted in the forward or reverse direction and use the forward/reverse templates as appropriate to make it so this should only happen to trees that have a single template that represents more than one reactant. My thought was that this wasn't the case and that one of the liquid families might be written more like an atg family?
  3. Yes
  4. Absolutely true, fortunately It appears I've already fixed this in atg4 in a way that doesn't require a deepcopy by restructuring what objects are used. see: e59fcb6

My thought is that it should be safe to use this branch now especially since this branch can't run atg trees in RMG jobs yet. I'd like to resolve 2 so that I can fix as needed on atg4, but I think the more fundamental issue is solved on atg4 already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fine for now, but it's actually not very smart to split the templates everytime the function is called. The templates should probably be generated in the init method of the family so they don't need to be generated each function call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we try generating reactions in both directions for a given set of reacting species, it should be quite common to encounter a mismatch in the number of species and templates. I was also thinking that maybe one of the liquid families was written like an atg family, but I didn't find any.

Copy link
Contributor

@mliu49 mliu49 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for removing the white-space changes!

@mliu49 mliu49 merged commit 0db89ac into master Apr 11, 2019
@mliu49 mliu49 deleted the atg3 branch April 11, 2019 19:06
@mliu49 mliu49 mentioned this pull request May 15, 2019
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants