-
Notifications
You must be signed in to change notification settings - Fork 230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatic Tree Generation Part 3 #1461
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1461 +/- ##
==========================================
- Coverage 41.85% 41.67% -0.19%
==========================================
Files 165 166 +1
Lines 28014 28452 +438
Branches 5713 5855 +142
==========================================
+ Hits 11726 11858 +132
- Misses 15500 15773 +273
- Partials 788 821 +33
Continue to review full report at Codecov.
|
This passes everything locally so this should be ready for full review. None of these changes should impact how RMG currently functions since I haven't added any of the new trees to the database yet. |
Pretty certain the RMG-tests failure is a travis issue, the tests pass and give identical models locally, and the child branch atg4 passes RMG-tests, for some reason on travis the liquid-oxidation example is timing out in all the restarts I've tried right now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through and made an initial round of comments. What do you suggest paying more attention to in terms of reviewing or testing? Is it necessary for me to understand all of the details of the algorithm?
Also, Travis is failing because the test job for liquid oxidation is timing out. The time limit is set at 4m30s. The job normally takes 3m50s on Travis, so I'm not sure why it might be taking longer.
@@ -28,11 +28,14 @@ | |||
############################################################################### | |||
|
|||
import numpy | |||
np = numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if you changed the existing code to use np instead of keeping both variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it seems that doing this any other way causes strange problems. If I just do import numpy as np
it tells me to cimport
instead. If I convert import numpy as np
into cimport numpy as np
I get an error when it tries to access np.linalg.
rmgpy/molecule/molecule.py
Outdated
return False | ||
else: | ||
initialMap[atom] = L[0] | ||
if not self.isMappingValid(other,initialMap,equivalent=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should equivalent be True for full isomorphism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
rmgpy/molecule/molecule.py
Outdated
if L == []: | ||
return False | ||
else: | ||
initialMap[atom] = L[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're taking the first result anyway, would it be better to short circuit?
for a in other.atoms:
if a.label == atom.label:
initialMap[atom] = a
break
else:
return False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some googling and I don't think the speed difference for list comprehensions overcomes the speed up from only checking half of the list so I'll change it.
rmgpy/reaction.py
Outdated
return object1.isIsomorphic(object2) | ||
else: | ||
return object1.isIsomorphic(object2,generateInitialMap=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you just supply _generate_initial_map
directly as the argument instead of adding another if statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
rmgpy/molecule/element.py
Outdated
#Bond Dissociation Energies | ||
#Reference: Huheey, pps. A-21 to A-34; T.L. Cottrell, "The Strengths of Chemical Bonds," 2nd ed., Butterworths, London, 1958; B. deB. Darwent, "National Standard Reference Data Series," National Bureau of Standards, No. 31, Washington, DC, 1970; S.W. Benson, J. Chem. Educ., 42, 502 (1965). | ||
#(C,C,1.5) was taken from an unsourced table that had similar values to those used below, should be replaced if a sourced value becomes available | ||
elements = ['C','N','H','O','S','Cl','Si'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this elements list for? Perhaps use a more descriptive name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
rmgpy/data/kinetics/family.py
Outdated
self.extendRegularization(child,inds,regs,typ) | ||
else: | ||
raise ValueError('regularization type of {0} is unimplemented'.format(typ)) | ||
R = ['H','C','N','O','Si','S'] #set of possible R elements/atoms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about Cl and I?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add Cl, but we don't have BDE's for I currently. Do we have anything for I in the database?
rmgpy/data/kinetics/family.py
Outdated
templateRxnMap = self.getReactionMatches(thermoDatabase=thermoDatabase,removeDegeneracy=True,getReverse=True) | ||
self.makeBMRulesFromTemplateRxnMap(templateRxnMap) | ||
self.checkTree() | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This return seems unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
rmgpy/data/kinetics/family.py
Outdated
else: | ||
return rxns | ||
|
||
def getReactionMatches(self,rxns=None,thermoDatabase=None,removeDegeneracy=False,estimateThermo=True,fixLabels=False,exactMatchesOnly=False,getReverse=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add spaces after commas please? Also could be broken into two lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
else: | ||
old_atom_type_str = old_atom_type[0].label | ||
|
||
grps.append((grp,grpc,basename+'_'+old_atom_type_str+'->'+item.label,'atomExt',(i,))) | ||
grps.append((grp,grpc,basename+'_'+str(i+1)+old_atom_type_str+'->'+item.label,'atomExt',(i,))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use >
instead of ->
to save a character?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, in practice the names are so long I've tried to err on the side of making it easy to read individual sections of the name (having a bit more detail) rather than the thing as a whole.
else: | ||
atom_type_str = atom_type[0].label | ||
|
||
grps.append((grp,grpc,basename+'_'+str(i+1)+atom_type_str+'-inRing','ringExt',(i,))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about using r0
or r1
like in the adjlists instead of -inRing
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, in practice the names are so long I've tried to err on the side of making it easy to read individual sections of the name (having a bit more detail) rather than the thing as a whole. In particular -inRing
is a less obvious extension so I thought I'd make it clearer in the naming, I'm not opposed to shortening it, but I feel like r0 and r1 is too short.
4184189
to
d808a10
Compare
Unless otherwise noted I've made all the changes you requested. I spent a few hours trying to squash the changes properly, but there are so many merge conflicts (between 1-8 a commit) I got to the point where I couldn't reasonably believe I'd resolved them without making any mistakes so I'd prefer to leave the changes on the end, if you aren't opposed to that. Honestly, one way of look at this code is that I run it and it gives me a tree and I can check the accuracy of that tree via cross validation, so there aren't many real dangers to the rest of RMG, really you need to mostly just yell at me about comments and clarity so that the poor soul who has to touch this next can work out what's going on. I don't know if its worth it for you to go so far as to understand all of it in detail, but perhaps just enough that if you had to rewrite it without the original code you'd know how to start and have some idea what problems can happen. |
I'm not sure about the RMG-tests timing, nothing should be impacted and #1486 runs RMG-tests perfectly fine. |
Unittests issues should be fixed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a few more comments. I'm fine with merging this, though as I mentioned, I am concerned about merge conflicts which may arise with RMG-cat. Hopefully those can be resolved easily.
Question: I think we discussed this a long time ago, but what was your reason for not sub-classing KineticsFamily with a new class for automatically generated families? It seems like it would help improve code organization and clarity.
@@ -402,3 +402,21 @@ def reduce_same_reactant_degeneracy(reaction, same_reactants=None): | |||
'Degeneracy of reaction {} was decreased by 50% to {} since two of the reactants ' | |||
'are identical'.format(reaction, reaction.degeneracy) | |||
) | |||
|
|||
def getAllDescendants(entry): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a method of the Entry
class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in atg4
rmgpy/kinetics/arrhenius.pyx
Outdated
comment = self.comment, | ||
) | ||
|
||
def fitToReactions(self,rxns,w0=None,family=None,Ts=[300.0,500.0,600.0,700.0,800.0,900.0,1000.0,1100.0,1200.0,1500.0]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's inadvisable to provide a mutable type as a default argument. I would suggest defaulting to None and setting these temperatures in the body of the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -429,6 +432,223 @@ cdef class ArrheniusEP(KineticsModel): | |||
|
|||
################################################################################ | |||
|
|||
cdef class ArrheniusBM(KineticsModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need support for reversing this type of kinetics? If so, you would need to update Reaction.generateReverseRateCoefficient
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it should be converted to an arrhenius for a corresponding reaction
rmgpy/kinetics/uncertainies.py
Outdated
0:(14.0,'kcal/mol'), | ||
11:(14.0,'kcal/mol'), | ||
} | ||
rank_accuracy_map = {i:Quantity(rank_accuracy_map[i]) for i in rank_accuracy_map.keys()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not {key: Quantity(value) for key, value in rank_accuracy_map.iteritems()}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
rmgpy/molecule/element.py
Outdated
|
||
BDEs = {} | ||
for key in BDEDict.keys(): | ||
q = Quantity(BDEDict[key]).value_si |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion:
for key, value in BDEDict.iteritems():
q = Quantity(value).value_si
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
# entry is a top-level node that should be matched | ||
specialCases = ['peroxyl_disproportionation','bimolec_hydroperoxide_decomposition','r_recombination'] | ||
if len(forwardTemplate) == 1 and len(reaction.reactants) > len(forwardTemplate) and self.label.lower().split('/')[0] not in specialCases: | ||
entry = forwardTemplate[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment about how this if case is for automatically generated trees with only one tree? It took me a while to figure that out... (unless that's wrong, in which case you should definitely add a comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in atg4
@@ -47,12 +48,14 @@ | |||
from rmgpy import settings | |||
from rmgpy.reaction import Reaction | |||
from rmgpy.kinetics import Arrhenius | |||
from rmgpy.kinetics.arrhenius import ArrheniusBM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, could you add ArrheniusBM
to kinetics.__init__
and combine this import with the prior one for Arrhenius
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in atg4
entry.item = grp | ||
template_reactants.append(entry) | ||
except AttributeError: | ||
template_reactants = template.reactants |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the AttributeError that you're expecting here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when entry has no attribute item...I believe this may be related to logic nodes. This section gets completely changed in atg4 to make it compatiable with automatically generated trees
index += 1 | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove extra empty lines? Methods should be separated by 1 empty line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in atg4
.conda/meta.yaml
Outdated
@@ -69,6 +69,7 @@ requirements: | |||
- symmetry | |||
- xlwt | |||
- dde | |||
- scikit-learn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind putting this in the right place alphabetically? Also dde
while you're at it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
…enius model as a function of the enthalpy of reaction instead of the alpha of ArrheniusEP a w0 that is the average of the bond dissociation energies of the bond formed and the bond broken is used equations are from Engineering Approximations for Activation Energies in Hydrogen Transfer Reactions (Blowers and Masel 2000)
A dictionary of bond dissociation energies has been added to element.py These are used by the getBDE method of Bond in molecule.py if additional atoms are added to RMG new values will need to be added to the BDE dictionary
…y reactions bond dissociation energies are estimated using the BDE dictionary
this prevents the atom order from being disturbed before regularization
fix extension generation and tree generation tests to match changes to algorithm
The original regularization algorithm took at top down recursive approach regularizations were identified by in the top nodes and then efficiently applied down their child nodes However, this type of regularization fails when there are two atoms only distinguishable by their index. Example: Suppose the bonds to these atoms are entirely unspecified and that reactions exist where both of them are single bonds and where one is single and one is double. During regularization dimension identification the single bond choice becomes a regularization dimension for both because setting either bond to a single bond does not impact which reactions are matched. The top down regularization algorithm sees these as regularization dimensions and tried to set all of the bond orders to single for both bonds causing it to no longer match the reactions that have the double bond there. The new algorithm regularizes from the bottom up. At nodes with no children all graphically indistinguishable nodes are not regularized while (for the case of simpleRegularization) all other nodes are fully regularized. For nodes with children (in the case of simpleRegularization) full regularization is attempted, but ignored if this causes the template to not match any of its children.
Answer to question: The reason is that the technology makes it possible to have hybrid trees...you could have a tree that has a set of fixed nodes and then is automatically generated underneath that, which makes it important to not have them be different objects. |
I think this should be ready now. |
Ok, I think this is ready. |
rmgpy/data/kinetics/family.py
Outdated
@@ -1795,14 +1804,27 @@ def __generateReactions(self, reactants, products=None, forward=True, prod_reson | |||
else: | |||
template = self.reverseTemplate | |||
|
|||
if len(reactants) > len(template.reactants): #if the family has one template and is unimolecular split template into multiple reactants |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so this has been bothering me, and I would additional confirmation that it won't have any adverse effects. The motivating concern was why the deepcopy issue only affected the liquid oxidation test. So I analyzed this section again, and have a few more thoughts:
- The comment says
if the family has one template and is unimolecular
, which doesn't seem to match with the if condition. Should it beif the family has one template and is multimolecular
? This is probably just a typo though, so we don't need to dwell on it. - So then when do we trigger this if clause? The obvious case is for the AG-trees where all families have a single template. However, it can also happen if we call
__generateReactions
in the forward direction with the products of a family, e.g. 2 products for the unimolecularHO2_Elimination_from_PeroxyRadical
family or 3 products for thePeroxyl_Disproportionation
family. Both of these families (and other trimolecular families) are liquid phase families and are only included in the liquid oxidation example. Of course, the same thing would happen when calling__generateReactions
in the reverse direction with the reactants of a family, e.g. 2 reactants forR_Addition_MultipleBond
, which leaves me wondering about why we didn't see more of performance impact on the other tests. - Thus, it seems like this might trigger in more situations than originally expected (at least for me, idk if you considered those other cases). However, I don't think any of the existing families have disconnected graphs, so split should return the original graph, and we end up with the same template entries.
- Thinking about the
deepcopy
again, you're creating multiple entries and assigning the split groups to each entry. However, without the deepcopy, it seems like each entry would point to the same instance, so you're simply replacing the groups of all them each time. This doesn't have any apparent effects because there aren't any groups that can be split currently, but it could have effects for atg4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes
- Does forward=True reaction generation generate reverse reactions? My code is based on the assumption that I know if it's being reacted in the forward or reverse direction and use the forward/reverse templates as appropriate to make it so this should only happen to trees that have a single template that represents more than one reactant. My thought was that this wasn't the case and that one of the liquid families might be written more like an atg family?
- Yes
- Absolutely true, fortunately It appears I've already fixed this in atg4 in a way that doesn't require a deepcopy by restructuring what objects are used. see: e59fcb6
My thought is that it should be safe to use this branch now especially since this branch can't run atg trees in RMG jobs yet. I'd like to resolve 2 so that I can fix as needed on atg4, but I think the more fundamental issue is solved on atg4 already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be fine for now, but it's actually not very smart to split the templates everytime the function is called. The templates should probably be generated in the init method of the family so they don't need to be generated each function call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we try generating reactions in both directions for a given set of reacting species, it should be quite common to encounter a mismatch in the number of species and templates. I was also thinking that maybe one of the liquid families was written like an atg family, but I didn't find any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for removing the white-space changes!
This pull request adds functionality to automatically generate trees from any family provided the top nodes are not logic nodes. It adds the functionality to automatically generate rules using the set of training reactions that match each node weighted based on the uncertainties of the reactions. It also adds functionality for testing these trees and comparing them with the original RMG trees. Bugfixes are also a very substantial portion of this PR.
The algorithm has undergone significant restructuring since part 2, however, the general principles remain the same. New nodes are generated by determining which of a generated set of potential extension nodes minimize the uncertainty in estimation using the mean after the split. After the tree is generated they are regularized. By default this means that all nodes are made as specific as the reactions they match. Then all of the reactions matching each node are fit to a modified Arrhenius expression that uses a Blower-Masel interpolant for the Ea.
This does not yet provide the functionality to use the tree in RMG yet or uncertainty analysis, which will be included in atg4.
So far all of the new trees I've tested have been drastic improvements using the leave-one-out test at 2 sigma errors Substitution_O (N=127) goes from 16.7 kcal/mol to 10.7 kcal/mol (about a factor of 20 improvement) and ketoenol (N=8) goes from 28.9 kcal/mol to 15.0 kcal/mol (about a factor of 1000 improvement) (all at T=1000K).