Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

onlyHeterocyclics option for Machine Leaning Estimator Settings #1621

Merged
merged 4 commits into from
Jul 25, 2019

Conversation

yunsiechung
Copy link
Contributor

Motivation or Problem

Adds one more option for Machine Learning Estimator settings to allow one to use machine learning for only heterocyclic species and group additivity for the rest

Description of Changes

onlyHeterocyclics option is added to Machine rmgpy.rmg.input.mlEstimator, and a new method isHeterocyclic is added to rmgpy.molecule.molecule.py. Relevant changes are made in other rmg files including the documentation.
If both onlyCyclics and onlyHeterocyclics options are True, then onlyCyclics option overrides, and all cyclic species thermo will be estimated by machine learning.

Testing

3 unit tests are added to rmgpy.molecule.moleculeTest.py to check that species can be correctly identified as heterocyclic or not by isHeterocyclic method.
3 unit tests are added to rmgpy.data.thermoTest.py to check that mlEstimator is applied for heterocyclic species for appropriate mlEstimator settings.

Reviewer Tips

Check unit tests and run simple mlEstimators examples with heterocyclic settings turned on and off

@yunsiechung yunsiechung requested review from cgrambow and mliu49 June 11, 2019 20:11
@codecov
Copy link

codecov bot commented Jun 11, 2019

Codecov Report

Merging #1621 into master will increase coverage by <.01%.
The diff coverage is 14.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1621      +/-   ##
==========================================
+ Coverage   41.88%   41.88%   +<.01%     
==========================================
  Files         176      176              
  Lines       29354    29368      +14     
  Branches     6052     6059       +7     
==========================================
+ Hits        12294    12301       +7     
- Misses      16183    16188       +5     
- Partials      877      879       +2
Impacted Files Coverage Δ
rmgpy/molecule/molecule.py 0% <0%> (ø) ⬆️
rmgpy/rmg/input.py 35.39% <0%> (-0.39%) ⬇️
rmgpy/data/thermo.py 61.92% <100%> (+0.06%) ⬆️
rmgpy/data/kinetics/family.py 52.9% <0%> (+0.23%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ccbe34a...3e495db. Read the comment docs.

@mliu49
Copy link
Contributor

mliu49 commented Jun 11, 2019

One comment before I review in detail. Since heterocyclics are a subset of cyclics, it seems to me like it would make more sense for onlyHeterocyclics=True to imply onlyCyclics=True. This would suggest the following behavior:

  • onlyCyclics=False and onlyHeterocyclics=False is ok -> no restrictions
  • onlyCyclics=True and onlyHeterocyclics=False is ok -> only cyclic species
  • onlyCyclics=True and onlyHeterocyclics=True is ok -> only heterocyclic species
  • onlyCyclics=False and onlyHeterocyclics=True -> change onlyCyclics to True and log a warning -> only heterocyclic species

@yunsiechung
Copy link
Contributor Author

@mliu49 What you said makes more sense. I will make changes accordingly

1 similar comment
@yunsiechung
Copy link
Contributor Author

@mliu49 What you said makes more sense. I will make changes accordingly

@yunsiechung yunsiechung force-pushed the MLthermo_heterocyclic branch from 70eb9c8 to 8ccb7a0 Compare June 14, 2019 14:52
Copy link
Contributor

@mliu49 mliu49 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Only two minor comments.

species will be estimated. ``onlyHeterocyclics`` means that only heterocyclic species will be estimated. Note that
if ``onlyHeterocyclics`` setting is set to True, the machine learning estimator is restricted to only heterocyclic
species regardless of the ``onlyCyclics`` setting. If ``onlyCyclics`` is False and ``onlyHeterocyclics`` is True,
RMG will log a warning that ``onlyCyclics`` should also be True and because heterocyclics are a subset of cyclics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly awkward wording "and because heterocyclics are a subset of cyclics"

Maybe "and the estimator will be restricted to heterocyclics because they are a subset of cyclics`?

@@ -407,6 +430,7 @@ def test_thermo_generation_ml_settings(self):
self.assertIsInstance(thermo, ThermoData)
self.assertTrue('ML Estimation' in thermo.comment, 'Thermo not from ML estimation, test purpose not fulfilled')


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove this extra line?

@mliu49
Copy link
Contributor

mliu49 commented Jul 12, 2019

Could you address my last two comments and rebase? I think this can be merged soon.

'isHeterocyclic' function will return True if a molecule is heterocyclic
or return False if not
3 unit tests for molecule.isHeterocyclic() were added.
It checks whether it returns False for cyclohexanol and returns True for
furan and pyridine
A new option 'onlyHeterocyclics' is added for Machine Learning Thermo
Estimator. If 'onlyHeterocyclics' is True, the thermo of only heterocyclics
species are estimated by mlEstimator regardless of 'onlyCyclics'
setting. If 'onlyCyclics' is False and 'onlyHeterocyclics' is True, then
RMG will log a warning that 'onlyCyclics' should be also True since
heterocyclics is a subset of cyclics.
I also added lines that will log a warning when 'onlyCyclics' is False
and 'minCycleOverlap' is greater than zero since the machine learning
estimator is restricted to only cyclic species when 'minCycleOverlap' is
greater than zero.
If 'onlyCyclics' is False and 'heteroCyclics' is True and
'minCycleOverlap' is greater than zero, it will log a warning and the
machine learning estimator will be restricted to only heterocyclic
species with the speficied minimum cycle overlap.
4 unit tests for 'onlyHeterocyclics' option in Machine Learning Thermo
Estimator are added.
Case 1: 'onlyCyclics' = True, 'onlyHeterocyclics' = True
         test cyclic species gives None
Case 2: 'onlyCyclics' = True, 'onlyHeterocyclics' = True
         test heterocyclic species gives ML thermo
Case 3: 'onlyCyclics' = False, 'onlyHeterocyclics' = True
         test cyclics species gives None
Case 4: 'onlyCyclics' = False, 'onlyHeterocyclics' = True
         test heterocyclic species gives ML thermo
@yunsiechung yunsiechung force-pushed the MLthermo_heterocyclic branch from 8ccb7a0 to 3e495db Compare July 25, 2019 20:14
@yunsiechung
Copy link
Contributor Author

@mliu49 I applied all the changes you have suggested :)

Copy link
Contributor

@mliu49 mliu49 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good!

@mliu49 mliu49 merged commit adfa67b into master Jul 25, 2019
@mliu49 mliu49 deleted the MLthermo_heterocyclic branch July 25, 2019 21:50
@mliu49 mliu49 mentioned this pull request Dec 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants