-
Notifications
You must be signed in to change notification settings - Fork 230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reaction generation using multiprocessing #1459
Conversation
cd3d25c
to
97ce8a1
Compare
Codecov Report
@@ Coverage Diff @@
## master #1459 +/- ##
=========================================
+ Coverage 41.53% 41.6% +0.07%
=========================================
Files 176 176
Lines 29111 29142 +31
Branches 5975 5990 +15
=========================================
+ Hits 12090 12125 +35
+ Misses 16189 16173 -16
- Partials 832 844 +12
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I made some comments and suggestions. It would be nice to see some profiling results before this is merged.
Regarding trailing whitespaces and other Codacy issues, you generally only need to address issues in lines that you changed. Removing all trailing whitespaces in a file is not necessary.
rmg.py
Outdated
@@ -82,6 +82,10 @@ def parse_command_line_arguments(command_line_args=None): | |||
parser.add_argument('-t', '--walltime', type=str, nargs=1, default='00:00:00:00', | |||
metavar='DD:HH:MM:SS', help='set the maximum execution time') | |||
|
|||
# Add option to select max number of processes for reaction generation | |||
parser.add_argument('-n', '--maxProc', type=int, nargs=1, default=1, | |||
help='used by multiprocessing in react.py') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be the most helpful info, since users may not know what react.py is. Perhaps max number of processors to use during reaction generation
or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
rmgpy/rmg/main.py
Outdated
pass | ||
|
||
if maxProc > psutil.cpu_count(): | ||
raise ValueError('Invalid format for user defined maximum number of procesors {0}; should be an integer and smaller or equal to your available number of cpu {1}'.format(maxProc, psutil.cpu_count())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like you're checking the format here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 'Invalid input ...'
rmgpy/rmg/model.py
Outdated
spcs.extend(rxn.reactants) | ||
spcs.extend(rxn.products) | ||
|
||
ensure_independent_atom_ids(spcs, resonance=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to ensure_independent_atom_ids here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise Species.getResonanceHybrid() fails due to not having valid atom ids. We call ensure_independent_atom_ids in generate_reactions_from_families, however, this is not working for multiprocessing. Using ensure_independent_atom_ids here fixes that issue.
rmgpy/rmg/react.py
Outdated
tmp = divmod(user_mem, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) | ||
tmp2 = min(maxProc, tmp[0]) | ||
procNum = max(1, tmp2) | ||
print 'For reaction generation {0} processes are used.'.format(procNum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We like to avoid print statements in favor of using the logging module. So you would just replace print
with logging.info()
or maybe logging.debug
depending on whether you want this to be printed normally.
Something you should consider is how many times this will be printed (seems like it will be a lot) and whether it's useful info for a typical user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. It will be printed for each time we are spawning a pool of worker. It might be of interest as the number of processes/worker might differ from the user input, depending on the available RAM and the memory consumption of the RMG base process before spawning the workers.
rmgpy/scoop_framework/util.py
Outdated
@@ -160,6 +172,8 @@ def submit_(func, *args, **kwargs): | |||
returns the return value of the called function, or | |||
when SCOOP is loaded, the future object. | |||
""" | |||
warnings.warn("The option scoop is no longer supported"\ | |||
" and may be removed in Version: 2.3 ", DeprecationWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all of these warnings, could you indent the second line to line up with the " in the first line? Also, the slashes shouldn't be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Got the template from here:
https://github.com/ReactionMechanismGenerator/RMG-Py/wiki/RMG-Contributor-Guidelines
I guess it should be changed accordingly?
rmgpy/rmg/react.py
Outdated
tmp = divmod(usermem, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) | ||
|
||
# Get available RAM (GB)and procnum dependent on OS | ||
if platform == "linux" or platform == "linux2": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use platform.startswith('linux')
here to check for both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Done.
rmgpy/rmg/react.py
Outdated
# OS X | ||
memoryavailable = psutil.virtual_memory().available/(1000.0 ** 3) | ||
memoryuse = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/(1000.0 ** 3) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should add handling for Windows as well. It doesn't have to be equivalent handling, but you could simply add an else
for any other platform and perhaps only use one processor in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but couldn't test it. Is someone with a Windows OS volunteering?
1604410
to
1ad85ac
Compare
Hmm, something seems to have gone wrong with your rebase. This branch now includes a ton of commits from master. |
ee97295
to
915910b
Compare
I added a commit refactoring the code for family splitting. Let me know what you think, and whether it matches your original design. |
rmgpy/rmg/model.py
Outdated
# This method chops the iterable into a number of chunks which it | ||
# submits to the process pool as separate tasks. | ||
p = Pool(processes=procnum) | ||
p.map(calculate_thermo_parallel,spcs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple concerns about the overall implementation for parallel QM. Normally, thermo is calculated for one species at a time in self.processNewReactions()
below. Here, you determine all species which would be calculated using QM and pre-calculate their thermo in parallel, then in processNewReactions
, it sees the thermo and does not re-calculate it.
- The
calculate_thermo_parallel
function seems to skip a lot of processing done inThermoDatabase.getThermoData
andthermoEngine.processThermoData
, which can have a significant effect on the final values. - It appears that you're using SMILES to determine whether species are unique? This is not safe because different resonance structures will have different SMILES. Also, generating SMILES for every new molecule will be relatively time consuming.
Minor suggestions, which may or may not be relevant after resolving the above issues:
- Since the procnum determination based on RAM is done both here and in react.py, it might be useful to turn it into a function somewhere and import it here.
- I would also split this section off as a new method, since
enlarge
is already a really big method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thought behind the implemented procedure is that the QM files should be generated in parallel and then looked up for one species at a time in self.processNewReactions() below. Testing showed that calculating QM thermo for one species at a time is more time consuming than in a bulk up here.
I use SMILES for comparison and then generate resonance structures within calculate_thermo_parallel() to generate QM files for each resonance structure. What would be a better way to ensure that the same QM file is not generated twice/overlapping during the parallel generation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SMILES comparison is now substituted with isIsomorphic comparison.
The procnum determination is now a function and imported for QMTP parallel and reaction generation in parallel.
And the parallel QM files writing section is split off outside of enlarge.
8e3e27b
to
55e1fc6
Compare
e82fe61
to
d7a7f84
Compare
d7a7f84
to
63e2976
Compare
6ed0fc6
to
d27647b
Compare
Allows thermo pruning with parallel QMTP.
Add explicit `rename` argument to generateThermo which is only true when called from enlarge. Thus, only new species without thermo are renamed, which prevents initial species and bath gases from getting accidentally renamed.
e69745b
to
7318b37
Compare
I changed the species re-labeling to only be done for new species created from reaction generation. Turns out that thermo generation for input species was happening earlier than we thought. I also removed the whitespace changes to clean up the PR a bit. |
ab4afeb
to
f362c9d
Compare
Reduce number of families being tested Make separate tests for serial and parallel processing
f362c9d
to
f749f49
Compare
I think this is ready code-wise. Are there any additional tests that you think should be run? |
Just re-tested, QMTP, pdep, and thermo filtering in serial and parallel and seems to work fine. Ready to go from my side. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hooray!
Should we update the documentation re parallelization? |
It is updated. Where do you think it needs more update? |
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation: - processNewReactions requires a newSpecies argument which was not being properly identified after multiprocessing changes in #1459 Background: - The newSpecies argument is retained from early RMG when reactions were generated right after adding a new species to the core. - Hence, newSpecies referred to the newly added core species which was used to generate the new reactions. - The overall model generation algorithm has since changed, so identifying newSpecies is not as straightforward. - The multiprocessing PR changed newSpecies to be any core species, missing the original purpose of identifying the species from which the reaction was generated. Changes: - Use the species tuples created during reaction generation to keep track of which species were used to generate the reaction - Update unit tests which are affected by the changes in return values from react and react_all
Motivation or Problem
Scoop has been found to increase the memory consumption until the OS's out of memory killer kills processes and the reaction generation fails. Consequently, python's multiprocessing module is implemented for MAC and Linux and a deprecation warning is added to scoop references.
Description of Changes
Multiprocessing is implemented for reaction generation in rmgpy/rmg/react.py. The processes are spawned and closed on a single node within the function 'react()'. The number of processes is determined based on the ratio of currently available RAM and currently used RAM. The user can input the maximum number of allowed processes from the command line. For each reaction generation the number of processes will be the minimum value either being the number of allowed processes due to user input or the value obtained by the RAM ratio. The RAM limitation is employed, because multiprocessing is forking the base process and the memory limit (SWAP + RAM) might be exceeded when using too many processes for a base process large in memory.
Multiprocessing is employed from the command line using the -n command and the maximum number of processes the user want to use, here 4. The default is 1 process, only an integer value smaller than the number of available cpu is allowed.
python rmg -n 4 input.py
Testing
The implementation has been tested for the superminimal, minimal and PDD cases on both MAC and RMG-server, ranging from 1 to 24 processes on a single node. An example submission script for the RMG-server is attached here:
submit.txt
Reviewer Tips
Try your own cases and document the changes in run time and memory consumption.