-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add amber TI parser and improve the subsampling code #32
Conversation
Codecov Report
@@ Coverage Diff @@
## master #32 +/- ##
==========================================
- Coverage 95.04% 94.77% -0.28%
==========================================
Files 7 9 +2
Lines 202 383 +181
Branches 28 64 +36
==========================================
+ Hits 192 363 +171
- Misses 5 10 +5
- Partials 5 10 +5
Continue to review full report at Codecov.
|
@shuail - this will need to come with tests. You'll probably want to put the data for the tests in Note also that there is a "code coverage" analysis being run above which I presume is checking to see what portion of the code is being tested by the tests; since yours doesn't include tests, well, code coverage reduces dramatically and fails that test. |
As just discussed:
Ping me if anything is unclear. |
@orbeckst just added the amber dataset in alchemtest and create a PR there. Added amber test here to use the files in alchemtest. Tested on my local computer to use the updated alchemtest, it runs fine. Finger crossed for travis once the alchemtest library get updated. |
I'll deal with PR alchemistry/alchemtest#6 first to sort out the data side of things. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting started. See comments.
Major things
- keep the PR focused on Amber parser: split the changes to subsampling into a separate PR
- do not
SysExit
in a library - use logging instead of
print
@@ -4,6 +4,7 @@ | |||
import numpy as np | |||
from pymbar.timeseries import statisticalInefficiency | |||
from pymbar.timeseries import detectEquilibration | |||
from pymbar.timeseries import subsampleCorrelatedData |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to subsampling
are independent from the TI parser. Please remove from this PR and submit a new PR. This allows us to focus on one issue at a time, including reviewing and testing, and also makes the git history cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also helps if you raise an issue for changes to the library, e.g., "Replace pandas-based subsampling with pymbar.subsampleCorrelatedData". We can then discuss the change and link the PR to an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not reviewing these changes here because they will be in a separate PR.
@@ -0,0 +1,21 @@ | |||
"""Amber parser tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also like to have a test with at least one of the estimators. Can you add an Amber test to test_fep_estimators.py
?
src/alchemlyb/parsing/amber.py
Outdated
Change the final format to pandas to be consistent with the alchemlyb format | ||
""" | ||
|
||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
order imports in canonical order:
- standard lib (os, re, ...)
- scientific python (numpy, pandas, ...)
- alchemlyb (if present)
src/alchemlyb/parsing/amber.py
Outdated
import numpy as np | ||
import os | ||
|
||
def convert_to_pandas(file_datum, ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc string
src/alchemlyb/parsing/amber.py
Outdated
#self.mbar_energies = [] | ||
|
||
|
||
def file_validation(outfile, ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc string, also remove comma + white space in arg list
src/alchemlyb/parsing/amber.py
Outdated
invalid = True | ||
|
||
if not secp.skip_after('^ 3. ATOMIC '): | ||
print(' WARNING: no ATOMIC section found, ignoring file\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use logging
src/alchemlyb/parsing/amber.py
Outdated
|
||
t0, = secp.extract_section('^ begin time', '^$', ['coords']) | ||
if not secp.skip_after('^ 4. RESULTS'): | ||
print(' WARNING: no RESULTS section found, ignoring file\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use logging
src/alchemlyb/parsing/amber.py
Outdated
file_datum.t0 = t0 | ||
file_datum.dt = dt | ||
file_datum.T = T | ||
return file_datum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would simplify to
if invalid:
return False
file_datum = FEData()
file_datum.clambda = clambda
file_datum.t0 = t0
file_datum.dt = dt
file_datum.T = T
return file_datum
src/alchemlyb/parsing/amber.py
Outdated
|
||
def extract_dHdl(outfile, ): | ||
file_datum = file_validation(outfile) | ||
if file_validation(outfile): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rewrite the if
to terminate early, the following would be clearer:
if not file_validation(outfile):
return None
finished = False
comps = []
...
return df
from alchemtest.amber import load_simplesolvated | ||
|
||
|
||
def test_dHdl(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good start
@orbeckst Thanks for all your comments, it's very helpful. I made three major changes 1, switch to logging in the amber parser and clean up the code (add docs, empty lines etc). 2, add test_ti_estimators_amber.py to test the ti_estimator for amber. 3, roll back the subsampling commit and will create a separated PR later. |
src/alchemlyb/parsing/amber.py
Outdated
df = convert_to_pandas(file_datum) | ||
return df | ||
|
||
#currently just check the code with a simple amber ti output file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the whole main section (also, the print
use python 2.7 syntax but we run under python 2 and python 3 – this makes Travis fail at the moment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, the main section got removed now.
This is starting to look good. We would now like to get test coverage up. If you look at https://codecov.io/gh/alchemistry/alchemlyb/src/master/src/alchemlyb/parsing/amber.py (follow the codecov/project report and look for the overall coverage and navigate in the files) there are many lines in the function parsing.amber.file_validation() that are not exercised by the tests. We would want to have tests that actually show that you catch these errors as expected. The best thing to do would be to include small files with errors and then just test the (This is where finalizing PRs becomes painful and boring but it is vital for the long-term survivability of the code that we check things now and are then assured that in the future everything will still work or we learn when it doesn't.) Other comments on the code. |
src/alchemlyb/parsing/amber.py
Outdated
def __init__(self, filename): | ||
"""Opens a file according to its file type.""" | ||
self.filename = filename | ||
with open(filename, 'r') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach with _MAGIC_CMPR
can be replace with util.anyopen()
, see, for example https://github.com/alchemistry/alchemlyb/blob/master/src/alchemlyb/parsing/gmx.py#L143 . That's cleaner and more expandable. (I hadn't picked up on it in my earlier code reviews.)
(You can do it yourself or I am also happy to finish the PR as is and do this later and separately.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, will try to switch to anyopen
@orbeckst Got it, will add some testing for the file validation function. For the tested output files, I planned to created an invalidfiles folder under alchemtest/amber and use the alchemtest frame work in the alchemlyb. Does this sound good? |
Yes, good idea. Just check that the files are properly included when you run |
@orbeckst Just finished adding the test code here and test files (alchemtest) to test the file validation function in amber parser and switched the file opener to anyopen. Hope the test will pass this time :) |
Looking good, just waiting for PR alchemistry/alchemtest#11. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using anyopen()
will require a few more changes, which I hadn't realized earlier.
If this becomes to difficult I would suggest to revert the change where you introduced anyopen
and we use the original implementation for this PR. Then raise an issue to use anyopen
and we can then sort this out in a separate PR.
EDIT: Well, thinking about this again: try the original code — maybe it works and I am wrong.
Other minor things: see comments.
src/alchemlyb/parsing/amber.py
Outdated
DVDL_COMPS = ['BOND', 'ANGLE', 'DIHED', '1-4 NB', '1-4 EEL', 'VDWAALS', | ||
'EELEC', 'RESTRAINT'] | ||
_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?' | ||
_MAGIC_CMPR = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed anymore since switch to anyopen
. Please remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
src/alchemlyb/parsing/amber.py
Outdated
import re | ||
import pandas as pd | ||
import numpy as np | ||
import logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty line between standard library imports and package imports .... but this is nit-picking on my part ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logging
would come after re
and be fore numpy
because it is standard lib... according to PEP8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, push a fixed
src/alchemlyb/parsing/amber.py
Outdated
yield first | ||
|
||
while it: | ||
yield it.next() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Python 3 this should be
next(it)
(I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, change it to next(it)
src/alchemlyb/parsing/amber.py
Outdated
match = re.search(pattern, line) | ||
if match: | ||
break | ||
return self.fileh.tell() != self.filesize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to break when we use anyopen
because we never uncompress the file to disk and thus the file size of the compressed file is not related to the position in the uncompressed stream fileh
.
(More specifically: For a compressed file, self.filesize < self.fileh.tell()
for some positions so you might get a random occurrence where this is False
in the middle of the file. Conversely, you won't get True
at the real end of the file.)
I think this is supposed to check for EOF (end of file). This would need to be implemented differently. Depends on how skip_after()
is used. Perhaps can be rewritten along the lines of https://stackoverflow.com/a/24738688/334357 ??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, thinking about this again: try it out, maybe it works and I am wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I checked that the self.filesize < self.fileh.tell()
for the zipped file and it will not be equal at the end of the file so I changed the skip_after()
function to not use this detection but explicitly return True
if the pattern was found along iterating the whole file
src/alchemlyb/parsing/amber.py
Outdated
"""Read next line of the filehandle and check for EOF.""" | ||
self.lineno += 1 | ||
curr_pos = self.fileh.tell() | ||
if curr_pos == self.filesize: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break with compression (see comment above).
Not even sure why this is needed. Can't we just say
def next(self):
self.lineno += 1
return next(self.fileh)
This should raise StopIteration
when at the end and it should return the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a better solution. Sorry I didn't think that part much as migrating the code from alchemical-analysis. It looks like that this test successfully catch the untested part of the original code!
src/alchemlyb/parsing/amber.py
Outdated
return file_datum | ||
|
||
def extract_dHdl(outfile): | ||
"""Return gradients `dH/dl` from Amebr TI outputfile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: Amber
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/alchemlyb/parsing/amber.py
Outdated
self.filename = filename | ||
try: | ||
self.fileh = anyopen(self.filename, 'r') | ||
self.filesize = os.stat(self.filename).st_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cannot be used for compressed files (see other comments).
I think this line can be deleted because it is useless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, thinking about this again: try it out, maybe it works and I am wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you are right, the filesize is not working here for zipped files, remove the filesize related detection in the code.
… other minor issues
I merged alchemistry/alchemtest#11 so you should now have the appropriate test files available for Travis CI. I will manually restart the test #90 . You don't need to do anything until the test results come back. |
src/alchemlyb/parsing/amber.py
Outdated
import pandas as pd | ||
import numpy as np | ||
|
||
from util import anyopen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be
from .util import anyopen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yes, fixed this
@orbeckst thanks for your comments! Yes, after I switched to the zipped files, the end of file(EOF) detection in the |
One small change to get it to pass under python 3. |
Once the tests pass, we have the coverage. Assuming that this will look good, we need to make sure that it can be merged and I want to clean up the PR, i.e., merging related commits and removing the subsampling code that was added and deleted. I can do this locally with A consequence for you will be that you also have to rebase your fork because you have been working from master instead of a branch. I can tell you what I would do if you want to know. |
This looking very good. The only way that you can sensibly improve the coverage is by adding a test for the function parsing.amber.any_none(sequence) Basically, make sure that the line 42 is hit. Any of the other remaining checks are fairly standard defensive programming and I am not even sure how to test them properly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One tiny additional test, please, then I am happy ;-)
|
||
for element in sequence: | ||
if element is None: | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looking very good. The only way that you can sensibly improve the coverage is by adding a test for the function
parsing.amber.any_none(sequence)
Basically, make sure that the line 42 is hit. Any of the other remaining checks are fairly standard defensive programming and I am not even sure how to test them properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got a test function added for this line
@shuail looking good, I'll do the rebase... hopefully done before I have to run to class... |
I created PR #36 to finalize. |
Fixes #10