
add amber TI parser and improve the subsampling code #32

Closed
wants to merge 14 commits

Conversation


@shuail shuail commented Oct 30, 2017

Fixes #10

  1. Add the Amber TI parser to extract the dH/dl data.
  2. Inside the statistical_inefficiency function in subsampling.py, switch to pymbar's subsampleCorrelatedData to generate the subsampled data points; the original code used pandas slicing, which required rounding the statistical inefficiency to an integer (a short sketch follows below).
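For illustration, a minimal sketch of the difference on toy data (the variable names are placeholders, not the actual subsampling.py code):

import numpy as np
import pandas as pd
from pymbar import timeseries

# toy correlated series standing in for a dH/dl column
rng = np.random.RandomState(0)
series = np.cumsum(rng.normal(size=1000)) * 0.01 + rng.normal(size=1000)
df = pd.DataFrame({'dHdl': series})

g = timeseries.statisticalInefficiency(series)

# old approach: pandas slicing needs g rounded to an integer stride
subsampled_old = df.iloc[::int(np.rint(g))]

# new approach: subsampleCorrelatedData picks the uncorrelated frame indices
indices = timeseries.subsampleCorrelatedData(series, g=g)
subsampled_new = df.iloc[indices]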


codecov-io commented Oct 30, 2017

Codecov Report

Merging #32 into master will decrease coverage by 0.27%.
The diff coverage is 93.25%.


@@            Coverage Diff             @@
##           master      #32      +/-   ##
==========================================
- Coverage   95.04%   94.77%   -0.28%     
==========================================
  Files           7        9       +2     
  Lines         202      383     +181     
  Branches       28       64      +36     
==========================================
+ Hits          192      363     +171     
- Misses          5       10       +5     
- Partials        5       10       +5
Impacted Files Coverage Δ
src/alchemlyb/parsing/amber.py 93.25% <93.25%> (ø)
src/alchemlyb/__init__.py 100% <0%> (ø)
src/alchemlyb/parsing/util.py 100% <0%> (+11.11%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8915ece...3bc2c67. Read the comment docs.

@davidlmobley

@shuail - this will need to come with tests. You'll probably want to put the data for the tests in alchemtest and then build in tests which rely on it here.

Note also that there is a "code coverage" analysis being run above, which I presume checks what portion of the code is exercised by the tests; since your PR doesn't include tests, the code coverage drops dramatically and that check fails.

@orbeckst orbeckst self-requested a review October 30, 2017 19:24
@orbeckst orbeckst self-assigned this Oct 30, 2017
@orbeckst
Member

As just discussed:

  1. add tests to alchemtest (open a PR there)
  2. write tests (initially modelled on the Gromacs ones) that use the alchemtest data set

Ping me if anything is unclear.

@shuail
Collaborator Author

shuail commented Oct 31, 2017

@orbeckst I just added the Amber dataset to alchemtest and created a PR there, and added an Amber test here that uses the files from alchemtest. I tested it on my local computer with the updated alchemtest and it runs fine. Fingers crossed for Travis once the alchemtest library gets updated.

@orbeckst
Member

I'll deal with PR alchemistry/alchemtest#6 first to sort out the data side of things.

@orbeckst orbeckst left a comment

Thanks for getting started. See comments.

Major things

  • keep the PR focused on the Amber parser: split the changes to subsampling into a separate PR
  • do not call sys.exit() in a library
  • use logging instead of print (a minimal sketch follows below)
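On the logging point, a minimal sketch of the pattern (the logger name is only an assumption for illustration):

import logging

logger = logging.getLogger('alchemlyb.parsing.amber')

# instead of: print(' WARNING: no ATOMIC section found, ignoring file\n')
logger.warning('No ATOMIC section found, ignoring file.')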

@@ -4,6 +4,7 @@
import numpy as np
from pymbar.timeseries import statisticalInefficiency
from pymbar.timeseries import detectEquilibration
from pymbar.timeseries import subsampleCorrelatedData
Member

The changes to subsampling are independent from the TI parser. Please remove from this PR and submit a new PR. This allows us to focus on one issue at a time, including reviewing and testing, and also makes the git history cleaner.

Member

It also helps if you raise an issue for changes to the library, e.g., "Replace pandas-based subsampling with pymbar.subsampleCorrelatedData". We can then discuss the change and link the PR to an issue.

Member

I am not reviewing these changes here because they will be in a separate PR.

@@ -0,0 +1,21 @@
"""Amber parser tests.
Member

We also like to have a test with at least one of the estimators. Can you add an Amber test to test_fep_estimators.py?

Change the final format to pandas to be consistent with the alchemlyb format
"""

import pandas as pd
Member

order imports in canonical order:

  1. standard lib (os, re, ...)
  2. scientific python (numpy, pandas, ...)
  3. alchemlyb (if present)
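For example, the groups might look like this (illustrative only; the exact modules depend on what the parser needs):

# 1. standard library
import logging
import os
import re

# 2. scientific python
import numpy as np
import pandas as pd

# 3. alchemlyb
from .util import anyopen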

import numpy as np
import os

def convert_to_pandas(file_datum, ):
Member

doc string

#self.mbar_energies = []


def file_validation(outfile, ):
Member

doc string, also remove comma + white space in arg list

invalid = True

if not secp.skip_after('^ 3. ATOMIC '):
    print(' WARNING: no ATOMIC section found, ignoring file\n')
Member

use logging


t0, = secp.extract_section('^ begin time', '^$', ['coords'])
if not secp.skip_after('^ 4. RESULTS'):
    print(' WARNING: no RESULTS section found, ignoring file\n')
Member

use logging

file_datum.t0 = t0
file_datum.dt = dt
file_datum.T = T
return file_datum
Member

I would simplify to

if invalid:
    return False

file_datum = FEData()
file_datum.clambda = clambda
file_datum.t0 = t0
file_datum.dt = dt
file_datum.T = T
return file_datum


def extract_dHdl(outfile, ):
    file_datum = file_validation(outfile)
    if file_validation(outfile):
Member

rewrite the if to terminate early, the following would be clearer:

if not file_validation(outfile):
    return None

finished = False
comps = []
...
return df

from alchemtest.amber import load_simplesolvated


def test_dHdl():
Member

good start
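(A hedged sketch of how such a test might be fleshed out; the layout of the alchemtest bunch is an assumption:)

import pandas as pd
from alchemtest.amber import load_simplesolvated
from alchemlyb.parsing.amber import extract_dHdl

def test_dHdl():
    """Every Amber output file should yield a dH/dl DataFrame."""
    dataset = load_simplesolvated()
    # assumption: dataset['data'] maps leg names to lists of output files
    for leg, filenames in dataset['data'].items():
        for filename in filenames:
            dHdl = extract_dHdl(filename)
            assert isinstance(dHdl, pd.DataFrame)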

@orbeckst orbeckst mentioned this pull request Nov 1, 2017
@shuail
Collaborator Author

shuail commented Nov 2, 2017

@orbeckst Thanks for all your comments, they are very helpful. I made three major changes: (1) switched to logging in the Amber parser and cleaned up the code (added docstrings, empty lines, etc.); (2) added test_ti_estimators_amber.py to test the TI estimator for Amber; (3) rolled back the subsampling commit; I will create a separate PR for that later.
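(For context, a rough sketch of what a TI estimator test could look like; the dataset layout and the assertion are assumptions, not the actual content of test_ti_estimators_amber.py:)

import pandas as pd
from alchemtest.amber import load_simplesolvated
from alchemlyb.parsing.amber import extract_dHdl
from alchemlyb.estimators import TI

def test_ti_amber():
    dataset = load_simplesolvated()
    dHdl = pd.concat([extract_dHdl(filename)
                      for leg in dataset['data']
                      for filename in dataset['data'][leg]])
    ti = TI().fit(dHdl)
    # delta_f_ is the matrix of free energy differences between lambda states
    assert ti.delta_f_.shape[0] == ti.delta_f_.shape[1]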

df = convert_to_pandas(file_datum)
return df

#currently just check the code with a simple amber ti output file
@orbeckst orbeckst Nov 3, 2017

remove the whole main section (also, the print statements use Python 2 syntax, but we run under both Python 2 and Python 3 – this makes Travis fail at the moment).

Collaborator Author

Got it, the main section has been removed now.

@orbeckst
Member

orbeckst commented Nov 3, 2017

This is starting to look good.

We would now like to get test coverage up. If you look at https://codecov.io/gh/alchemistry/alchemlyb/src/master/src/alchemlyb/parsing/amber.py (follow the codecov/project report and look for the overall coverage and navigate in the files) there are many lines in the function parsing.amber.file_validation() that are not exercised by the tests. We would want to have tests that actually show that you catch these errors as expected.

The best thing to do would be to include small files with errors and then just test the file_validation() function itself on these files. It is difficult to check that a log message was printed, but you can test that file_validation() returns False for these cases; a sketch follows below. (As a side note, if you need to test exceptions and warnings, pytest also makes this straightforward.)
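(A hedged sketch of such a test; the load_invalidfiles accessor and the bunch layout are assumptions about how the broken files might be exposed in alchemtest:)

from alchemlyb.parsing.amber import file_validation
from alchemtest.amber import load_invalidfiles  # assumed accessor name

def test_invalid_files_rejected():
    # assumption: the bunch exposes a flat list of deliberately broken files
    for filename in load_invalidfiles()['data'][0]:
        assert not file_validation(filename)

# pytest.raises(...) and pytest.warns(...) can be used analogously if
# exceptions or warnings ever need to be asserted.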

(This is where finalizing PRs becomes painful and boring but it is vital for the long-term survivability of the code that we check things now and are then assured that in the future everything will still work or we learn when it doesn't.)

Other comments on the code.

def __init__(self, filename):
    """Opens a file according to its file type."""
    self.filename = filename
    with open(filename, 'r') as f:
Member

This approach with _MAGIC_CMPR can be replaced with util.anyopen(); see, for example, https://github.com/alchemistry/alchemlyb/blob/master/src/alchemlyb/parsing/gmx.py#L143 . That's cleaner and more expandable. (I hadn't picked up on it in my earlier code reviews.)

(You can do it yourself or I am also happy to finish the PR as is and do this later and separately.)
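(A minimal sketch of the anyopen-based constructor; the class name here is a placeholder, not necessarily the one used in the parser:)

from alchemlyb.parsing.util import anyopen

class SectionParser(object):
    def __init__(self, filename):
        """Open a file transparently, whether plain text or compressed."""
        self.filename = filename
        # anyopen dispatches on the file type, so the _MAGIC_CMPR table
        # and the explicit open() call are no longer needed
        self.fileh = anyopen(self.filename, 'r')
        self.lineno = 0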

Collaborator Author

Got it, will try to switch to anyopen

@shuail
Collaborator Author

shuail commented Nov 3, 2017

@orbeckst Got it, I will add some tests for the file validation function. For the test output files, I plan to create an invalidfiles folder under alchemtest/amber and use the alchemtest framework in alchemlyb. Does this sound good?

@orbeckst
Member

orbeckst commented Nov 3, 2017

Yes, good idea. Just check that the files are properly included when you run pip install, i.e., proper inclusion in setup.py (and perhaps MANIFEST.in... if we use it).
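(A minimal sketch of what "properly included" usually means with setuptools; the glob patterns are placeholders, not the actual alchemtest configuration:)

from setuptools import setup, find_packages

setup(
    name='alchemtest',
    packages=find_packages(),
    # placeholder globs: ship the Amber data, including the new invalidfiles
    # folder, so the files are present after a pip install
    package_data={'alchemtest': ['amber/*/*', 'amber/invalidfiles/*']},
)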

@shuail
Collaborator Author

shuail commented Nov 13, 2017

@orbeckst Just finished adding the test code here and the test files (in alchemtest) to test the file validation function in the Amber parser, and switched the file opener to anyopen. Hope the tests will pass this time :)

@orbeckst
Member

Looking good, just waiting for PR alchemistry/alchemtest#11.

@orbeckst orbeckst left a comment

Using anyopen() will require a few more changes, which I hadn't realized earlier.

If this becomes too difficult I would suggest reverting the change where you introduced anyopen and using the original implementation for this PR. Then raise an issue to use anyopen and we can sort it out in a separate PR.

EDIT: Well, thinking about this again: try the original code — maybe it works and I am wrong.

Other minor things: see comments.

DVDL_COMPS = ['BOND', 'ANGLE', 'DIHED', '1-4 NB', '1-4 EEL', 'VDWAALS',
              'EELEC', 'RESTRAINT']
_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'
_MAGIC_CMPR = {
Member

not needed anymore since the switch to anyopen. Please remove.

Collaborator Author

removed

import re
import pandas as pd
import numpy as np
import logging
Member

empty line between standard library imports and package imports .... but this is nit-picking on my part ;-)

Member

logging would come after re and before numpy because it is standard lib... according to PEP8

Collaborator Author

Got it, pushed a fix.

yield first

while it:
    yield it.next()
Member

For Python 3 this should be

next(it)

(I think)

Collaborator Author

Yes, changed it to next(it).

        match = re.search(pattern, line)
        if match:
            break
    return self.fileh.tell() != self.filesize
Member

This is going to break when we use anyopen because we never uncompress the file to disk and thus the file size of the compressed file is not related to the position in the uncompressed stream fileh.

(More specifically: For a compressed file, self.filesize < self.fileh.tell() for some positions so you might get a random occurrence where this is False in the middle of the file. Conversely, you won't get True at the real end of the file.)

I think this is supposed to check for EOF (end of file). This would need to be implemented differently, depending on how skip_after() is used. Perhaps it can be rewritten along the lines of https://stackoverflow.com/a/24738688/334357 (see the sketch below)?
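(One hedged way to write skip_after() as a method without relying on the file size, returning True only if the pattern was actually found; details are illustrative:)

import re

def skip_after(self, pattern):
    """Skip lines until *pattern* matches; True if found, False at EOF."""
    for line in self.fileh:
        self.lineno += 1
        if re.search(pattern, line):
            return True
    return False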

Member

Well, thinking about this again: try it out, maybe it works and I am wrong.

Collaborator Author

Yes, I checked that self.filesize < self.fileh.tell() for the zipped file, and the two are not equal at the end of the file, so I changed the skip_after() function to not use this check but instead explicitly return True if the pattern was found while iterating over the whole file.

"""Read next line of the filehandle and check for EOF."""
self.lineno += 1
curr_pos = self.fileh.tell()
if curr_pos == self.filesize:
Member

This will break with compression (see comment above).

Not even sure why this is needed. Can't we just say

def next(self):
  self.lineno += 1
  return next(self.fileh)

This should raise StopIteration when at the end and it should return the line.

Collaborator Author

Yes, that's a better solution. Sorry, I didn't think about that part much when migrating the code from alchemical-analysis. It looks like this test successfully catches the untested parts of the original code!

return file_datum

def extract_dHdl(outfile):
    """Return gradients `dH/dl` from Amebr TI outputfile
Member

typo: Amber

Collaborator Author

Fixed

self.filename = filename
try:
    self.fileh = anyopen(self.filename, 'r')
    self.filesize = os.stat(self.filename).st_size
Member

This cannot be used for compressed files (see other comments).

I think this line can be deleted because it is useless.

Member

Well, thinking about this again: try it out, maybe it works and I am wrong.

Collaborator Author

Yes, I think you are right, the filesize check does not work here for zipped files; I removed the filesize-related detection from the code.

@orbeckst
Member

orbeckst commented Nov 13, 2017

I merged alchemistry/alchemtest#11 so you should now have the appropriate test files available for Travis CI. I will manually restart test #90. You don't need to do anything until the test results come back.

import pandas as pd
import numpy as np

from util import anyopen
Member

Needs to be

from .util import anyopen


Collaborator Author

Oops, yes, fixed this

@shuail
Collaborator Author

shuail commented Nov 13, 2017

@orbeckst thanks for your comments! Yes, after I switched to the zipped files, the end-of-file (EOF) detection in the skip_after function no longer works, so I changed that part and also the next() function to make them suitable for zipped files. anyopen actually works fine here.

@orbeckst
Member

One small change to get it to pass under python 3.

@orbeckst
Member

Once the tests pass, we have the coverage. Assuming that this looks good, we need to make sure that it can be merged, and I want to clean up the PR, i.e., merge related commits and remove the subsampling code that was added and then deleted. I can do this locally with git rebase.

A consequence for you is that you will also have to rebase your fork, because you have been working from master instead of a branch. I can tell you what I would do if you want to know.

@orbeckst
Member

This is looking very good. The only way that you can sensibly improve the coverage is by adding a test for the function

parsing.amber.any_none(sequence)

Basically, make sure that line 42 is hit. Any of the other remaining checks are fairly standard defensive programming and I am not even sure how to test them properly.
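(Something as small as this would do; sketched here with assumed imports:)

from alchemlyb.parsing.amber import any_none

def test_any_none():
    assert any_none([1, 2, None, 4])
    assert not any_none([1, 2, 3, 4])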

@orbeckst orbeckst left a comment

One tiny additional test, please, then I am happy ;-)


for element in sequence:
    if element is None:
        return True

Collaborator Author

Got a test function added for this line

@orbeckst
Member

@shuail looking good, I'll do the rebase... hopefully done before I have to run to class...

@orbeckst orbeckst mentioned this pull request Nov 13, 2017
@orbeckst
Member

I created PR #36 to finalize.

@orbeckst orbeckst closed this Nov 13, 2017