add amber TI parser and improve the subsampling code #32
@@ -0,0 +1,265 @@
"""Parsers for extracting alchemical data from Amber output files.

Most of the file parsing is adapted from alchemical-analysis; the final
format is changed to pandas to be consistent with the alchemlyb format.
"""

import pandas as pd
import re
import numpy as np
import os
Reviewer: Use logging instead of print(). See more comments below. Note that in
order to see the messages, a root logger has to be created elsewhere:

    import logging

    def create_alchemlyb_logger(logfile="alchemlyb.log", name="alchemlyb"):
        """Create a logger that outputs to screen and to `logfile`."""
        logger = logging.getLogger(name)
        logger.setLevel(logging.DEBUG)
        # handler that writes to logfile
        logfile_handler = logging.FileHandler(logfile)
        logfile_formatter = logging.Formatter('%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
        logfile_handler.setFormatter(logfile_formatter)
        logger.addHandler(logfile_handler)
        # define a Handler which writes INFO messages or higher to sys.stderr
        console_handler = logging.StreamHandler()
        console_handler.setLevel(logging.INFO)
        # set a format which is simpler for console use
        formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
        return logger

    rootlogger = create_alchemlyb_logger()

We might add this convenience function elsewhere in the library; I opened issue
#34. Note that you only have to create the top-level logger (here called
"alchemlyb") once. P.S.: I wrote the code above for MDAnalysis.

Author: Yes, will switch to logging instead of printing. Just to make sure that
I understand the logic correctly: what I need to do is to add `import logging`
at the beginning of the code and then start logging like
`logging.info("Write some logging info here")`, is that correct?

Reviewer: Yes, exactly. (Just be aware that until a logger named "alchemlyb" has
been created, no logging will be visible, but that's expected and desired
behavior. With #34 I will add code to the library to start logging
automatically.)
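Following the thread above, a module-level logger for this parser might look as
follows. This is a minimal sketch; the logger name `alchemlyb.parsing.amber` and
the `warn_bad_file` helper are assumptions for illustration, not the library's
actual API.

```python
import logging

# Module-level logger; its records propagate up to the "alchemlyb" logger
# once that has been configured (e.g. by create_alchemlyb_logger() above).
logger = logging.getLogger("alchemlyb.parsing.amber")

def warn_bad_file(outfile):
    # Hypothetical helper replacing the bare print() calls in the parser
    logger.warning("file %r does not contain any useful data, "
                   "ignoring file", outfile)
```

Until a handler is attached to the "alchemlyb" logger, these records are simply
discarded, which is the behavior the reviewer describes.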
def convert_to_pandas(file_datum, ):

Reviewer: doc string

    data_dic = {}
    data_dic["dHdl"] = []
    data_dic["lambdas"] = []
    data_dic["time"] = []
    for frame_index, frame_dhdl in enumerate(file_datum.gradients):
        data_dic["dHdl"].append(frame_dhdl)
        data_dic["lambdas"].append(file_datum.clambda)
        # here we need to convert dt to ps unit from ns
        frame_time = file_datum.t0 + (frame_index + 1) * file_datum.dt * 1000
        data_dic["time"].append(frame_time)
    df = pd.DataFrame(data_dic["dHdl"], columns=["dHdl"],
                      index=pd.Float64Index(data_dic["time"], name='time'))
    df["lambdas"] = data_dic["lambdas"][0]
    df = df.reset_index().set_index(['time'] + ['lambdas'])
    return df
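To illustrate the target layout, here is a self-contained sketch of the same
conversion with hypothetical values standing in for a parsed FEData object.
Note that `pd.Float64Index` is deprecated in recent pandas;
`pd.Index(..., dtype=float)` is the equivalent spelling used here.

```python
from types import SimpleNamespace
import pandas as pd

# Hypothetical parsed data: lambda value, start time t0 (ps), dt (ns), gradients
datum = SimpleNamespace(clambda=0.5, t0=0.0, dt=0.001, gradients=[1.1, 1.3, 0.9])

# convert dt from ns to ps, one time stamp per gradient frame
times = [datum.t0 + (i + 1) * datum.dt * 1000
         for i in range(len(datum.gradients))]

df = pd.DataFrame({"dHdl": datum.gradients},
                  index=pd.Index(times, name="time", dtype=float))
df["lambdas"] = datum.clambda
# final alchemlyb layout: a (time, lambdas) MultiIndex with a dHdl column
df = df.reset_index().set_index(["time", "lambdas"])
```

The resulting frame has one row per snapshot, indexed by time in ps and by the
lambda at which the simulation ran.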
DVDL_COMPS = ['BOND', 'ANGLE', 'DIHED', '1-4 NB', '1-4 EEL', 'VDWAALS',
              'EELEC', 'RESTRAINT']
_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'
_MAGIC_CMPR = {
    '\x1f\x8b\x08': ('gzip', 'GzipFile'),  # last byte is compression method
    '\x42\x5a\x68': ('bz2', 'BZ2File')
}

Reviewer: not needed anymore since switch to ...
Author: removed
def any_none(sequence):
    """Check if any element of a sequence is None."""
    for element in sequence:
        if element is None:
            return True
    return False

Reviewer: This is looking very good. The only way that you can sensibly improve
the coverage is by adding a test for the function
parsing.amber.any_none(sequence). Basically, make sure that line 42 is hit. Any
of the other remaining checks are fairly standard defensive programming and I am
not even sure how to test them properly.
Author: Got a test function added for this line.
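A minimal pytest-style test along the lines the reviewer requested could look
like this; the helper is repeated so the sketch is self-contained, and the test
name is only a suggestion.

```python
def any_none(sequence):
    """Check if any element of a sequence is None."""
    for element in sequence:
        if element is None:
            return True
    return False

def test_any_none():
    # hit both the early-return branch and the fall-through
    assert any_none([1.0, None, 3.0])
    assert not any_none([1.0, 2.0, 3.0])
```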
def _pre_gen(it, first):
    """A generator that returns `first` first, if it exists."""
    if first:
        yield first
    while it:
        yield it.next()

Reviewer: For Python 3 this should be next(it) (I think).
Author: Yes, changed it to next(it).
class SectionParser(object):
    """
    A simple parser to extract data values from sections.
    """
    def __init__(self, filename):
        """Opens a file according to its file type."""
        self.filename = filename
        with open(filename, 'rb') as f:
            magic = f.read(3)  # NOTE: works because all 3-byte headers
        try:
            method = _MAGIC_CMPR[magic]
        except KeyError:
            open_it = open
        else:
            open_it = getattr(__import__(method[0]), method[1])
        try:
            self.fileh = open_it(self.filename, 'rb')
            self.filesize = os.stat(self.filename).st_size
        except IOError:
            raise SystemExit('ERROR: cannot open file %s' % filename)
        self.lineno = 0

Reviewer: self.filesize cannot be used for compressed files (see other
comments). I think this line can be deleted because it is useless.
Reviewer: Well, thinking about this again: try it out, maybe it works and I am
wrong.
Author: Yes, I think you are right, the filesize is not working here for zipped
files; removed the filesize-related detection in the code.
    def skip_lines(self, nlines):
        """Skip a given number of lines."""
        lineno = 0
        for line in self:
            lineno += 1
            if lineno > nlines:
                return line
        return None

Reviewer: PEP8 formatting: leave a blank line between methods (here and for the
methods below).
    def skip_after(self, pattern):
        """Skip until after a line that matches a regex pattern."""
        for line in self:
            match = re.search(pattern, line)
            if match:
                break
        return self.fileh.tell() != self.filesize

Reviewer: This is going to break when we use compressed files. (More
specifically: for a compressed file, tell() reports positions in the
decompressed stream, not in the file on disk.) I think this is supposed to
check for EOF (end of file). This would need to be implemented differently;
depends on how the file object behaves at EOF.
Reviewer: Well, thinking about this again: try it out, maybe it works and I am
wrong.
Author: Yes, I checked that the ...
    def extract_section(self, start, end, fields, limit=None, extra='',
                        debug=False):
        """
        Extract data values (int, float) in fields from a section
        marked with start and end regexes. Do not read further than
        limit regex.
        """
        inside = False
        lines = []
        for line in _pre_gen(self, extra):
            if limit and re.search(limit, line):
                break
            if re.search(start, line):
                inside = True
            if inside:
                if re.search(end, line):
                    break
                lines.append(line.rstrip('\n'))
        line = ''.join(lines)
        result = []
        for field in fields:
            match = re.search(r' %s\s+=\s+(\*+|%s|\d+)'
                              % (field, _FP_RE), line)
            if match:
                value = match.group(1)
                # FIXME: assumes fields are only integers or floats
                if '*' in value:  # Fortran format overflow
                    result.append(float('Inf'))
                # NOTE: check if this is a sufficient test for int
                elif '.' not in value and re.search(r'\d+', value):
                    result.append(int(value))
                else:
                    result.append(float(value))
            else:  # section may be incomplete
                result.append(None)
        return result
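The field regex used above can be exercised in isolation. This sketch wraps it
in a hypothetical `extract_field` helper (not part of the PR) to show the three
value branches: Fortran `*` overflow, plain integer, and float.

```python
import re

_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'

def extract_field(line, field):
    """Parse one 'FIELD = value' pair the way extract_section() does."""
    match = re.search(r' %s\s+=\s+(\*+|%s|\d+)' % (field, _FP_RE), line)
    if match is None:
        return None
    value = match.group(1)
    if '*' in value:                 # Fortran format overflow, e.g. ********
        return float('Inf')
    if '.' not in value and re.search(r'\d+', value):
        return int(value)            # plain digits -> int
    return float(value)              # everything else -> float
```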
    def __iter__(self):
        return self

    def next(self):
        """Read next line of the filehandle and check for EOF."""
        self.lineno += 1
        curr_pos = self.fileh.tell()
        if curr_pos == self.filesize:
            raise StopIteration
        # NOTE: can't mix next() with seek()
        return self.fileh.readline()

Reviewer: This will break with compression (see comment above). Not even sure
why this is needed. Can't we just say

    def next(self):
        self.lineno += 1
        return next(self.fileh)

This should raise StopIteration at EOF.
Author: Yes, that's a better solution. Sorry, I didn't think about that part
much when migrating the code from alchemical-analysis. It looks like this test
successfully caught the untested part of the original code!
    def close(self):
        """Close the filehandle."""
        self.fileh.close()

    def __enter__(self):
        return self

    def __exit__(self, typ, value, traceback):
        self.close()
class FEData(object):
    """A simple struct container to collect data from individual files."""

    #__slots__ = ['clambda', 't0', 'dt', 'T', 'gradients',
    #             'component_gradients', 'mbar_energies']
    __slots__ = ['clambda', 't0', 'dt', 'T', 'gradients',
                 'component_gradients']

Reviewer: remove commented-out code if not necessary.

    def __init__(self):
        self.clambda = -1.0
        self.t0 = -1.0
        self.dt = -1.0
        self.T = -1.0
        self.gradients = []
        self.component_gradients = []
        #self.mbar_energies = []
def file_validation(outfile, ):

Reviewer: doc string; also remove comma + white space in the arg list.

    invalid = False
    with SectionParser(outfile) as secp:
        line = secp.skip_lines(5)
        if not line:
            print(' WARNING: file does not contain any useful data, '
                  'ignoring file')
            invalid = True

Reviewer: turn this into

    logger.warn("file %r does not contain any useful data, skipping file", outfile)

Add the filename to the error message; I assume that ...
Reviewer: If you think that users should always see the message, use ...

        if not secp.skip_after('^ 2. CONTROL DATA FOR THE RUN'):
            print(' WARNING: no CONTROL DATA found, ignoring file')
            invalid = True

Reviewer: use logger.

        ntpr, = secp.extract_section('^Nature and format of output:', '^$',
                                     ['ntpr'])
        nstlim, dt = secp.extract_section('Molecular dynamics:', '^$',
                                          ['nstlim', 'dt'])
        T, = secp.extract_section('temperature regulation:', '^$',
                                  ['temp0'])
        if not T:
            raise SystemExit('ERROR: Non-constant temperature MD not '
                             'currently supported')
            invalid = True

Reviewer: Do not raise SystemExit. This is a library, and calling code might
want to handle the error; raise a more appropriate exception instead.

        clambda, = secp.extract_section('^Free energy options:', '^$',
                                        ['clambda'], '^---')
        if clambda is None:
            print(' WARNING: no free energy section found, ignoring file')
            invalid = True

Reviewer: use logging.

        if not secp.skip_after('^ 3. ATOMIC '):
            print(' WARNING: no ATOMIC section found, ignoring file\n')
            invalid = True

Reviewer: use logging.

        t0, = secp.extract_section('^ begin time', '^$', ['coords'])
        if not secp.skip_after('^ 4. RESULTS'):
            print(' WARNING: no RESULTS section found, ignoring file\n')
            invalid = True

Reviewer: use logging.

    if invalid:
        return False
    else:
        file_datum = FEData()
        file_datum.clambda = clambda
        file_datum.t0 = t0
        file_datum.dt = dt
        file_datum.T = T
        return file_datum

Reviewer: I would simplify to

    if invalid:
        return False
    file_datum = FEData()
    file_datum.clambda = clambda
    file_datum.t0 = t0
    file_datum.dt = dt
    file_datum.T = T
    return file_datum
def extract_dHdl(outfile, ):
    file_datum = file_validation(outfile)
    if file_validation(outfile):

Reviewer: rewrite the conditional as an early return:

    if not file_validation(outfile):
        return None
    finished = False
    comps = []
    ...
    return df

        finished = False
        comps = []
        with SectionParser(outfile) as secp:
            line = secp.skip_lines(5)
            nensec = 0
            nenav = 0
            old_nstep = -1
            old_comp_nstep = -1
            high_E_cnt = 0

            in_comps = False
            for line in secp:
                if 'DV/DL, AVERAGES OVER' in line:
                    in_comps = True
                if line.startswith(' NSTEP'):
                    if in_comps:
                        # CHECK the result
                        result = secp.extract_section('^ NSTEP', '^ ---',
                                                      ['NSTEP'] + DVDL_COMPS,
                                                      extra=line)
                        if result[0] != old_comp_nstep and not any_none(result):
                            comps.append([float(E) for E in result[1:]])
                            nenav += 1
                            old_comp_nstep = result[0]
                        in_comps = False
                    else:
                        nstep, dvdl = secp.extract_section('^ NSTEP', '^ ---',
                                                           ['NSTEP', 'DV/DL'],
                                                           extra=line)
                        if nstep != old_nstep and dvdl is not None \
                           and nstep is not None:
                            file_datum.gradients.append(dvdl)
                            nensec += 1
                            old_nstep = nstep
                if line == ' 5. TIMINGS\n':
                    finished = True
                    break
            if not finished:
                print(' WARNING: prematurely terminated run')
            if not nensec:
                print(' WARNING: File %s does not contain any DV/DL data\n' %
                      outfile)
            print('%i data points, %i DV/DL averages' % (nensec, nenav))
            # at this step we get info stored in the FEData object for a given
            # amber out file
            file_datum.component_gradients.extend(comps)
            # convert file_datum to the pandas format to make it identical to
            # the alchemlyb output format
            df = convert_to_pandas(file_datum)
    else:
        df = None
    return df
# currently just check the code with a simple amber ti output file
# likely to switch to the alchemtest frame with more testing cases
if ("__main__") == (__name__):
    dataset = "./amber_dataset/ti-0.00.out"
    df = extract_dHdl(dataset)
    print "Check the df", df

Reviewer: remove the whole main section (also, the print statement is
Python 2 only).
Author: Got it, the main section is removed now.
@@ -4,6 +4,7 @@
 import numpy as np
 from pymbar.timeseries import statisticalInefficiency
 from pymbar.timeseries import detectEquilibration
+from pymbar.timeseries import subsampleCorrelatedData

Reviewer: The changes to ... It also helps if you raise an issue for changes to
the library, e.g., "Replace pandas-based subsampling with
pymbar.subsampleCorrelatedData". We can then discuss the change and link the PR
to an issue.
Reviewer: I am not reviewing these changes here because they will be in a
separate PR.
 def _check_multiple_times(df):

@@ -97,14 +98,22 @@ def statistical_inefficiency(df, series=None, lower=None, upper=None, step=None)

         # calculate statistical inefficiency of series
         statinef = statisticalInefficiency(series)

+        #use the subsampleCorrelatedData function to subsample the data
+        indices = subsampleCorrelatedData(series, g=statinef)
+        picked_time_index = []
+        #pick the time index for the pandas dataframe based on the python indices from subsample
+        for s_index, s_index_pair in enumerate(series.index):
+            if s_index in indices:
+                picked_time_index.append(s_index_pair[0])
+
-        # we round up
-        statinef = int(np.rint(statinef))
+        #statinef = int(np.rint(statinef))
         # subsample according to statistical inefficiency
-        series = series.iloc[::statinef]
+        #series = series.iloc[::statinef]

-        df = df.loc[series.index]
+        #df = df.loc[series.index]
+        df = df.loc[picked_time_index]
     else:
         df = slicing(df, lower=lower, upper=upper, step=step)
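The index bookkeeping in the new subsampling code can be illustrated without
pymbar: given positional `indices` (which `subsampleCorrelatedData` would
return), the time level of the MultiIndex is collected and then used with
`.loc`. The data and the subsampling result below are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Toy dHdl series with the (time, lambdas) MultiIndex used by alchemlyb
times = np.arange(0.0, 10.0)
index = pd.MultiIndex.from_arrays([times, np.full(10, 0.25)],
                                  names=["time", "lambdas"])
series = pd.Series(np.linspace(1.0, 2.0, 10), index=index)

# Hypothetical output of subsampleCorrelatedData(series, g=statinef)
indices = [0, 3, 6, 9]

# Map positional indices back to time values, as in the diff above
picked_time_index = [pair[0] for i, pair in enumerate(series.index)
                     if i in indices]
subsampled = series.loc[picked_time_index]
```

Selecting by the time values (partial indexing on the first MultiIndex level)
keeps the full (time, lambdas) index on the subsampled series.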
@@ -0,0 +1,21 @@
"""Amber parser tests.

"""

Reviewer: We would also like to have a test with at least one of the estimators.
Can you add an Amber test to ...

from alchemlyb.parsing.amber import extract_dHdl
from alchemtest.amber import load_simplesolvated


def test_dHdl():
    """Test that dHdl has the correct form when extracted from files.

    """
    dataset = load_simplesolvated()

    for leg in dataset['data']:
        for filename in dataset['data'][leg]:
            dHdl = extract_dHdl(filename,)

            assert dHdl.index.names == ['time', 'lambdas']
            assert dHdl.shape == (500, 1)

Reviewer: good start.
Reviewer: order imports in canonical order: ...
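The shape of the requested estimator test can be sketched without the real
datasets: build a synthetic dHdl frame and check a trapezoid-rule TI estimate.
All names and values here are made up for illustration; the real test would run
on `extract_dHdl` output with alchemlyb's own TI estimator.

```python
import numpy as np
import pandas as pd

def ti_estimate(dHdl):
    """Trapezoid-rule TI estimate from a (time, lambdas)-indexed dHdl frame."""
    means = dHdl.groupby(level="lambdas").mean()["dHdl"]
    lambdas = means.index.to_numpy(dtype=float)
    grads = means.to_numpy(dtype=float)
    widths = np.diff(lambdas)
    # integrate <dH/dl> over lambda
    return float(np.sum((grads[1:] + grads[:-1]) / 2.0 * widths))

def test_ti_estimate():
    # synthetic dHdl: <dH/dl> = 2, 1, 0 at lambda = 0, 0.5, 1
    index = pd.MultiIndex.from_tuples(
        [(t, lam) for lam in (0.0, 0.5, 1.0) for t in (1.0, 2.0)],
        names=["time", "lambdas"])
    dHdl = pd.DataFrame({"dHdl": [2.0, 2.0, 1.0, 1.0, 0.0, 0.0]}, index=index)
    assert ti_estimate(dHdl) == 1.0
```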