Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add amber TI parser and improve the subsampling code #32

Closed
wants to merge 14 commits into from
Closed
265 changes: 265 additions & 0 deletions src/alchemlyb/parsing/amber.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,265 @@
"""Parsers for extracting alchemical data from amber output files.
Most of the file parsing part are inheriting from alchemical-analysis
Change the final format to pandas to be consistent with the alchemlyb format
"""

import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order imports in canonical order:

  1. standard lib (os, re, ...)
  2. scientific python (numpy, pandas, ...)
  3. alchemlyb (if present)

import re
import numpy as np
import os

Copy link
Member

@orbeckst orbeckst Nov 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use logging instead of print:

import logging

logger = logging.getLogger("alchemlyb.parsers.Amber")

See more comments below.

Note that in order to see the messages, a root logger has to be created elsewhere:

import logging

def create_alchemlyb_logger(logfile="alchemlyb.log", name="alchemlyb"):
    """Create a logger that outputs to screen and to `logfile`"""

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)

    # handler that writes to logfile
    logfile_handler = logging.FileHandler(logfile)
    logfile_formatter = logging.Formatter('%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
    logfile_handler.setFormatter(logfile_formatter)
    logger.addHandler(logfile_handler)

    # define a Handler which writes INFO messages or higher to the sys.stderr
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    # set a format which is simpler for console use
    formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
    console_handler.setFormatter(formatter)
    logger.addHandler(console_handler)

    return logger

rootlogger = create_alchemlyb_logger()

We might add this convenience function elsewhere in the library. I opened issue #34.

Note that you only have to create the top level logger (here called rootlogger) once. Loggers are known globally in the interpreter, so you can "attach" to the root logger from anywhere else just by naming the new logger "alchemlyb.OTHER.STUFF".

P.S.: I wrote the code above for MDAnalysis (in MDAnalysis.lib.log) and I am placing this code snippet into the public domain. (This is necessary because MDAnalysis is under GPL v2 so we cannot just take code from there.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will switch to logging instead of printing.

Just want to make sure that I understand the logic correctly, so what I need to do is to add

import logging

logger = logging.getLogger("alchemlyb.parsers.Amber")

at the beginning of the code and start to logging like

logging.info ("Write some logging info here") is that correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly.

(Just be aware that until a logger named "alchemlyb" has been created, no logging will be visible, but that's expected and desired behavior. With #34 I will add code to the library to start logging automatically.)

def convert_to_pandas(file_datum, ):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc string

data_dic = {}
data_dic["dHdl"] = []
data_dic["lambdas"] = []
data_dic["time"] = []
for frame_index, frame_dhdl in enumerate(file_datum.gradients):
data_dic["dHdl"].append(frame_dhdl)
data_dic["lambdas"].append(file_datum.clambda)
#here we need to convert dt to ps unit from ns
frame_time = file_datum.t0 + (frame_index + 1) * file_datum.dt*1000
data_dic["time"].append(frame_time)
df = pd.DataFrame(data_dic["dHdl"], columns=["dHdl"], index =pd.Float64Index(data_dic["time"], name='time'))
df["lambdas"] = data_dic["lambdas"][0]
df = df.reset_index().set_index(['time'] + ['lambdas'])
return df

DVDL_COMPS = ['BOND', 'ANGLE', 'DIHED', '1-4 NB', '1-4 EEL', 'VDWAALS',
'EELEC', 'RESTRAINT']
_FP_RE = r'[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?'
_MAGIC_CMPR = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed anymore since switch to anyopen. Please remove.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

'\x1f\x8b\x08': ('gzip', 'GzipFile'), # last byte is compression method
'\x42\x5a\x68': ('bz2', 'BZ2File')
}

def any_none(sequence):
"""Check if any element of a sequence is None."""

for element in sequence:
if element is None:
return True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looking very good. The only way that you can sensibly improve the coverage is by adding a test for the function

parsing.amber.any_none(sequence)

Basically, make sure that the line 42 is hit. Any of the other remaining checks are fairly standard defensive programming and I am not even sure how to test them properly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got a test function added for this line


return False

def _pre_gen(it, first):
"""A generator that returns first first if it exists."""

if first:
yield first

while it:
yield it.next()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Python 3 this should be

next(it)

(I think)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, change it to next(it)


class SectionParser(object):
"""
A simple parser to extract data values from sections.
"""
def __init__(self, filename):
"""Opens a file according to its file type."""
self.filename = filename
with open(filename, 'rb') as f:
magic = f.read(3) # NOTE: works because all 3-byte headers
try:
method = _MAGIC_CMPR[magic]
except KeyError:
open_it = open
else:
open_it = getattr(__import__(method[0]), method[1])
try:
self.fileh = open_it(self.filename, 'rb')
self.filesize = os.stat(self.filename).st_size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot be used for compressed files (see other comments).

I think this line can be deleted because it is useless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, thinking about this again: try it out, maybe it works and I am wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you are right, the filesize is not working here for zipped files, remove the filesize related detection in the code.

except IOError:
raise SystemExit('ERROR: cannot open file %s' % filename)
self.lineno = 0
def skip_lines(self, nlines):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 formatting: leave a space between methods.

"""Skip a given number of files."""
lineno = 0
for line in self:
lineno += 1
if lineno > nlines:
return line
return None
def skip_after(self, pattern):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 formatting: leave a space between methods.

"""Skip until after a line that matches a regex pattern."""
for line in self:
match = re.search(pattern, line)
if match:
break
return self.fileh.tell() != self.filesize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to break when we use anyopen because we never uncompress the file to disk and thus the file size of the compressed file is not related to the position in the uncompressed stream fileh.

(More specifically: For a compressed file, self.filesize < self.fileh.tell() for some positions so you might get a random occurrence where this is False in the middle of the file. Conversely, you won't get True at the real end of the file.)

I think this is supposed to check for EOF (end of file). This would need to be implemented differently. Depends on how skip_after() is used. Perhaps can be rewritten along the lines of https://stackoverflow.com/a/24738688/334357 ??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, thinking about this again: try it out, maybe it works and I am wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I checked that the self.filesize < self.fileh.tell() for the zipped file and it will not be equal at the end of the file so I changed the skip_after() function to not use this detection but explicitly return True if the pattern was found along iterating the whole file

def extract_section(self, start, end, fields, limit=None, extra='',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 formatting: leave a space between methods.

debug=False):
"""
Extract data values (int, float) in fields from a section
marked with start and end regexes. Do not read further than
limit regex.
"""
inside = False
lines = []
for line in _pre_gen(self, extra):
if limit and re.search(limit, line):
break
if re.search(start, line):
inside = True
if inside:
if re.search(end, line):
break
lines.append(line.rstrip('\n'))
line = ''.join(lines)
result = []
for field in fields:
match = re.search(r' %s\s+=\s+(\*+|%s|\d+)'
% (field, _FP_RE), line)
if match:
value = match.group(1)
# FIXME: assumes fields are only integers or floats
if '*' in value: # Fortran format overflow
result.append(float('Inf') )
# NOTE: check if this is a sufficient test for int
elif '.' not in value and re.search(r'\d+', value):
result.append(int(value))
else:
result.append(float(value))
else: # section may be incomplete
result.append(None)
return result
def __iter__(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 formatting: leave a space between methods.

return self
def next(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 formatting: leave a space between methods.

"""Read next line of the filehandle and check for EOF."""
self.lineno += 1
curr_pos = self.fileh.tell()
if curr_pos == self.filesize:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break with compression (see comment above).

Not even sure why this is needed. Can't we just say

def next(self):
  self.lineno += 1
  return next(self.fileh)

This should raise StopIteration when at the end and it should return the line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a better solution. Sorry I didn't think that part much as migrating the code from alchemical-analysis. It looks like that this test successfully catch the untested part of the original code!

raise StopIteration
# NOTE: can't mix next() with seek()
return self.fileh.readline()
def close(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 formatting: leave a space between methods.

"""Close the filehandle."""
self.fileh.close()
def __enter__(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 formatting: leave a space between methods.

return self
def __exit__(self, typ, value, traceback):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 formatting: leave a space between methods.

self.close()

class FEData(object):
"""A simple struct container to collect data from individual files."""

#__slots__ = ['clambda', 't0', 'dt', 'T', 'gradients',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented out code if not necessary

# 'component_gradients', 'mbar_energies']
__slots__ = ['clambda', 't0', 'dt', 'T', 'gradients',
'component_gradients']

def __init__(self):
self.clambda = -1.0
self.t0 = -1.0
self.dt = -1.0
self.T = -1.0
self.gradients = []
self.component_gradients = []
#self.mbar_energies = []


def file_validation(outfile, ):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc string, also remove comma + white space in arg list

invalid = False
with SectionParser(outfile) as secp:
line = secp.skip_lines(5)
if not line:
print(' WARNING: file does not contain any useful data, '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turn print into

logger.warn("file %r does not contain any useful data, skipping file", outfile)

Add the filename to the error message; I assume that outfile was a filename.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think that users should always see the message, use warnings.warn(msg) in addition to the logger.

'ignoring file')
invalid = True
if not secp.skip_after('^ 2. CONTROL DATA FOR THE RUN'):
print(' WARNING: no CONTROL DATA found, ignoring file')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use logger

invalid = True
ntpr, = secp.extract_section('^Nature and format of output:', '^$',
['ntpr'])
nstlim, dt = secp.extract_section('Molecular dynamics:', '^$',
['nstlim', 'dt'])
T, = secp.extract_section('temperature regulation:', '^$',
['temp0'])
if not T:
raise SystemExit('ERROR: Non-constant temperature MD not '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not SystemExit in a library.

This is a library and calling code might want to handle it. Raise NotImplementedError instead.

'currently supported')
invalid = True
clambda, = secp.extract_section('^Free energy options:', '^$',
['clambda'], '^---')
if clambda is None:
print(' WARNING: no free energy section found, ignoring file')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use logging

invalid = True

if not secp.skip_after('^ 3. ATOMIC '):
print(' WARNING: no ATOMIC section found, ignoring file\n')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use logging

invalid = True

t0, = secp.extract_section('^ begin time', '^$', ['coords'])
if not secp.skip_after('^ 4. RESULTS'):
print(' WARNING: no RESULTS section found, ignoring file\n')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use logging

invalid = True
if invalid:
return False
else:
file_datum = FEData()
file_datum.clambda = clambda
file_datum.t0 = t0
file_datum.dt = dt
file_datum.T = T
return file_datum
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would simplify to

if invalid:
        return False

file_datum = FEData()
file_datum.clambda = clambda
file_datum.t0 = t0
file_datum.dt = dt
file_datum.T = T
return file_datum


def extract_dHdl(outfile, ):
file_datum = file_validation(outfile)
if file_validation(outfile):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewrite the if to terminate early, the following would be clearer:

if not file_validation(outfile):
   return None

finished = False
comps = []
...
return df

finished = False
comps = []
with SectionParser(outfile) as secp:
line = secp.skip_lines(5)
nensec = 0
nenav = 0
old_nstep = -1
old_comp_nstep = -1
high_E_cnt = 0

in_comps = False
for line in secp:
if 'DV/DL, AVERAGES OVER' in line:
in_comps = True
if line.startswith(' NSTEP'):
if in_comps:
#CHECK the result
result = secp.extract_section('^ NSTEP', '^ ---',
['NSTEP'] + DVDL_COMPS,
extra=line)
if result[0] != old_comp_nstep and not any_none(result):
comps.append([float(E) for E in result[1:]])
nenav += 1
old_comp_nstep = result[0]
in_comps = False
else:
nstep, dvdl = secp.extract_section('^ NSTEP', '^ ---',
['NSTEP', 'DV/DL'],
extra=line)
if nstep != old_nstep and dvdl is not None \
and nstep is not None:
file_datum.gradients.append(dvdl)
nensec += 1
old_nstep = nstep
if line == ' 5. TIMINGS\n':
finished = True
break
if not finished:
print(' WARNING: prematurely terminated run')
if not nensec:
print(' WARNING: File %s does not contain any DV/DL data\n' %
outfile)
print('%i data points, %i DV/DL averages' % (nensec, nenav))
#at this step we get info stored in the FEData object for a given amber out file
file_datum.component_gradients.extend(comps)
#convert file_datum to the pandas format to make it identical to alchemlyb output format
df = convert_to_pandas(file_datum)
else:
df = None
return df

#currently just check the code with a simple amber ti output file
Copy link
Member

@orbeckst orbeckst Nov 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the whole main section (also, the print use python 2.7 syntax but we run under python 2 and python 3 – this makes Travis fail at the moment).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, the main section got removed now.

#likely to switch to the alchmetest frame with more testing cases
if ("__main__") == (__name__):
dataset = "./amber_dataset/ti-0.00.out"
df = extract_dHdl(dataset)
print "Check the df", df
17 changes: 13 additions & 4 deletions src/alchemlyb/preprocessing/subsampling.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
import numpy as np
from pymbar.timeseries import statisticalInefficiency
from pymbar.timeseries import detectEquilibration
from pymbar.timeseries import subsampleCorrelatedData
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to subsampling are independent from the TI parser. Please remove from this PR and submit a new PR. This allows us to focus on one issue at a time, including reviewing and testing, and also makes the git history cleaner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also helps if you raise an issue for changes to the library, e.g., "Replace pandas-based subsampling with pymbar.subsampleCorrelatedData". We can then discuss the change and link the PR to an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not reviewing these changes here because they will be in a separate PR.



def _check_multiple_times(df):
Expand Down Expand Up @@ -97,14 +98,22 @@ def statistical_inefficiency(df, series=None, lower=None, upper=None, step=None)

# calculate statistical inefficiency of series
statinef = statisticalInefficiency(series)

#use the subsampleCorrelatedData function to subsample the data
indices = subsampleCorrelatedData(series, g=statinef)
picked_time_index = []
#pick the time index for the pandas dataframe based on the python indices from subsample
for s_index, s_index_pair in enumerate(series.index):
if s_index in indices:
picked_time_index.append(s_index_pair[0])

# we round up
statinef = int(np.rint(statinef))

#statinef = int(np.rint(statinef))
# subsample according to statistical inefficiency
series = series.iloc[::statinef]
#series = series.iloc[::statinef]

df = df.loc[series.index]
#df = df.loc[series.index]
df = df.loc[picked_time_index]
else:
df = slicing(df, lower=lower, upper=upper, step=step)

Expand Down
21 changes: 21 additions & 0 deletions src/alchemlyb/tests/parsing/test_amber.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
"""Amber parser tests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also like to have a test with at least one of the estimators. Can you add an Amber test to test_fep_estimators.py?


"""

from alchemlyb.parsing.amber import extract_dHdl
from alchemtest.amber import load_simplesolvated


def test_dHdl():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good start

"""Test that dHdl has the correct form when extracted from files.

"""
dataset = load_simplesolvated()

for leg in dataset['data']:
for filename in dataset['data'][leg]:
dHdl = extract_dHdl(filename,)

assert dHdl.index.names == ['time', 'lambdas']
assert dHdl.shape == (500, 1)