
parquet parser for dataframe serialisation #317

Merged (10 commits, Jun 1, 2023)
4 changes: 4 additions & 0 deletions CHANGES
@@ -17,6 +17,10 @@ The rules for this file:

* 2.1.0

Enhancements
- Add a parser to read serialised pandas dataframes (parquet) (issue #316, PR#317).
- workflow.ABFE allows parquet files as input (issue #316, PR#317).

Fixes
- Fix the case where visualisation.plot_convergence would fail when the final
error is NaN (issue #318, PR#317).
1 change: 1 addition & 0 deletions devtools/conda-envs/test_env.yaml
@@ -9,6 +9,7 @@ dependencies:
- scipy
- scikit-learn
- matplotlib
- pyarrow

# Testing
- pytest
24 changes: 24 additions & 0 deletions docs/parsing.rst
@@ -49,6 +49,29 @@ requires some care due to shortcomings in how pandas currently handles
metadata (see issue `pandas-dev/pandas#28283 <https://github.com/pandas-dev/pandas/issues/28283>`_).


Serialisation
'''''''''''''

Alchemlyb data structures (``dHdl`` and ``u_nk``) can be serialised as dataframes
and made persistent.
We use the `parquet <https://pandas.pydata.org/docs/user_guide/io.html#io-parquet>`_
format for serialising (writing) to and de-serialising (reading) from a
parquet file.

For serialisation we simply use the :meth:`pandas.DataFrame.to_parquet` method of
a :class:`pandas.DataFrame`. For loading alchemlyb data we provide the
:func:`alchemlyb.parsing.parquet.extract_dHdl` and
:func:`alchemlyb.parsing.parquet.extract_u_nk` functions, as shown in the example::

from alchemlyb.parsing.parquet import extract_dHdl, extract_u_nk
import pandas as pd

u_nk.to_parquet(path='u_nk.parquet', index=True)
dHdl.to_parquet(path='dHdl.parquet', index=True)

new_u_nk = extract_u_nk('u_nk.parquet', T=300)
new_dHdl = extract_dHdl('dHdl.parquet', T=300)


.. _dHdl:

@@ -211,4 +234,5 @@ See the documentation for the package you are using for more details on parser u
amber
namd
gomc
parquet

8 changes: 8 additions & 0 deletions docs/parsing/alchemlyb.parsing.parquet.rst
@@ -0,0 +1,8 @@


API Reference
-------------
This submodule includes these parsing functions:

.. autofunction:: alchemlyb.parsing.parquet.extract_u_nk
.. autofunction:: alchemlyb.parsing.parquet.extract_dHdl
1 change: 1 addition & 0 deletions environment.yml
@@ -8,4 +8,5 @@ dependencies:
- pymbar>=4
- scipy
- scikit-learn
- pyarrow
- matplotlib
1 change: 1 addition & 0 deletions setup.py
@@ -52,5 +52,6 @@
"scipy",
"scikit-learn",
"matplotlib",
"pyarrow",
],
)
84 changes: 84 additions & 0 deletions src/alchemlyb/parsing/parquet.py
@@ -0,0 +1,84 @@
import pandas as pd

from . import _init_attrs


@_init_attrs
def extract_u_nk(path, T):
r"""Return reduced potentials `u_nk` (unit: kT) from a pandas parquet file.

The parquet file should contain the dataframe output of any parser,
serialised with ``u_nk_df.to_parquet(path=path, index=True)``.

Parameters
----------
path : str
Path to parquet file to extract dataframe from.
T : float
Temperature in Kelvin of the simulations.

Returns
-------
u_nk : DataFrame
Potential energy for each alchemical state (k) for each frame (n).


Note
----
pyarrow serialisers handle float or string column names fine but will
convert a multi-lambda column name from `(0.0, 0.0)` to `"('0.0', '0.0')"`.
This parser restores the original tuple column names.
Parquet serialisation also does not preserve the :attr:`pandas.DataFrame.attrs`,
so the temperature is assigned in this function.


.. versionadded:: 2.1.0

"""
u_nk = pd.read_parquet(path)
columns = list(u_nk.columns)
if isinstance(columns[0], str) and columns[0][0] == "(":
new_columns = []
for column in columns:
new_columns.append(
tuple(
map(
float, column[1:-1].replace('"', "").replace("'", "").split(",")
)
)
)
u_nk.columns = new_columns
return u_nk
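The column-name restoration above can be exercised in isolation. A minimal sketch (the helper name `restore_column` is ours, purely for illustration, not part of alchemlyb):

```python
def restore_column(name):
    # pyarrow writes a tuple column name such as (0.0, 0.25) as the
    # string "('0.0', '0.25')"; parse it back into a tuple of floats.
    # Float or plain string column names are passed through unchanged.
    if isinstance(name, str) and name.startswith("("):
        return tuple(
            float(part)
            for part in name[1:-1].replace("'", "").replace('"', "").split(",")
        )
    return name

print(restore_column("('0.0', '0.25')"))  # (0.0, 0.25)
print(restore_column(0.5))  # 0.5
```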


@_init_attrs
def extract_dHdl(path, T):
r"""Return gradients `dH/dl` (unit: kT) from a pandas parquet file.

The parquet file should contain the dataframe output of any parser,
serialised with ``dHdl_df.to_parquet(path=path, index=True)``.

Parameters
----------
path : str
Path to parquet file to extract dataframe from.
T : float
Temperature in Kelvin at which the simulations were sampled.

Returns
-------
dH/dl : DataFrame
dH/dl as a function of time for this lambda window.

Note
----
Parquet serialisation does not preserve the :attr:`pandas.DataFrame.attrs`,
so the temperature is assigned in this function.


.. versionadded:: 2.1.0

"""
return pd.read_parquet(path)
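Because parquet stores only the table, the metadata in `DataFrame.attrs` has to be re-attached after reading; the `_init_attrs` decorator takes care of this. A hedged sketch of what such a decorator plausibly does (the decorator body and the exact attribute keys are assumptions here, not alchemlyb's actual implementation):

```python
import functools

import pandas as pd


def init_attrs(func):
    # Re-attach metadata that parquet serialisation does not store.
    @functools.wraps(func)
    def wrapper(path, T):
        df = func(path, T)
        df.attrs = {"temperature": T, "energy_unit": "kT"}  # assumed keys
        return df

    return wrapper


@init_attrs
def fake_extract(path, T):
    # Stand-in for pd.read_parquet(path) so the sketch runs without a file.
    return pd.DataFrame({"fep": [1.0, 2.0]})


df = fake_extract("unused.parquet", T=300)
print(df.attrs)  # {'temperature': 300, 'energy_unit': 'kT'}
```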
23 changes: 23 additions & 0 deletions src/alchemlyb/tests/parsing/test_parquet.py
@@ -0,0 +1,23 @@
import pytest

from alchemlyb.parsing.parquet import extract_u_nk, extract_dHdl


@pytest.mark.parametrize(
"dHdl_list", ["gmx_benzene_Coulomb_dHdl", "gmx_ABFE_complex_dHdl"]
)
def test_extract_dHdl(dHdl_list, request, tmp_path):
dHdl = request.getfixturevalue(dHdl_list)[0]
dHdl.to_parquet(path=str(tmp_path / "dhdl.parquet"), index=True)
new_dHdl = extract_dHdl(str(tmp_path / "dhdl.parquet"), T=300)
assert (new_dHdl.columns == dHdl.columns).all()
assert (new_dHdl.index == dHdl.index).all()


@pytest.mark.parametrize("u_nk_list", ["gmx_benzene_VDW_u_nk", "gmx_ABFE_complex_n_uk"])
def test_extract_u_nk(u_nk_list, request, tmp_path):
u_nk = request.getfixturevalue(u_nk_list)[0]
u_nk.to_parquet(path=str(tmp_path / "u_nk.parquet"), index=True)
new_u_nk = extract_u_nk(str(tmp_path / "u_nk.parquet"), T=300)
assert (new_u_nk.columns == u_nk.columns).all()
assert (new_u_nk.index == u_nk.index).all()
31 changes: 31 additions & 0 deletions src/alchemlyb/tests/test_workflow_ABFE.py
@@ -5,6 +5,7 @@
from alchemtest.amber import load_bace_example
from alchemtest.gmx import load_ABFE

import alchemlyb.parsing.amber
from alchemlyb.workflows.abfe import ABFE


@@ -397,3 +398,33 @@ def test_summary(self, workflow):
"""Test if the summary is right."""
summary = workflow.generate_result()
assert np.isclose(summary["TI"]["Stages"]["TOTAL"], 1.40405980473, 0.1)


class Test_automatic_parquet:
"""Test the full automatic workflow for load_ABFE from parquet data."""

@staticmethod
@pytest.fixture(scope="class")
def workflow(tmp_path_factory):
outdir = tmp_path_factory.mktemp("out")
for i, u_nk in enumerate(load_bace_example()["data"]["complex"]["vdw"]):
df = alchemlyb.parsing.amber.extract_u_nk(u_nk, T=298)
df.to_parquet(path=f"{outdir}/u_nk_{i}.parquet", index=True)

workflow = ABFE(
units="kcal/mol",
software="PARQUET",
dir=str(outdir),
prefix="u_nk_",
suffix="parquet",
T=298.0,
outdirectory=str(outdir),
)
workflow.read()
workflow.estimate(estimators="BAR")
return workflow

def test_summary(self, workflow):
"""Test if the summary is right."""
summary = workflow.generate_result()
assert np.isclose(summary["BAR"]["Stages"]["TOTAL"], 1.40405980473, 0.1)
17 changes: 12 additions & 5 deletions src/alchemlyb/workflows/abfe.py
@@ -12,7 +12,7 @@
from .. import concat
from ..convergence import forward_backward_convergence
from ..estimators import MBAR, BAR, TI, FEP_ESTIMATORS, TI_ESTIMATORS
from ..parsing import gmx, amber
from ..parsing import gmx, amber, parquet
from ..postprocessors.units import get_unit_converter
from ..preprocessing.subsampling import decorrelate_dhdl, decorrelate_u_nk
from ..visualisation import (
@@ -39,7 +39,7 @@ class ABFE(WorkflowBase):
The unit used for printing and plotting results. {'kcal/mol', 'kJ/mol',
'kT'}. Default: 'kT'.
software : str
The software used for generating input (case-insensitive). {'GROMACS', 'AMBER'}.
The software used for generating input (case-insensitive). {'GROMACS', 'AMBER', 'PARQUET'}.
This option chooses the appropriate parser for the input file.
dir : str
Directory in which data files are stored. Default: os.path.curdir.
@@ -64,7 +64,9 @@ class ABFE(WorkflowBase):
.. versionadded:: 1.0.0
.. versionchanged:: 2.0.1
The `dir` argument expects a real directory without wildcards and wildcards will no longer
work as expected. Use `prefix` to specify wildcard-based patterns to search under `dir`.
.. versionchanged:: 2.1.0
Serialised dataframes can be read by setting software='PARQUET'.
"""

def __init__(
@@ -86,8 +88,10 @@ def __init__(
f"{software}"
)
reg_exp = "**/" + prefix + "*" + suffix
if "*" in dir:
warnings.warn(
f"A real directory is expected in `dir`={dir}, wildcard expressions should be supplied to `prefix`."
)
if not Path(dir).is_dir():
raise ValueError(f"The input directory `dir`={dir} is not a directory.")
self.file_list = list(map(str, Path(dir).glob(reg_exp)))
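The file discovery above relies on :meth:`pathlib.Path.glob` with a `**/` prefix, which matches files both in the directory itself and in any subdirectory. A self-contained sketch of this behaviour (directory and file names are invented for the demonstration):

```python
import tempfile
from pathlib import Path

# Mirror the pattern built in __init__: "**/" + prefix + "*" + suffix.
prefix, suffix = "u_nk_", "parquet"
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "u_nk_0.parquet").touch()          # top-level match
    (Path(d) / "sub").mkdir()
    (Path(d) / "sub" / "u_nk_1.parquet").touch()  # nested match
    (Path(d) / "other.txt").touch()               # ignored by the pattern
    file_list = sorted(p.name for p in Path(d).glob("**/" + prefix + "*" + suffix))

print(file_list)  # ['u_nk_0.parquet', 'u_nk_1.parquet']
```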
@@ -105,6 +109,9 @@ def __init__(
elif software == "AMBER":
self._extract_u_nk = amber.extract_u_nk
self._extract_dHdl = amber.extract_dHdl
elif software == "PARQUET":
self._extract_u_nk = parquet.extract_u_nk
self._extract_dHdl = parquet.extract_dHdl
else:
raise NotImplementedError(f"{software} parser not found.")
