parquet parser for dataframe serialisation #317

xiki-tempula · 2023-05-23T13:51:39Z

This PR adds a parquet parser, which loads a serialised dataframe and permits usage such as:

    from alchemlyb.parsing.parquet import extract_dHdl, extract_u_nk
    import pandas as pd

    # u_nk and dHdl are dataframes produced earlier by an alchemlyb parser
    u_nk.to_parquet(path='u_nk.parquet', index=True)
    dHdl.to_parquet(path='dHdl.parquet', index=True)

    # T is passed again when the serialised dataframes are read back
    new_u_nk = extract_u_nk('u_nk.parquet', T=300)
    new_dHdl = extract_dHdl('dHdl.parquet', T=300)

Fix #316

codecov · 2023-05-23T14:02:47Z

Codecov Report

Merging #317 (7347a9c) into master (4e590cc) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #317      +/-   ##
==========================================
+ Coverage   98.74%   98.75%   +0.01%     
==========================================
  Files          26       27       +1     
  Lines        1754     1772      +18     
  Branches      382      387       +5     
==========================================
+ Hits         1732     1750      +18     
  Misses          2        2              
  Partials       20       20

Impacted Files	Coverage Δ
src/alchemlyb/parsing/parquet.py	`100.00% <100.00%> (ø)`
src/alchemlyb/workflows/abfe.py	`99.67% <100.00%> (+<0.01%)`	⬆️

xiki-tempula · 2023-06-01T10:45:14Z

@orbeckst Would you mind reviewing this PR, please? Thank you.

orbeckst

Basic functionality looks good, but I have some requests for changes:

- docs and clearer code (see comments)
- update all build files with the new pyarrow dependency: environment.yml, setup.py, and raise an issue in https://github.com/conda-forge/alchemlyb-feedstock/ to add pyarrow to the deps in the meta.yml of the conda-forge package (and link to this issue)

docs/parsing.rst

src/alchemlyb/parsing/parquet.py

orbeckst · 2023-06-01T15:43:50Z

src/alchemlyb/parsing/parquet.py

+    path : str
+        Path to parquet file to extract dataframe from.
+    T : float
+        Temperature in Kelvin the simulations sampled.


Suggested change

-        Temperature in Kelvin the simulations sampled.
+        Temperature in Kelvin of the simulations.

State that T is ignored but included for API compatibility

Sorry, but parquet serialisation doesn't preserve df.attrs, so the dataframe loaded here doesn't contain the temperature. The temperature is assigned via this T.
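To illustrate the point about T, here is a minimal sketch of how the temperature could be re-attached after loading. The helper names `attach_temperature` and `load_with_temperature` are hypothetical and are not taken from this PR. The attrs keys `temperature` and `energy_unit` (with value `kT`) are assumed to follow the convention used elsewhere in alchemlyb.

```python
import pandas as pd


def attach_temperature(df: pd.DataFrame, T: float) -> pd.DataFrame:
    """Store the temperature in df.attrs (hypothetical helper)."""
    # Assumption: alchemlyb keeps its metadata under these two keys.
    df.attrs["temperature"] = T
    df.attrs["energy_unit"] = "kT"
    return df


def load_with_temperature(path: str, T: float) -> pd.DataFrame:
    """Read a parquet file and attach T, since the file may not carry it."""
    # pd.read_parquet needs a parquet engine such as pyarrow to be installed.
    return attach_temperature(pd.read_parquet(path), T)


# Stand-in for a frame that came back from disk without its attrs.
reloaded = pd.DataFrame({"fep": [0.1, 0.2]})
restored = attach_temperature(reloaded, T=300)
```

Under these assumptions, T has to come from the caller because nothing in the reloaded frame records it.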

src/alchemlyb/parsing/parquet.py

orbeckst · 2023-06-01T15:52:47Z

CHANGES

+
+Enhancements
+  - Add a parser to read serialised pandas dataframe (parquet) (issue #316, PR#317).
+


Also mention the enhancement to the ABFE workflow

xiki-tempula · 2023-06-01T16:21:02Z

@orbeckst Thanks for the review. I have changed environment.yml. I will raise an issue about meta.yml when we do a release.

The T problem is a bit tricky because parquet serialisation doesn't preserve df.attrs. The dataframe loaded here therefore doesn't contain the temperature, and the temperature is assigned via the extract function. I will add a note to state that.

orbeckst · 2023-06-01T16:24:26Z

So the T problem is a bit tricky in that parquet serialisation doesn't preserve the df.attrs. So the dataframe loaded here doesn't contain temperature and the temperature is assigned via extract function. I will add a note to state that.

Ok, I blithely assumed that parquet would save everything — then T is not optional. Definitely document this requirement.

xiki-tempula · 2023-06-01T16:28:52Z

@orbeckst

Ok, I blithely assumed that parquet would save everything

I'm annoyed by this as well, but parquet still seems to be the best serialiser. Besides to_pickle, it is the only format that preserves the index, and it faithfully preserves all the data in its original datatype. to_pickle preserves everything, but it is slow to read and write and is not safe across different versions of pandas.
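One way to check claims like these for a given setup is to compare a frame before and after a save/load cycle. The sketch below uses a hypothetical helper, `roundtrip_report`, and toy data. It exercises only pickle and CSV so that it runs without a parquet engine.

```python
import io
import pickle

import pandas as pd


def roundtrip_report(original: pd.DataFrame, reloaded: pd.DataFrame) -> dict:
    """Report which properties of `original` survived a save/load cycle."""
    return {
        "index": original.index.equals(reloaded.index),
        "dtypes": list(original.dtypes) == list(reloaded.dtypes),
        "attrs": original.attrs == reloaded.attrs,
    }


# Toy frame (made-up values) with a MultiIndex and metadata in attrs.
index = pd.MultiIndex.from_tuples(
    [(0.0, 0.0), (10.0, 0.0)], names=["time", "fep-lambda"]
)
df = pd.DataFrame({"a": [1.5, 2.5], "b": [3, 4]}, index=index)
df.attrs["temperature"] = 300

# Pickle round trip through memory, and a plain CSV round trip.
via_pickle = pickle.loads(pickle.dumps(df))
via_csv = pd.read_csv(io.StringIO(df.to_csv()))

pickle_report = roundtrip_report(df, via_pickle)
csv_report = roundtrip_report(df, via_csv)
```

With pyarrow installed, the same helper can be pointed at a frame written with `to_parquet` and read back with `pd.read_parquet` to see what that format keeps in the installed pandas version.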

Definitely document this requirement.

I have added this as a note to both alchemlyb.parsing.parquet.extract_u_nk and alchemlyb.parsing.parquet.extract_dHdl.

orbeckst · 2023-06-01T18:41:26Z

Let me know when I need to review again.

xiki-tempula · 2023-06-01T19:10:27Z

@orbeckst I addressed the comments. Do you mind having another review? Thank you.

orbeckst

lgtm

orbeckst · 2023-06-01T19:17:45Z

Please squash-merge when ready.

update

399acfa

xiki-tempula linked an issue May 23, 2023 that may be closed by this pull request

An engine agonistic parser #316

Closed

update

336ef26

xiki-tempula requested a review from orbeckst May 23, 2023 15:15

orbeckst requested changes Jun 1, 2023

xiki-tempula and others added 6 commits June 1, 2023 17:05

Update docs/parsing.rst

a6adc53

Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

Update src/alchemlyb/parsing/parquet.py

8e285b4

Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

Update src/alchemlyb/parsing/parquet.py

4278220

Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

Update src/alchemlyb/parsing/parquet.py

56b97d9

Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

update

420d254

Merge remote-tracking branch 'upstream/316-an-engine-agonistic-parser…

0d8af67

…' into 316-an-engine-agonistic-parser

update

7347a9c

Merge branch 'master' into 316-an-engine-agonistic-parser

b830fcd

orbeckst approved these changes Jun 1, 2023

xiki-tempula merged commit 1d0a111 into master Jun 1, 2023

xiki-tempula deleted the 316-an-engine-agonistic-parser branch June 1, 2023 19:17

orbeckst assigned xiki-tempula Jun 1, 2023

xiki-tempula mentioned this pull request Jun 1, 2023

New dependency to be added to the 2.1.0 release #321

Closed
