parquet parser for dataframe serialisation #317

Merged
xiki-tempula merged 10 commits into master from 316-an-engine-agonistic-parser on Jun 1, 2023

Conversation

xiki-tempula
Collaborator

@xiki-tempula xiki-tempula commented May 23, 2023

This PR adds a parquet parser that loads serialised dataframes, which permits usage such as:

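    from alchemlyb.parsing.parquet import extract_dHdl, extract_u_nk
    import pandas as pd

    # u_nk and dHdl are existing alchemlyb dataframes (pandas DataFrames).
    u_nk.to_parquet(path='u_nk.parquet', index=True)
    dHdl.to_parquet(path='dHdl.parquet', index=True)

    # The temperature T (in Kelvin) has to be supplied again when reading back.
    new_u_nk = extract_u_nk('u_nk.parquet', T=300)
    new_dHdl = extract_dHdl('dHdl.parquet', T=300)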

Fixes #316

@xiki-tempula xiki-tempula linked an issue May 23, 2023 that may be closed by this pull request
@codecov

codecov bot commented May 23, 2023

Codecov Report

Merging #317 (7347a9c) into master (4e590cc) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #317      +/-   ##
==========================================
+ Coverage   98.74%   98.75%   +0.01%     
==========================================
  Files          26       27       +1     
  Lines        1754     1772      +18     
  Branches      382      387       +5     
==========================================
+ Hits         1732     1750      +18     
  Misses          2        2              
  Partials       20       20              
| Impacted Files | Coverage Δ |
|---|---|
| src/alchemlyb/parsing/parquet.py | 100.00% <100.00%> (ø) |
| src/alchemlyb/workflows/abfe.py | 99.67% <100.00%> (+<0.01%) ⬆️ |

@xiki-tempula xiki-tempula requested a review from orbeckst May 23, 2023 15:15
@xiki-tempula
Collaborator Author

@orbeckst Would you mind reviewing this PR, please? Thank you.

Member

@orbeckst orbeckst left a comment

Basic functionality looks good, but I have some requests for changes.

docs/parsing.rst (outdated, resolved)
src/alchemlyb/parsing/parquet.py (outdated, resolved)
path : str
Path to parquet file to extract dataframe from.
T : float
Temperature in Kelvin the simulations sampled.
Member

Suggested change
- Temperature in Kelvin the simulations sampled.
+ Temperature in Kelvin of the simulations.

Member

State that T is ignored but included for API compatibility

Collaborator Author

Sorry, but parquet serialisation doesn't preserve df.attrs. The dataframe loaded here therefore doesn't contain the temperature, and the temperature is assigned from this T argument.
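
To illustrate the point, here is a minimal sketch of how a reader can re-attach the temperature after loading. It assumes the temperature lives under a `df.attrs` key called `temperature`; the helper names `attach_temperature` and `read_with_temperature` are hypothetical and are not the functions added in this PR.

```python
import pandas as pd


def attach_temperature(df: pd.DataFrame, T: float) -> pd.DataFrame:
    """Return a copy of df with the temperature recorded in df.attrs."""
    out = df.copy()
    # The "temperature" key is an assumption made for this illustration.
    out.attrs["temperature"] = T
    return out


def read_with_temperature(path: str, T: float) -> pd.DataFrame:
    """Read a parquet file and re-attach the caller-supplied temperature."""
    # pd.read_parquet needs a parquet engine such as pyarrow or fastparquet.
    return attach_temperature(pd.read_parquet(path), T)


# In-memory demonstration that does not touch the disk.
dHdl = pd.DataFrame(
    {"fep": [0.1, 0.2]}, index=pd.Index([0.0, 10.0], name="time")
)
restored = attach_temperature(dHdl, T=300)
print(restored.attrs["temperature"])
```

Because the value comes from the caller rather than from the file, it is the caller's job to pass the same T that the simulations were run at.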

src/alchemlyb/parsing/parquet.py (outdated, resolved)
src/alchemlyb/parsing/parquet.py (resolved)
src/alchemlyb/parsing/parquet.py (resolved)
src/alchemlyb/parsing/parquet.py (resolved)
CHANGES (outdated)

Enhancements
- Add a parser to read serialised pandas dataframe (parquet) (issue #316, PR#317).

Member

Also mention the enhancement to the ABFE workflow

xiki-tempula and others added 6 commits June 1, 2023 17:05
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
@xiki-tempula
Collaborator Author

@orbeckst Thanks for the review. I have changed the environment.yml. I will raise an issue about meta.yml when we do a release.

The T problem is a bit tricky, in that parquet serialisation doesn't preserve df.attrs. The dataframe loaded here therefore doesn't contain the temperature, and the temperature is assigned via the extract function. I will add a note to state that.

@orbeckst
Member

orbeckst commented Jun 1, 2023

> The T problem is a bit tricky, in that parquet serialisation doesn't preserve df.attrs. The dataframe loaded here therefore doesn't contain the temperature, and the temperature is assigned via the extract function. I will add a note to state that.

Ok, I blithely assumed that parquet would save everything — then T is not optional. Definitely document this requirement.

@xiki-tempula
Collaborator Author

@orbeckst

> Ok, I blithely assumed that parquet would save everything

I'm annoyed by this as well, but it seems that parquet is still the best serialiser. It is the only format besides to_pickle that preserves the index. It also faithfully preserves all the data in its original datatype. to_pickle, though it preserves everything, is slow to read and write and would not be safe between different versions of pandas.
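
As a rough illustration of the index issue, the sketch below round-trips a small dataframe through CSV (used here only as one example of a text format) and through pickle. The dataframe contents and index names are made up for the example. With default arguments the CSV round trip does not restore the MultiIndex, while pickle does.

```python
import io
import pickle

import pandas as pd

# A toy dataframe with a two-level index, loosely shaped like a dHdl frame.
index = pd.MultiIndex.from_tuples(
    [(0.0, 0.0), (10.0, 0.0)], names=["time", "fep-lambda"]
)
dHdl = pd.DataFrame({"fep": [1.5, 2.5]}, index=index)

# CSV round trip with default arguments: the index levels come back as
# ordinary columns and the frame gets a fresh RangeIndex.
buffer = io.StringIO()
dHdl.to_csv(buffer)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

# Pickle round trip: the MultiIndex survives unchanged.
from_pickle = pickle.loads(pickle.dumps(dHdl))

print(list(from_csv.columns))
print(list(from_pickle.index.names))
```

The CSV index can be rebuilt by hand by passing `index_col` to `pd.read_csv`, but the reader then has to know the index layout in advance.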

> Definitely document this requirement.

I have added this as a note to both alchemlyb.parsing.parquet.extract_u_nk and alchemlyb.parsing.parquet.extract_dHdl.

@orbeckst
Member

orbeckst commented Jun 1, 2023

Let me know when I need to review again.

@xiki-tempula
Collaborator Author

@orbeckst I addressed the comments. Do you mind having another review? Thank you.

Member

@orbeckst orbeckst left a comment

lgtm

@orbeckst
Member

orbeckst commented Jun 1, 2023

Please squash-merge when ready.

@xiki-tempula xiki-tempula merged commit 1d0a111 into master Jun 1, 2023
@xiki-tempula xiki-tempula deleted the 316-an-engine-agonistic-parser branch June 1, 2023 19:17
Development

Successfully merging this pull request may close these issues.

An engine agonistic parser
2 participants