-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parquet parser for dataframe serialisation #317
Conversation
Codecov Report
@@ Coverage Diff @@
## master #317 +/- ##
==========================================
+ Coverage 98.74% 98.75% +0.01%
==========================================
Files 26 27 +1
Lines 1754 1772 +18
Branches 382 387 +5
==========================================
+ Hits 1732 1750 +18
Misses 2 2
Partials 20 20
|
@orbeckst Do you mind have a review of this PR, please? Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basic functionality looks good but some request for changes
- docs and clearer code (see comments)
- update all build files with new pyarrow dependency: environment.yml, setup.py, + raise issue in https://github.com/conda-forge/alchemlyb-feedstock/ to add pyarrow to the deps of the meta.yml of the conda-forge package (and link to this issue)
src/alchemlyb/parsing/parquet.py
Outdated
path : str | ||
Path to parquet file to extract dataframe from. | ||
T : float | ||
Temperature in Kelvin the simulations sampled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temperature in Kelvin the simulations sampled. | |
Temperature in Kelvin of the simulations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
State that T is ignored but included for API compatibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry but parquet serialisation doesn't preserve the df.attrs. So the dataframe loaded here doesn't contain temperature and the temperature is assigned via this T
.
CHANGES
Outdated
|
||
Enhancements | ||
- Add a parser to read serialised pandas dataframe (parquet) (issue #316, PR#317). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also mention the enhancement to the ABFE workflow
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
…' into 316-an-engine-agonistic-parser
@orbeckst Thanks for the review. I have change the environment.yml. I will raise an issue with meta.yml when we are doing a release. So the T problem is a bit tricky in that parquet serialisation doesn't preserve the df.attrs. So the dataframe loaded here doesn't contain temperature and the temperature is assigned via extract function. I will add a not to state that. |
Ok, I blithely assumed that parquet would save everything — then |
I'm annoyed by this as well but it seems that parquet is still the best serialiser. This is the only format that preserves index besides to_pickle. Also it faithfully preserve all the data in its original datatype. The to_pickle, though it preserves everything, is slow to read and write and also would not be safe between different versions of pandas.
I have added this as a note to both alchemlyb.parsing.parquet.extract_u_nk and alchemlyb.parsing.parquet.extract_dHdl. |
Let me know when I need to review again. |
@orbeckst I addressed the comments. Do you mind having another review? Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Please squash-merge when ready. |
This PR add a parquet parser, which will load the serialised dataframe and permits the usage of
Fix #316