Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating Serialization Functionality from PR #2140 #2704

Closed
wants to merge 24 commits into from

Conversation

yuxuanzhuang
Copy link
Contributor

Merging from and updating #2140

Changes made in this Pull Request:

  • add basic pickling support to Universe

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@codecov
Copy link

codecov bot commented May 29, 2020

Codecov Report

Merging #2704 into develop will increase coverage by 0.03%.
The diff coverage is 95.77%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2704      +/-   ##
===========================================
+ Coverage    91.22%   91.26%   +0.03%     
===========================================
  Files          176      176              
  Lines        24115    24208      +93     
  Branches      3160     3160              
===========================================
+ Hits         22000    22093      +93     
- Misses        1492     1493       +1     
+ Partials       623      622       -1     
Impacted Files Coverage Δ
package/MDAnalysis/auxiliary/base.py 88.81% <50.00%> (-0.25%) ⬇️
package/MDAnalysis/coordinates/GMS.py 85.71% <72.72%> (ø)
package/MDAnalysis/coordinates/TRZ.py 83.57% <86.36%> (ø)
package/MDAnalysis/coordinates/TXYZ.py 92.47% <90.00%> (+2.15%) ⬆️
package/MDAnalysis/coordinates/XYZ.py 89.44% <90.00%> (ø)
package/MDAnalysis/coordinates/DLPoly.py 95.29% <100.00%> (+0.02%) ⬆️
package/MDAnalysis/coordinates/GSD.py 87.75% <100.00%> (+0.52%) ⬆️
package/MDAnalysis/coordinates/LAMMPS.py 89.45% <100.00%> (ø)
package/MDAnalysis/coordinates/PDB.py 90.37% <100.00%> (ø)
package/MDAnalysis/coordinates/TRJ.py 94.65% <100.00%> (+0.09%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f7f5d3...864d733. Read the comment docs.

@orbeckst
Copy link
Member

orbeckst commented Jun 1, 2020

The linter is not happy with you in #2346.7

testsuite/MDAnalysisTests/parallelism/test_multiprocessing.py:4: [W1618(no-absolute-import), ] import missing `from __future__ import absolute_import`

Please fix – generally, we'd like to be CI green :-).

@orbeckst
Copy link
Member

orbeckst commented Jun 1, 2020

The other failing test Job #2346.8 ran too long. That's a general problem #2671 , not specific to you , but keep in mind keeping tests short.

I tried restarting it.

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the delay in reviewing here, I'm just providing some small initial comments to begin with as there's a few things I still haven't fully wrapped my head around yet.

It would be good to add some detailed docstring about the behaviour of _[B/Ex]AsciiPickle and maybe some extra details in somewhere (maybe ReaderBase?) on what we expect Readers to have in order to be pickleable (e.g. self._f, etc...).

@@ -131,3 +135,4 @@ def _read_frame(self, frame):
def _read_next_timestep(self) :
"""read next frame in trajectory"""
return self._read_frame(self._frame + 1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 don't need a blank line on the last line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IAlibay do we have a PEP8 checker running? If not, we should add one to the linter.

@lilyminium does the user guide talk about PEP8-checking?

I'm just asking I find myself writing a lot of "PEP8" comments on PRs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we do - I think there's already an ongoing discussion in #2450

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the user guide talk about PEP8-checking?

Yes, the user guide highlights important points from PEP8, mentions tools like flake8 for linting, and autopep8/yapf for autoformatting. It doesn't mention stuff like spaces after commas and around comparison operators, or blank lines, though.

@@ -2069,6 +2078,54 @@ def _apply_transformations(self, ts):
return ts


class _AsciiPickle(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring explaining what these three classes do / when they should be useful would be very helpful here (especially when it comes to the implementation of future readers).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document everything, even if you make it a private class. You're working on the core code, for years to come, developers will have to work with what you're writing here, so be super-clear about

  • what your intent is (how it should behave)
  • how it interacts with other parts of the code
  • what the caveats, requirements, and implicit assumptions are
  • any TODOS/known issues

@@ -272,10 +272,14 @@ def test_load_multiple_args(self):
assert_equal(len(u.atoms), 3341, "Loading universe failed somehow")
assert_equal(u.trajectory.n_frames, 2 * ref.trajectory.n_frames)

def test_pickle_raises_NotImplementedError(self):
def test_pickle(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth testing more than PSF/DCD here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test here is just to make sure Universe pickling works. Since all the readers will be tested in test_multiprocessing.py, I am not sure it is worth testing more here again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding here is that some of the readers are going to behave a little bit differently under the hood for pickling (e.g. GSD). My comment here was more "is it worth doing a parametrize and looping over various file types to make sure none of them fail to pickle".

else:
top, trj = request.param
return mda.Universe(top, trj)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8, two blank lines between functions


# Define target functions here
# inside test functions doesn't work
def cog(u, ag, frame_id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this and getnames be fixtures?

p.close()

assert_equal(ref, res)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 two blank lines here

finally:
# make sure file handle is closed afterwards
r.close()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, PEP8 - two blank lines

self._f = anyopen(self._pickle_fn)

del self._pickle_fn

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 here and below, two blank lines between classes.

@orbeckst
Copy link
Member

orbeckst commented Jun 5, 2020

For these big PRs it's a good idea to occasionally rebase against the latest develop as not to fall behind too far. There are also some speed-ups that run tests faster, run on other CI etc.

@orbeckst
Copy link
Member

orbeckst commented Jun 5, 2020

Python 2.7 fails to pickle. If unfixable, I guess we can use dill as a replacement in py27 (slower).

For right now, don't worry about Python 2.7 and focus on making it work in 3. Make your tests XFAIL if Python 2.

The current plan is to get MDAnalysis 1.0 #2443 out soon. This is the last release to officially support Python 2. Your code really only has to work with >1.x.

We can add it as "experimental" in 1.0 if we make it fail in obvious ways but it doesn't have to be full featured.


# AMBER NetCDF files should always have a convention
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this comment was removed accidentally

if n_atoms is not None and n_atoms != self.n_atoms:
errmsg = ("Supplied n_atoms ({0}) != natom from ncdf ({1}). "
"Note: n_atoms can be None and then the ncdf value "
"is used!".format(n_atoms, self.n_atoms))
raise ValueError(errmsg)
except KeyError:
errmsg = ("NCDF trajectory {0} does not contain atom "
"information".format(self.filename))
"information".format(
self._f.ConventionVersion, self.version))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this has been changed - it looks like maybe it was copied from elsewhere by accident?

self._f = scipy.io.netcdf.netcdf_file(self.filename,
mmap=self._mmap)

# @property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it was code that was added in @richardjgowers's original PR in conjunction with removing the code for setting self.n_frames in __init__, which has since been altered and so still in this PR up here . I'm not sure what the original intention for the move was, though, so don't know if it's better to remove this or move the updated n_frames bit here

@kain88-de
Copy link
Member

The inheritance approach used here could be a problem in the future if we need to customize for a reader how pickling works. Every time we want a different way of pickling we need to write a new base class and inherits from it. It might be we need to read two files for some formats. Another approach is to "teach" the objects used in the readers how to pickle themselves. This software development pattern is called composition. Wikipedia has a good comparison article

Our problem is that the default file object in python cannot be pickled. So we can write our own file object that copies the interface of the default file. Interestingly in python, we use inheritance to copy the interface of another class. So you can write a new file class like this.

class FilePickable(io.FileIO)
    def init(self, name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._args = args
        self._kwargs = lwargs

    def getstate(self):
        return  self.tell(), self._args, self._kwargs

    def setstate(self, args):
         super().__init__(*args[1], **args[2])
         self.seek(args[0])

The anyopen function we have could return such an object.

This change would automatically make all trajectory readers using normal files pickable. The xdr file objects can already be pickled with a similar approach. The only thing we have to look out for is to use the anyopen function when we open any file within MDAnalysis. My mentioned problem above would also be automatically solved with this.

Of course, this is only a rough sketch. There might be some edge-cases to figure out.

@richardjgowers
Copy link
Member

@yuxuanzhuang I think @kain88-de is right here. Can you look into if this is possible?

@yuxuanzhuang
Copy link
Contributor Author

@richardjgowers Sounds like something I should definitely look into. @kain88-de Thanks for your advice!

@yuxuanzhuang
Copy link
Contributor Author

For these big PRs it's a good idea to occasionally rebase against the latest develop as not to fall behind too far. There are also some speed-ups that run tests faster, run on other CI etc.

When I tried to rebase this PR, other PRs shows up in this thread, not sure what mistake I am making here by git rebase mda_origin/develop

@kain88-de
Copy link
Member

kain88-de commented Jun 8, 2020 via email

@yuxuanzhuang
Copy link
Contributor Author

Can you post the output of git reflog | head -n 20. It will show your last 10 git commands. That helps with finding out what happened.

On Mon 8. Jun 2020 at 16:13, Yuxuan Zhuang @.***> wrote: For these big PRs it's a good idea to occasionally rebase against the latest develop as not to fall behind too far. There are also some speed-ups that run tests faster, run on other CI etc. When I tried to rebase this PR, other PRs shows up in this thread, not sure what mistake I am making here by git rebase mda_origin/develop — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2704 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABA2OVUCUKPI2AETHN774XTRVTWY7ANCNFSM4NOIGZDQ .

git reflog | head -n 20                                                                                                                                           
b61081cd0 HEAD@{0}: checkout: moving from serialize to serialize_io
b61081cd0 HEAD@{1}: commit: add xfail to py2 serilization
629b98860 HEAD@{2}: pull: Merge made by the 'recursive' strategy.
405393875 HEAD@{3}: commit: fix merge err
392a71c2f HEAD@{4}: rebase finished: returning to refs/heads/serialize
392a71c2f HEAD@{5}: rebase: del commented func
ab0c2ecf7 HEAD@{6}: rebase: rely on GSDReader for pickling instead, gsd>2 dependency removed
4d808593d HEAD@{7}: rebase: need gsd>2.1.1 for pickle
0adb20f4e HEAD@{8}: rebase: add pickle test for gsd
dec7d122e HEAD@{9}: rebase: add lammpsdump support for pickle
78c563a29 HEAD@{10}: rebase: add txyz, lammpsdump formats to test
37fa2239d HEAD@{11}: rebase: add absolute_import
f9841daf3 HEAD@{12}: rebase: rm README.md
430ec157c HEAD@{13}: rebase: fix txyz trjfile
41905d73d HEAD@{14}: rebase: fix netcdf trjfile
0298ee113 HEAD@{15}: rebase: start of gsoc 2020 project, serialize
b04d298c6 HEAD@{16}: rebase: pickling all readers now works...
a231b547e HEAD@{17}: rebase: added more formats to multiprocessing tests
a0ccb0a20 HEAD@{18}: rebase: add tests for multiprocessing
ef30ad760 HEAD@{19}: rebase: make AuxReader not seralise (until tested)

@yuxuanzhuang
Copy link
Contributor Author

@yuxuanzhuang I think @kain88-de is right here. Can you look into if this is possible?

So I added such modification to another branch, not to mess up with current one. It works fine with most cases when the file was read by open (instead of bz2, gzip). I envision we can fix this pretty easily. we can also add gsd.hoomd.open, and any future open functions to anyopen, and make them pickable.

In terms of guaranteeing thread-safe. I guess we should make sure the files open with writing permissions should not return with this new Class.

The core codes read below:

class FileIOPickable(io.FileIO):
    def __init__(self, name, *args, **kwargs):
        super().__init__(name, *args, **kwargs)
        self._args = args
        self._kwargs = kwargs

    def __getstate__(self):
        return self.tell(), self.__dict__

    def __setstate__(self, args):
        state = args[1]
        super().__init__(state['name'], *state['_args'], **state['_kwargs'])
        self.seek(args[0])


class TextIOPickable(io.TextIOWrapper):
    def __init__(self, buffer, *args, **kwargs):
        super().__init__(buffer, *args, **kwargs)
        self._buffer_args = buffer.__dict__
        self._args = args
        self._kwargs = kwargs
        
    def __getstate__(self):
        return self.tell(), self.__dict__

    def __setstate__(self, args):
        state = args[1]
        buffer = FileIOPickable(state['_buffer_args']['name'],
                                *state['_buffer_args']['_args'],
                                **state['_buffer_args']['_kwargs'])
        super().__init__(buffer, *state['_args'], **state['_kwargs'])
        self.seek(args[0])

def pickle_open(name, mode):
    buffer = FileIOPickable(name, mode='rb')
    if mode == 'rb':
        return buffer
    elif mode == 'rt' or mode == 'r':
        return TextIOPickable(buffer)
...
    handlers = {'bz2': bz2_open, 'gz': gzip.open, '': pickle_open }
...

@kain88-de
Copy link
Member

Thanks. Unfortunately, the rebase took more than 20 steps. In the output you can see every step that was applied by git, also every step of the rebase. The first cryptic name in each line is the SHA of the git commit. You can use this to go a specific commit with git checkout <SHA> or reset the current branch with git reset --hard <SHA>.

This is a long running PR that touches many places in the code. It is difficult to do a rebase every once in a while on such a PR. If you want to catch up with Develop a merge would be better suited, more towards the end once you settled on a design and work towards finalizing the PR.

You can still reset this branch. Using git reflog you can look at what the first commit SHA before you started the rebase. To safely reset this branch you can then run the following git commands.

git branch backup # just in case reference to start over.
git reset --hard <SHA>

and merge with develop.

git remote update  # fetch remote changes without local updates
git merge mda_origin/develop

About the changes regarding pickling. If you can show us the other branch with your changes that would help. Glad it seems to work.

@kain88-de
Copy link
Member

To update this branch on your remote you will have to force the push git push origin HEAD --force-with-lease

@kain88-de
Copy link
Member

In terms of guaranteeing thread-safe. I guess we should make sure the files open with writing permissions should not return with this new Class.

That should throw an exception when trying to pickle it. It is a truly exceptional case and should signal this to the caller. We can give the exception a short explanatory text. Something like "Writable file handler cannot be pickled". This is a better signal that an error occurred instead of silently returning nothing.

@orbeckst
Copy link
Member

orbeckst commented Jun 8, 2020

This is a long running PR that touches many places in the code. It is difficult to do a rebase every once in a while on such a PR. If you want to catch up with Develop a merge would be better suited, more towards the end once you settled on a design and work towards finalizing the PR.

Sorry, I suggested the rebase because we had a number of changes that would improve running tests. I didn't anticipate that the rebase would mess too much with history (normally I find that the rebases to develop are pretty clean, serial history).

@yuxuanzhuang
Copy link
Contributor Author

This is a pretty preliminary branch for IO serialization. It works for most case (and even slight faster I believe), while I attached current Reader failures in the comments.

If we are opt for this, I guess we can merge those changes into this PR. I do love how neet it becomes with composition instead of inheritance.

@richardjgowers
Copy link
Member

@yuxuanzhuang ok the branch looks good. I think a way forward here is to create a branch that implements only the FileIOPicklable and TextIOPicklable classes, has a bunch of tests for them, and we can merge that in a PR. Then we can rebase this work on top of that.

@orbeckst orbeckst added the GSoC GSoC project label Aug 7, 2020
orbeckst pushed a commit that referenced this pull request Aug 8, 2020
* Fixes #2878
* basic approach: composition instead of inheritance for pickling Universe (which was
   tested in PR #2704 (which was derived from PR #2140))
* Changes made in this Pull Request:
    - add new classes and pickle_open function to picklable_file_io.py
    - add new picklable `BufferedReader`, `FileIO`, and `TextIOWrapper` classes.
    - implement `__getstate__`, `__setstate__` to `Universe` and `BaseReader`
    - fix DCD, XDR pickle issue
    - modified gsd and ncdf to be picklabel
    - modified ChainReader to be picklabel
    - ensure chemfiles is picklable
* add tests (MultiFrameReader will test for serializability)
* update CHANGELOG
* update docs

Note: This merge squashed 120 commits. See PR #2723 for the full history with 420 comments.
@orbeckst
Copy link
Member

orbeckst commented Aug 8, 2020

I think we can close this PR, now that PR #2723 is merged?

@orbeckst
Copy link
Member

Closing because it has been superseded by "composition over inheritance" PR #2723 .

@orbeckst orbeckst closed this Aug 10, 2020
PicoCentauri pushed a commit to PicoCentauri/mdanalysis that referenced this pull request Mar 30, 2021
* Fixes MDAnalysis#2878
* basic approach: composition instead of inheritance for pickling Universe (which was
   tested in PR MDAnalysis#2704 (which was derived from PR MDAnalysis#2140))
* Changes made in this Pull Request:
    - add new classes and pickle_open function to picklable_file_io.py
    - add new picklable `BufferedReader`, `FileIO`, and `TextIOWrapper` classes.
    - implement `__getstate__`, `__setstate__` to `Universe` and `BaseReader`
    - fix DCD, XDR pickle issue
    - modified gsd and ncdf to be picklabel
    - modified ChainReader to be picklabel
    - ensure chemfiles is picklable
* add tests (MultiFrameReader will test for serializability)
* update CHANGELOG
* update docs

Note: This merge squashed 120 commits. See PR MDAnalysis#2723 for the full history with 420 comments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GSoC GSoC project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants