Updating Serialization Functionality from PR #2140 #2704

yuxuanzhuang · 2020-05-29T18:28:11Z

Merging from and updating #2140

Changes made in this Pull Request:

add basic pickling support to Universe

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

codecov · 2020-05-29T20:19:19Z

Codecov Report

Merging #2704 into develop will increase coverage by 0.03%.
The diff coverage is 95.77%.

@@             Coverage Diff             @@
##           develop    #2704      +/-   ##
===========================================
+ Coverage    91.22%   91.26%   +0.03%     
===========================================
  Files          176      176              
  Lines        24115    24208      +93     
  Branches      3160     3160              
===========================================
+ Hits         22000    22093      +93     
- Misses        1492     1493       +1     
+ Partials       623      622       -1

Impacted Files	Coverage Δ
package/MDAnalysis/auxiliary/base.py	`88.81% <50.00%> (-0.25%)`	⬇️
package/MDAnalysis/coordinates/GMS.py	`85.71% <72.72%> (ø)`
package/MDAnalysis/coordinates/TRZ.py	`83.57% <86.36%> (ø)`
package/MDAnalysis/coordinates/TXYZ.py	`92.47% <90.00%> (+2.15%)`	⬆️
package/MDAnalysis/coordinates/XYZ.py	`89.44% <90.00%> (ø)`
package/MDAnalysis/coordinates/DLPoly.py	`95.29% <100.00%> (+0.02%)`	⬆️
package/MDAnalysis/coordinates/GSD.py	`87.75% <100.00%> (+0.52%)`	⬆️
package/MDAnalysis/coordinates/LAMMPS.py	`89.45% <100.00%> (ø)`
package/MDAnalysis/coordinates/PDB.py	`90.37% <100.00%> (ø)`
package/MDAnalysis/coordinates/TRJ.py	`94.65% <100.00%> (+0.09%)`	⬆️
... and 5 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

orbeckst · 2020-06-01T06:22:24Z

The linter is not happy with you in #2346.7

testsuite/MDAnalysisTests/parallelism/test_multiprocessing.py:4: [W1618(no-absolute-import), ] import missing `from __future__ import absolute_import`

Please fix – generally, we'd like to be CI green :-).

orbeckst · 2020-06-01T06:24:43Z

The other failing test, Job #2346.8, ran too long. That's a general problem (#2671), not specific to you, but keep in mind that tests should stay short.

I tried restarting it.

IAlibay

Sorry about the delay in reviewing here. I'm just providing some small initial comments to begin with, as there are a few things I still haven't fully wrapped my head around yet.

It would be good to add a detailed docstring about the behaviour of _[B/Ex]AsciiPickle and maybe some extra details somewhere (maybe ReaderBase?) on what we expect Readers to have in order to be pickleable (e.g. self._f, etc...).

IAlibay · 2020-06-05T16:03:12Z

package/MDAnalysis/coordinates/GSD.py

@@ -131,3 +135,4 @@ def _read_frame(self, frame):
    def _read_next_timestep(self) :
        """read next frame in trajectory"""
        return self._read_frame(self._frame + 1)
+


PEP8: no blank line is needed at the end of the file.

@IAlibay do we have a PEP8 checker running? If not, we should add one to the linter.

@lilyminium does the user guide talk about PEP8-checking?

I'm just asking because I find myself writing a lot of "PEP8" comments on PRs.

I'm not sure we do - I think there's already an ongoing discussion in #2450

does the user guide talk about PEP8-checking?

Yes, the user guide highlights important points from PEP8, mentions tools like flake8 for linting, and autopep8/yapf for autoformatting. It doesn't mention stuff like spaces after commas and around comparison operators, or blank lines, though.

IAlibay · 2020-06-05T16:08:25Z

package/MDAnalysis/coordinates/base.py

@@ -2069,6 +2078,54 @@ def _apply_transformations(self, ts):
        return ts


+class _AsciiPickle(object):


Docstring explaining what these three classes do / when they should be useful would be very helpful here (especially when it comes to the implementation of future readers).

Document everything, even if you make it a private class. You're working on the core code; for years to come, developers will have to work with what you're writing here, so be super-clear about

- what your intent is (how it should behave)
- how it interacts with other parts of the code
- what the caveats, requirements, and implicit assumptions are
- any TODOs/known issues

IAlibay · 2020-06-05T16:12:11Z

testsuite/MDAnalysisTests/core/test_universe.py

@@ -272,10 +272,14 @@ def test_load_multiple_args(self):
        assert_equal(len(u.atoms), 3341, "Loading universe failed somehow")
        assert_equal(u.trajectory.n_frames, 2 * ref.trajectory.n_frames)

-    def test_pickle_raises_NotImplementedError(self):
+    def test_pickle(self):


Would it be worth testing more than PSF/DCD here?

I think the test here is just to make sure Universe pickling works. Since all the readers will be tested in test_multiprocessing.py, I am not sure it is worth testing more here again.

My understanding here is that some of the readers are going to behave a little bit differently under the hood for pickling (e.g. GSD). My comment here was more "is it worth doing a parametrize and looping over various file types to make sure none of them fail to pickle".

IAlibay · 2020-06-05T16:13:41Z

testsuite/MDAnalysisTests/parallelism/test_multiprocessing.py

+    else:
+        top, trj = request.param
+        return mda.Universe(top, trj)
+


PEP8, two blank lines between functions

IAlibay · 2020-06-05T16:16:02Z

testsuite/MDAnalysisTests/parallelism/test_multiprocessing.py

+
+# Define target functions here
+# inside test functions doesn't work
+def cog(u, ag, frame_id):


Should this and getnames be fixtures?

IAlibay · 2020-06-05T16:43:44Z

testsuite/MDAnalysisTests/parallelism/test_multiprocessing.py

+    p.close()
+
+    assert_equal(ref, res)
+


PEP8 two blank lines here

IAlibay · 2020-06-05T16:44:03Z

testsuite/MDAnalysisTests/parallelism/test_multiprocessing.py

+    finally:
+        # make sure file handle is closed afterwards
+        r.close()
+


As above, PEP8 - two blank lines

IAlibay · 2020-06-05T16:45:48Z

package/MDAnalysis/coordinates/base.py

+        self._f = anyopen(self._pickle_fn)
+
+        del self._pickle_fn
+


PEP8 here and below, two blank lines between classes.

orbeckst · 2020-06-05T16:52:30Z

For these big PRs it's a good idea to occasionally rebase against the latest develop so as not to fall too far behind. There are also some speed-ups that run tests faster, run on other CI etc.

orbeckst · 2020-06-05T17:01:08Z

Python 2.7 fails to pickle. If unfixable, I guess we can use dill as a replacement in py27 (slower).

For right now, don't worry about Python 2.7 and focus on making it work in 3. Make your tests XFAIL if Python 2.

The current plan is to get MDAnalysis 1.0 #2443 out soon. This is the last release to officially support Python 2. Your code really only has to work with >1.x.

We can add it as "experimental" in 1.0 if we make it fail in obvious ways but it doesn't have to be full featured.

fiona-naughton · 2020-06-06T12:10:01Z

package/MDAnalysis/coordinates/TRJ.py


-        # AMBER NetCDF files should always have a convention


I assume this comment was removed accidentally

fiona-naughton · 2020-06-06T12:21:01Z

package/MDAnalysis/coordinates/TRJ.py

            if n_atoms is not None and n_atoms != self.n_atoms:
                errmsg = ("Supplied n_atoms ({0}) != natom from ncdf ({1}). "
                          "Note: n_atoms can be None and then the ncdf value "
                          "is used!".format(n_atoms, self.n_atoms))
                raise ValueError(errmsg)
        except KeyError:
            errmsg = ("NCDF trajectory {0} does not contain atom "
-                      "information".format(self.filename))
+                      "information".format(
+                        self._f.ConventionVersion, self.version))


I'm not sure why this has been changed - it looks like maybe it was copied from elsewhere by accident?

fiona-naughton · 2020-06-06T12:21:33Z

package/MDAnalysis/coordinates/TRJ.py

+        self._f = scipy.io.netcdf.netcdf_file(self.filename,
+                                              mmap=self._mmap)
+
+#    @property


Looks like it was code that was added in @richardjgowers's original PR in conjunction with removing the code for setting self.n_frames in __init__, which has since been altered and so is still present further up in this PR. I'm not sure what the original intention for the move was, though, so I don't know whether it's better to remove this or to move the updated n_frames bit here.

kain88-de · 2020-06-06T20:39:42Z

The inheritance approach used here could be a problem in the future if we need to customize how pickling works for a particular reader. Every time we want a different way of pickling, we need to write a new base class and inherit from it. We might need to read two files for some formats. Another approach is to "teach" the objects used in the readers how to pickle themselves. This software development pattern is called composition. Wikipedia has a good comparison article.

Our problem is that the default file object in python cannot be pickled. So we can write our own file object that copies the interface of the default file. Interestingly in python, we use inheritance to copy the interface of another class. So you can write a new file class like this.

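import io

class FilePickable(io.FileIO):
    def __init__(self, name, *args, **kwargs):
        super().__init__(name, *args, **kwargs)
        # remember how the file was opened so it can be reopened later
        self._args = args
        self._kwargs = kwargs

    def __getstate__(self):
        # current position plus everything needed to reopen the file
        return self.tell(), self.name, self._args, self._kwargs

    def __setstate__(self, state):
        position, name, args, kwargs = state
        super().__init__(name, *args, **kwargs)
        self.seek(position)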

The anyopen function we have could return such an object.

This change would automatically make all trajectory readers that use normal files picklable. The xdr file objects can already be pickled with a similar approach. The only thing we have to look out for is to use the anyopen function whenever we open a file within MDAnalysis. The problem I mentioned above would also be solved automatically by this.

Of course, this is only a rough sketch. There might be some edge-cases to figure out.
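
A minimal, self-contained sketch of the round trip described above, assuming a read-only binary file. The class name `PicklableFileIO` and the `demo` helper are hypothetical and not part of MDAnalysis. The sketch overrides `__reduce_ex__`, which is one way to bypass whatever pickling guard the `io` base classes define; the state carried across is only the file name, the mode, and the read position.

```python
import io
import os
import pickle
import tempfile


class PicklableFileIO(io.FileIO):
    """Binary file handle that can be reopened after unpickling.

    Hypothetical name, for illustration only.
    """

    def __init__(self, name, mode="r"):
        super().__init__(name, mode)
        self._mode = mode  # remembered so the copy reopens the same way

    def __reduce_ex__(self, protocol):
        # Recipe for pickle: call the class with (name, mode), then hand the
        # saved read position to __setstate__ on the new object.
        return (self.__class__, (self.name, self._mode),
                {"position": self.tell()})

    def __setstate__(self, state):
        self.seek(state["position"])


def demo():
    """Pickle a half-read file and return what the copy reports and reads."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
    original = PicklableFileIO(tmp.name)
    try:
        original.read(4)  # advance to byte offset 4
        copy = pickle.loads(pickle.dumps(original))
        try:
            return copy.tell(), copy.read()
        finally:
            copy.close()
    finally:
        original.close()
        os.remove(tmp.name)
```

The copy is an independent handle on the same path, so the file must still exist wherever the pickle is loaded; that holds for worker processes on one machine but is an assumption elsewhere.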

richardjgowers · 2020-06-07T11:08:16Z

@yuxuanzhuang I think @kain88-de is right here. Can you look into if this is possible?

yuxuanzhuang · 2020-06-08T06:42:40Z

@richardjgowers Sounds like something I should definitely look into. @kain88-de Thanks for your advice!

yuxuanzhuang · 2020-06-08T14:13:19Z

For these big PRs it's a good idea to occasionally rebase against the latest develop so as not to fall too far behind. There are also some speed-ups that run tests faster, run on other CI etc.

When I tried to rebase this PR, other PRs showed up in this thread. I'm not sure what mistake I am making here with git rebase mda_origin/develop.

kain88-de · 2020-06-08T14:20:23Z

Can you post the output of `git reflog | head -n 20`? It will show your last 20 reflog entries. That helps with finding out what happened.

yuxuanzhuang · 2020-06-08T14:23:08Z

Can you post the output of `git reflog | head -n 20`? It will show your last 20 reflog entries. That helps with finding out what happened.

git reflog | head -n 20                                                                                                                                           
b61081cd0 HEAD@{0}: checkout: moving from serialize to serialize_io
b61081cd0 HEAD@{1}: commit: add xfail to py2 serilization
629b98860 HEAD@{2}: pull: Merge made by the 'recursive' strategy.
405393875 HEAD@{3}: commit: fix merge err
392a71c2f HEAD@{4}: rebase finished: returning to refs/heads/serialize
392a71c2f HEAD@{5}: rebase: del commented func
ab0c2ecf7 HEAD@{6}: rebase: rely on GSDReader for pickling instead, gsd>2 dependency removed
4d808593d HEAD@{7}: rebase: need gsd>2.1.1 for pickle
0adb20f4e HEAD@{8}: rebase: add pickle test for gsd
dec7d122e HEAD@{9}: rebase: add lammpsdump support for pickle
78c563a29 HEAD@{10}: rebase: add txyz, lammpsdump formats to test
37fa2239d HEAD@{11}: rebase: add absolute_import
f9841daf3 HEAD@{12}: rebase: rm README.md
430ec157c HEAD@{13}: rebase: fix txyz trjfile
41905d73d HEAD@{14}: rebase: fix netcdf trjfile
0298ee113 HEAD@{15}: rebase: start of gsoc 2020 project, serialize
b04d298c6 HEAD@{16}: rebase: pickling all readers now works...
a231b547e HEAD@{17}: rebase: added more formats to multiprocessing tests
a0ccb0a20 HEAD@{18}: rebase: add tests for multiprocessing
ef30ad760 HEAD@{19}: rebase: make AuxReader not seralise (until tested)

yuxuanzhuang · 2020-06-08T16:57:29Z

@yuxuanzhuang I think @kain88-de is right here. Can you look into if this is possible?

So I added this modification on another branch, so as not to mess up the current one. It works fine in most cases where the file is read with open (as opposed to bz2 or gzip), and I expect those can be fixed fairly easily. We can also add gsd.hoomd.open, and any future open functions, to anyopen and make them picklable.

In terms of guaranteeing thread safety, I guess we should make sure that files opened with write permissions are not returned as this new class.

The core code reads as follows:

class FileIOPickable(io.FileIO):
    def __init__(self, name, *args, **kwargs):
        super().__init__(name, *args, **kwargs)
        self._args = args
        self._kwargs = kwargs

    def __getstate__(self):
        return self.tell(), self.__dict__

    def __setstate__(self, args):
        state = args[1]
        super().__init__(state['name'], *state['_args'], **state['_kwargs'])
        self.seek(args[0])


class TextIOPickable(io.TextIOWrapper):
    def __init__(self, buffer, *args, **kwargs):
        super().__init__(buffer, *args, **kwargs)
        self._buffer_args = buffer.__dict__
        self._args = args
        self._kwargs = kwargs
        
    def __getstate__(self):
        return self.tell(), self.__dict__

    def __setstate__(self, args):
        state = args[1]
        buffer = FileIOPickable(state['_buffer_args']['name'],
                                *state['_buffer_args']['_args'],
                                **state['_buffer_args']['_kwargs'])
        super().__init__(buffer, *state['_args'], **state['_kwargs'])
        self.seek(args[0])

def pickle_open(name, mode):
    buffer = FileIOPickable(name, mode='rb')
    if mode == 'rb':
        return buffer
    elif mode == 'rt' or mode == 'r':
        return TextIOPickable(buffer)
...
    handlers = {'bz2': bz2_open, 'gz': gzip.open, '': pickle_open }
...
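
For the text-mode case, here is a hedged sketch of the same idea, assuming a plain (uncompressed) UTF-8 file. `PicklableTextFile` and `demo` are hypothetical names, not MDAnalysis API. Unlike the snippet above, it reopens the file from a stored path and uses `__reduce_ex__` to describe the reopening; the position is carried as the opaque cookie returned by `tell()`.

```python
import io
import os
import pickle
import tempfile


class PicklableTextFile(io.TextIOWrapper):
    """Read-only text file that can be reopened after unpickling.

    Hypothetical name, for illustration only.
    """

    def __init__(self, path, encoding="utf-8"):
        buffered = io.BufferedReader(io.FileIO(path, "r"))
        super().__init__(buffered, encoding=encoding)
        self._reopen_args = (path, encoding)

    def __reduce_ex__(self, protocol):
        # tell() returns an opaque cookie that seek() accepts again later.
        return (self.__class__, self._reopen_args, {"cookie": self.tell()})

    def __setstate__(self, state):
        self.seek(state["cookie"])


def demo():
    """Read one line, pickle the handle, and return the copy's next line."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"first\nsecond\nthird\n")
    original = PicklableTextFile(tmp.name)
    try:
        original.readline()  # consume "first"
        copy = pickle.loads(pickle.dumps(original))
        try:
            return copy.readline()
        finally:
            copy.close()
    finally:
        original.close()
        os.remove(tmp.name)
```

One caveat of this design: `tell()` on a text wrapper raises `OSError` once the file has been advanced with `next()` or a `for` loop, so a handle meant to be pickled mid-read should be advanced with `readline()` instead.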

kain88-de · 2020-06-08T20:10:29Z

Thanks. Unfortunately, the rebase took more than 20 steps. In the output you can see every step that git applied, including every step of the rebase. The first cryptic name in each line is the SHA of the git commit. You can use this to go to a specific commit with git checkout <SHA> or reset the current branch with git reset --hard <SHA>.

This is a long running PR that touches many places in the code. It is difficult to do a rebase every once in a while on such a PR. If you want to catch up with Develop a merge would be better suited, more towards the end once you settled on a design and work towards finalizing the PR.

You can still reset this branch. Using git reflog, you can find the SHA of the last commit before you started the rebase. To safely reset this branch you can then run the following git commands.

git branch backup  # keep a reference, just in case you need to start over
git reset --hard <SHA>

and merge with develop.

git remote update  # fetch remote changes without local updates
git merge mda_origin/develop

About the changes regarding pickling. If you can show us the other branch with your changes that would help. Glad it seems to work.

kain88-de · 2020-06-08T20:11:14Z

To update this branch on your remote you will have to force-push: `git push origin HEAD --force-with-lease`

kain88-de · 2020-06-08T20:29:36Z

In terms of guaranteeing thread safety, I guess we should make sure that files opened with write permissions are not returned as this new class.

That should throw an exception when trying to pickle it. It is a truly exceptional case and should signal this to the caller. We can give the exception a short explanatory text. Something like "Writable file handler cannot be pickled". This is a better signal that an error occurred instead of silently returning nothing.
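
A short sketch of that behaviour, assuming the check is made at pickling time on the mode the file was opened with. `GuardedFileIO` and `demo` are hypothetical names, and the choice of `RuntimeError` is an assumption; only the message text comes from the suggestion above.

```python
import io
import os
import pickle
import tempfile


class GuardedFileIO(io.FileIO):
    """File handle that refuses to be pickled unless it is read-only.

    Hypothetical name, for illustration only.
    """

    def __init__(self, name, mode="r"):
        super().__init__(name, mode)
        self._mode = mode

    def __reduce_ex__(self, protocol):
        # Only the modes "r" and "rb" count as read-only here.
        if not set(self._mode) <= {"r", "b"}:
            raise RuntimeError(
                "Writable file handler cannot be pickled "
                "(opened with mode {!r})".format(self._mode))
        return (self.__class__, (self.name, self._mode),
                {"position": self.tell()})

    def __setstate__(self, state):
        self.seek(state["position"])


def demo():
    """Try to pickle a handle opened for writing; return the error text."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    handle = GuardedFileIO(path, "w")
    try:
        pickle.dumps(handle)
    except RuntimeError as err:
        return str(err)
    finally:
        handle.close()
        os.remove(path)
    return None
```

Raising from `__reduce_ex__` means the failure surfaces at the `pickle.dumps` call, which is where the caller can act on it.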

orbeckst · 2020-06-08T21:26:34Z

This is a long running PR that touches many places in the code. It is difficult to do a rebase every once in a while on such a PR. If you want to catch up with Develop a merge would be better suited, more towards the end once you settled on a design and work towards finalizing the PR.

Sorry, I suggested the rebase because we had a number of changes that would improve running tests. I didn't anticipate that the rebase would mess too much with history (normally I find that the rebases to develop are pretty clean, serial history).

yuxuanzhuang · 2020-06-09T07:47:12Z

This is a pretty preliminary branch for IO serialization. It works for most cases (and is even slightly faster, I believe); I have listed the current Reader failures in the comments.

If we opt for this, I guess we can merge those changes into this PR. I do love how neat it becomes with composition instead of inheritance.

richardjgowers · 2020-06-09T12:45:04Z

@yuxuanzhuang ok the branch looks good. I think a way forward here is to create a branch that implements only the FileIOPicklable and TextIOPicklable classes, has a bunch of tests for them, and we can merge that in a PR. Then we can rebase this work on top of that.

* Fixes #2878
* basic approach: composition instead of inheritance for pickling Universe (which was tested in PR #2704 (which was derived from PR #2140))
* Changes made in this Pull Request:
  - add new classes and pickle_open function to picklable_file_io.py
  - add new picklable `BufferedReader`, `FileIO`, and `TextIOWrapper` classes.
  - implement `__getstate__`, `__setstate__` to `Universe` and `BaseReader`
  - fix DCD, XDR pickle issue
  - modified gsd and ncdf to be picklable
  - modified ChainReader to be picklable
  - ensure chemfiles is picklable
* add tests (MultiFrameReader will test for serializability)
* update CHANGELOG
* update docs

Note: This merge squashed 120 commits. See PR #2723 for the full history with 420 comments.

orbeckst · 2020-08-08T00:45:45Z

I think we can close this PR, now that PR #2723 is merged?

orbeckst · 2020-08-10T21:45:39Z

Closing because it has been superseded by "composition over inheritance" PR #2723 .

richardjgowers and others added 15 commits February 16, 2019 10:52

added experimental pickling support to Universe

a00840e

fixed pickle test

133072f

wip of reader pickling

aa595f5

simplified serialisation support more

ba1a138

make AuxReader not seralise (until tested)

caa588d

add tests for multiprocessing

e21361d

added more formats to multiprocessing tests

99bdd08

and broke everything

pickling all readers now works...

f1a7d5a

Merge pull request #1 from MDAnalysis/develop

931d9b5

update to HEAD

start of gsoc 2020 project, serialize

7126473

merge serialise(#PR2140)

0c528c7

fix depreciate core.flag

59f55e8

fix netcdf trjfile

dce2dd4

fix txyz trjfile

c730e79

rm README.md

598a671

orbeckst mentioned this pull request May 29, 2020

added experimental pickling support to Universe #2140

Closed

orbeckst requested review from IAlibay, orbeckst, richardjgowers and fiona-naughton and removed request for IAlibay May 29, 2020 18:45

yuxuanzhuang added 5 commits June 1, 2020 09:51

add absolute_import

2c2883a

add txyz, lammpsdump formats to test

7f1f4a1

add lammpsdump support for pickle

33bf888

add pickle test for gsd

90d8f78

need gsd>2.1.1 for pickle

31e08a4

IAlibay requested changes Jun 5, 2020

richardjgowers assigned richardjgowers, fiona-naughton and IAlibay Jun 6, 2020

fiona-naughton requested changes Jun 6, 2020

merge with develop

91b28f4

yuxuanzhuang force-pushed the serialize branch from 77843d4 to 91b28f4 Compare June 8, 2020 21:03

add xfail to py2 serilization

864d733

yuxuanzhuang mentioned this pull request Jun 9, 2020

Serialize FileIO and TextIOWrapper and Universe #2723

Merged

4 tasks

orbeckst added the GSoC GSoC project label Aug 7, 2020

orbeckst closed this Aug 10, 2020
