VectorNumpy pickle suppport to enable multiprocessing #163

cansik · 2022-01-06T12:28:51Z

When sharing VectorNumpy based arrays (like VectorNumpy3D) between processes in python, pickle is used for serialisation and deserialisation of the objects. But it seems that structured arrays which inherhit form a Numpy array lose part of their state during that process.

Here is an example that shows the problem:

import pickle
import numpy as np
import vector

test = np.arange(0, 24, 0.1).view([("x", float), ("y", float), ("z", float)]).view(vector.VectorNumpy3D)

test_pickled = pickle.dumps(test)
test_new = pickle.loads(test_pickled)

print(test_new) # Leads to exception A
print(test_new[0]) # Leads to exception B

Exception A

First, there is a problem that __array_finalize__ is called twice from pickle.loads, once with the actual data and once with a None type. This leads to the following error message:

VectorNumpy3D must have a structured dtype containing fields ("x", "y") or ("rho", "phi")

It is possible to prevent this behaviour by checking for None arguments and just ignore them (L900-L901)

Exception B

After that it is possible to print the unpickled array, but accessing the elements itself (test_new[0]) leads to another error message. I guess it has to do with the construction of the VectorObjects itself.

TypeError: __init__() takes 3 positional arguments but 4 were given

Solution

After a bit of research I found this stackoverflow answer, why subclasses of numpy can lose their state when being pickled. Implementing this in the VectorNumpy parent class (L433-L440) fixes it also for all subclasses like VectorNumpy3D. To check if the implementation works, there are six new tests pickling/unpickling every Vector/Momentum-Numpy array variant.

In my own projects the code is already used for multi-processed calculations and result-sharing in the VectorNumpy4D format. Would be great if you could review it and give me feedback if this implementation is ok.

… VectorNumpy object

for more information, see https://pre-commit.ci

jpivarski

You're right: VectorNumpy*D and MomentumNumpy*D instances do have state beyond the array state: they have

self._azimuthal_type (all classes)
self._longitudinal_type (3D and 4D classes)
self._temporal_type (4D classes)

which specifies which coordinate system the array belongs to. That information could have been generated every time it's needed by looking at the set of field names in its dtype, but instead it's assigned in __array_finalize__ (which you've likely seen when you fixed the obj is None case).

Another solution would have been to make this data lazy ("generate every time"), but the current implementation could be thought of as an optimization: the values of self._azimuthal_type, etc., are class objects, so checking their value is a quick is comparison. Doing it lazily would add several string comparisons. That would slow down the function dispatch—deciding which compute function to run—but it wouldn't slow down the execution of the compute function, which is where most of the time should be spent. So it's not obvious that we need this optimization, but why change things to make them slower? Functional purity (i.e. no mutable state other than what NumPy needs)?

Your solution of adding this state to the pickle data works well. I think the self.__dict__ includes instance state (self._azimuthal_type, etc.) and not class object state (VectorNumpy*D.ObjectClass and VectorNumpy*D._IS_MOMENTUM), which we wouldn't want to double-count because then the unpickled instances might have these as class attributes and instance attributes. I just checked: it does the right thing (__dict__ only includes instance state, not class object state). So that's good.

I like the fact that pickling could be solved in one place without code duplication. There's the right number of obj is None checks in __array_finalize__ methods (6, for the 6 concrete classes), so that's good. When pickling, there's a choice between __reduce__/__setstate__ and __getstate__/__setstate__. I don't know which is more modern and preferred. I just looked it up (here):

Although powerful, implementing __reduce__() directly in your classes is error prone. For this reason, class designers should use the high-level interface (i.e., __getnewargs_ex__(), __getstate__() and __setstate__()) whenever possible. We will show, however, cases where using __reduce__() is the only option or leads to more efficient pickling or both.

On the one hand, it's not saying that __reduce__ is deprecated; it's not less future-proof than __getstate__. If "error prone" is the only argument against it, then demonstrating that it works with tests is a good answer to it, unless it's also difficult to maintain. It doesn't look difficult to maintain. I'm fine with it being __reduce__ if you are.

If you'd prefer to replace __reduce__ with a __getstate__ (or some other change), let me know and I'll hold off merging until you're done. Otherwise, let me know that you're done with this PR and I'll merge it.

And thank you for contributing!

henryiii · 2022-01-06T16:36:55Z

I'd usually use setstate/getstate (reduce should only be used if getstate cannot be used). Was there a reason getstate could not be used or would be significantly more complex? Given you are just slightly modifying a super call to reduce, I think reduce is pretty likely to be safe in this case.

Sort of related, but nothing to do directly with this PR: can we add __slots__?

henryiii

I'm happy with this, ideally you can commend on __reduce__, but as long as it's clearly the best solution, I'm happy.

henryiii · 2022-01-06T16:39:08Z

src/vector/_backends/numpy_.py

+
+    def __setstate__(self, state: typing.Any) -> None:
+        self.__dict__.update(state[-1])
+        super().__setstate__(state[0:-1])  # type: ignore


Reminder to myself: we should enable --show-error-codes on mypy and require the error codes here, so we know why these are being ignored. I can do that after the PR.

jpivarski · 2022-01-06T16:47:53Z

Sort of related, but nothing to do directly with this PR: can we add __slots__?

We could use __slots__. For the most part, I haven't had time to focus on Vector apart from keeping up with the most urgent bug reports (i.e. calling a function returns the wrong answer).

From what I know, __slots__ would be an optimization, but it might have other benefits you know of. It could be applied to Vector/MomentumNumpy*D and Vector/MomentumObject*D, but not the Awkward Array backend because that one doesn't have extra instance state.

jpivarski · 2022-01-06T17:34:09Z

@cansik Give us an indication of whether you're done and Henry or I will squash-and-merge your PR. Thanks!

henryiii · 2022-01-06T17:42:20Z

My favorite reason to add __slots__ is that it causes v.misspelled = 2 to start failing. If you still want a dict, it can give you extra control over the __dict__ by explicitly adding it to __slots__; then you can have things in __slots__ and things in __dict__.

cansik · 2022-01-06T18:21:22Z

@jpivarski & @henryiii Thanks for the review. I mainly followed the explanation in the stackoverflow answer and it seems that numpy does not support getstate, but reduce by default: https://stackoverflow.com/a/54487727/1138326

For this reason, I would leave the implementation as it is and I have nothing to add to it.

jpivarski · 2022-01-06T18:28:29Z

numpy does not support getstate

That's definitely a good enough reason. NumPy's restrictions on subclasses are pretty tight.

Thanks, I'll merge it now!

cansik and others added 7 commits January 6, 2022 12:49

added VectorNumpy pickle tests

b2b83da

added MomentumNumpy pickle tests

188a316

implemented reduce method to preserve custom attributes when pickling…

78b6de8

… VectorNumpy object

[pre-commit.ci] auto fixes from pre-commit.com hooks

a081246

for more information, see https://pre-commit.ci

added return types and fixed tuple concatenation

57277e3

added stronger type annotations

8f97124

fixed annotations to match super class

93ad44a

jpivarski approved these changes Jan 6, 2022

View reviewed changes

henryiii approved these changes Jan 6, 2022

View reviewed changes

henryiii reviewed Jan 6, 2022

View reviewed changes

jpivarski merged commit c207505 into scikit-hep:main Jan 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VectorNumpy pickle suppport to enable multiprocessing #163

VectorNumpy pickle suppport to enable multiprocessing #163

cansik commented Jan 6, 2022

jpivarski left a comment

henryiii commented Jan 6, 2022

henryiii left a comment

henryiii Jan 6, 2022

jpivarski commented Jan 6, 2022

jpivarski commented Jan 6, 2022

henryiii commented Jan 6, 2022

cansik commented Jan 6, 2022

jpivarski commented Jan 6, 2022

VectorNumpy pickle suppport to enable multiprocessing #163

VectorNumpy pickle suppport to enable multiprocessing #163

Conversation

cansik commented Jan 6, 2022

Exception A

Exception B

Solution

jpivarski left a comment

Choose a reason for hiding this comment

henryiii commented Jan 6, 2022

henryiii left a comment

Choose a reason for hiding this comment

henryiii Jan 6, 2022

Choose a reason for hiding this comment

jpivarski commented Jan 6, 2022

jpivarski commented Jan 6, 2022

henryiii commented Jan 6, 2022

cansik commented Jan 6, 2022

jpivarski commented Jan 6, 2022