Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VectorNumpy pickle suppport to enable multiprocessing #163

Merged
merged 7 commits into from
Jan 6, 2022

Conversation

cansik
Copy link
Contributor

@cansik cansik commented Jan 6, 2022

When sharing VectorNumpy based arrays (like VectorNumpy3D) between processes in python, pickle is used for serialisation and deserialisation of the objects. But it seems that structured arrays which inherhit form a Numpy array lose part of their state during that process.

Here is an example that shows the problem:

import pickle
import numpy as np
import vector

test = np.arange(0, 24, 0.1).view([("x", float), ("y", float), ("z", float)]).view(vector.VectorNumpy3D)

test_pickled = pickle.dumps(test)
test_new = pickle.loads(test_pickled)

print(test_new) # Leads to exception A
print(test_new[0]) # Leads to exception B

Exception A

First, there is a problem that __array_finalize__ is called twice from pickle.loads, once with the actual data and once with a None type. This leads to the following error message:

VectorNumpy3D must have a structured dtype containing fields ("x", "y") or ("rho", "phi")

It is possible to prevent this behaviour by checking for None arguments and just ignore them (L900-L901)

Exception B

After that it is possible to print the unpickled array, but accessing the elements itself (test_new[0]) leads to another error message. I guess it has to do with the construction of the VectorObjects itself.

TypeError: __init__() takes 3 positional arguments but 4 were given

Solution

After a bit of research I found this stackoverflow answer, why subclasses of numpy can lose their state when being pickled. Implementing this in the VectorNumpy parent class (L433-L440) fixes it also for all subclasses like VectorNumpy3D. To check if the implementation works, there are six new tests pickling/unpickling every Vector/Momentum-Numpy array variant.

In my own projects the code is already used for multi-processed calculations and result-sharing in the VectorNumpy4D format. Would be great if you could review it and give me feedback if this implementation is ok.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right: VectorNumpy*D and MomentumNumpy*D instances do have state beyond the array state: they have

  • self._azimuthal_type (all classes)
  • self._longitudinal_type (3D and 4D classes)
  • self._temporal_type (4D classes)

which specifies which coordinate system the array belongs to. That information could have been generated every time it's needed by looking at the set of field names in its dtype, but instead it's assigned in __array_finalize__ (which you've likely seen when you fixed the obj is None case).

Another solution would have been to make this data lazy ("generate every time"), but the current implementation could be thought of as an optimization: the values of self._azimuthal_type, etc., are class objects, so checking their value is a quick is comparison. Doing it lazily would add several string comparisons. That would slow down the function dispatch—deciding which compute function to run—but it wouldn't slow down the execution of the compute function, which is where most of the time should be spent. So it's not obvious that we need this optimization, but why change things to make them slower? Functional purity (i.e. no mutable state other than what NumPy needs)?

Your solution of adding this state to the pickle data works well. I think the self.__dict__ includes instance state (self._azimuthal_type, etc.) and not class object state (VectorNumpy*D.ObjectClass and VectorNumpy*D._IS_MOMENTUM), which we wouldn't want to double-count because then the unpickled instances might have these as class attributes and instance attributes. I just checked: it does the right thing (__dict__ only includes instance state, not class object state). So that's good.

I like the fact that pickling could be solved in one place without code duplication. There's the right number of obj is None checks in __array_finalize__ methods (6, for the 6 concrete classes), so that's good. When pickling, there's a choice between __reduce__/__setstate__ and __getstate__/__setstate__. I don't know which is more modern and preferred. I just looked it up (here):

Although powerful, implementing __reduce__() directly in your classes is error prone. For this reason, class designers should use the high-level interface (i.e., __getnewargs_ex__(), __getstate__() and __setstate__()) whenever possible. We will show, however, cases where using __reduce__() is the only option or leads to more efficient pickling or both.

On the one hand, it's not saying that __reduce__ is deprecated; it's not less future-proof than __getstate__. If "error prone" is the only argument against it, then demonstrating that it works with tests is a good answer to it, unless it's also difficult to maintain. It doesn't look difficult to maintain. I'm fine with it being __reduce__ if you are.

If you'd prefer to replace __reduce__ with a __getstate__ (or some other change), let me know and I'll hold off merging until you're done. Otherwise, let me know that you're done with this PR and I'll merge it.

And thank you for contributing!

@henryiii
Copy link
Member

henryiii commented Jan 6, 2022

I'd usually use setstate/getstate (reduce should only be used if getstate cannot be used). Was there a reason getstate could not be used or would be significantly more complex? Given you are just slightly modifying a super call to reduce, I think reduce is pretty likely to be safe in this case.

Sort of related, but nothing to do directly with this PR: can we add __slots__?

Copy link
Member

@henryiii henryiii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this, ideally you can commend on __reduce__, but as long as it's clearly the best solution, I'm happy.


def __setstate__(self, state: typing.Any) -> None:
self.__dict__.update(state[-1])
super().__setstate__(state[0:-1]) # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to myself: we should enable --show-error-codes on mypy and require the error codes here, so we know why these are being ignored. I can do that after the PR.

@jpivarski
Copy link
Member

Sort of related, but nothing to do directly with this PR: can we add __slots__?

We could use __slots__. For the most part, I haven't had time to focus on Vector apart from keeping up with the most urgent bug reports (i.e. calling a function returns the wrong answer).

From what I know, __slots__ would be an optimization, but it might have other benefits you know of. It could be applied to Vector/MomentumNumpy*D and Vector/MomentumObject*D, but not the Awkward Array backend because that one doesn't have extra instance state.

@jpivarski
Copy link
Member

@cansik Give us an indication of whether you're done and Henry or I will squash-and-merge your PR. Thanks!

@henryiii
Copy link
Member

henryiii commented Jan 6, 2022

My favorite reason to add __slots__ is that it causes v.misspelled = 2 to start failing. If you still want a dict, it can give you extra control over the __dict__ by explicitly adding it to __slots__; then you can have things in __slots__ and things in __dict__.

@cansik
Copy link
Contributor Author

cansik commented Jan 6, 2022

@jpivarski & @henryiii Thanks for the review. I mainly followed the explanation in the stackoverflow answer and it seems that numpy does not support getstate, but reduce by default: https://stackoverflow.com/a/54487727/1138326

For this reason, I would leave the implementation as it is and I have nothing to add to it.

@jpivarski
Copy link
Member

numpy does not support getstate

That's definitely a good enough reason. NumPy's restrictions on subclasses are pretty tight.

Thanks, I'll merge it now!

@jpivarski jpivarski merged commit c207505 into scikit-hep:main Jan 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants