-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to dill//cloudpickle #558
Comments
Hey @piskvorky ,Would this be a fair plan of exploration?
I imagine there wouldn't be an explicit need to test with gensim, correct? Though I see that it's a one-line change to swap the modules. Would there be something specific we're looking at, either performance-wise, or features-wise that would be worth watching out for? For what it's worth stackoverflow seems to have an ok-ish comparison of the two libraries. Apologies if I just popped in out of nowhere. I'd been lurking/looking for opportunities to contribute, and this seems like a very manageable first contribution. |
Thanks @shirish93 , that sounds like a good plan! And a great first contribution :) I'd add py2.6 to the test for sure, we need to support that. And try serializing numpy arrays (and objects containing numpy arrays), that's a common use case. I'm not so much worried about performance, speed is not critical here. If anything, size is more important (compression quality). But what's really critical is the interoperability and stability: storing a "pickle" in py2.6, loading it back in 2.7 etc. If the serialization worked across py2/py3 boundary, that would be ideal, though I'm not sure if that's possible, given that both dill and cloudpickle support serializing code too. |
What part of couldpickle's wider functionality is desired? |
After a preliminary exploration(Windows, Python 2.7 & Python 3.3, Cloudpickle vs Dill, vs Pickle), I would suggest we figure out a justification for the change as @gojomo suggests before further action. On the plus side, they can pickle in functions, etc well. On the -ve side, they are consistently 10x slower than plain pickle, across all types of native python data structures and numpy arrays. Cloudpickle is slightly faster than Dill, but they are both consistently 10x slower on my Window machine. This was without using the compiled C libraries for anything, but I'm not sure how much that's going to make a difference. I've started a discussion in their project to see if something can be done about it. (Update: Likely not.) Since gensim doesn't extensively use pickling, the speed hit might not matter too much (and the task itself doesn't take too long anyway), but this could lead to unnecessary slowdowns if the 'extra' features of those libraries are not being used. I'll do an extended analysis in all the environments depending on where you guys are leaning on it. |
Thanks @shirish93 ! Can you show your benchmarking code? 10x slowdown sounds a little too much. Perhaps we could let users choose the serialization lib dynamically, but I'm not very excited about introducing the extra API complexity needed for this. Compiled C would speed up serializations for sure -- do dill/cloudpickle offer such option? Do they cross this py2.x-py3.x boundary, say can a function object stored in 2.6 be loaded in 3.4? |
@gojomo mostly lambda functions, having to name everything is tedious. |
Sorry for disappearing. Loong holiday break, haha! Here's my test code:
And here's the results:
It might make sense to use them for the lambdas though, if there's not a lot of pickling/depickling going on.
Does this help? |
Hi, I'm the |
@tmylk Is it required to change utils.pickle to accommodate dill and cloudpickle or is bench-marking done by @shirish93 sufficient to infer that due to speed issues both the libraries must not be used? |
Hi @pranay360 The goal is to create an abstract serializer with default being pickle as it is now. Good implementation of this abstraction is in Pyro. Feel free to copy that approach. |
This would fix #913 |
Partially fixed in #1039 |
Seems like cloudpickle is a new drop-in replacement for pickle, offering a wider range of functionality at no extra cost (no extra dependencies).
A bit like dill, which is older and probably more mature. @ogrisel do you know the difference?
Ticket: Investigate whether it's really the case such replacement is drop-in, especially for cross-platform (unix / win) and cross-python-version (2.x / 3.x) serializations. Document any limitations. If suitable, switch
utils.pickle
to use dill or cloudpickle internally.The text was updated successfully, but these errors were encountered: