
Switch to dill//cloudpickle #558

Closed
piskvorky opened this issue Dec 6, 2015 · 12 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature

Comments

@piskvorky
Owner

Seems like cloudpickle is a new drop-in replacement for pickle, offering a wider range of functionality at no extra cost (no extra dependencies).

A bit like dill, which is older and probably more mature. @ogrisel do you know the difference?

Ticket: Investigate whether it's really the case such replacement is drop-in, especially for cross-platform (unix / win) and cross-python-version (2.x / 3.x) serializations. Document any limitations. If suitable, switch utils.pickle to use dill or cloudpickle internally.
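A minimal "is it really drop-in?" smoke test could look like the sketch below: round-trip a few representative objects through each candidate module and check the result against the input. The sample objects are illustrative, and dill/cloudpickle must be installed separately.

```python
# Sketch of a drop-in compatibility smoke test for the candidate serializers.
CANDIDATES = ['pickle', 'dill', 'cloudpickle']
SAMPLES = [42, 'text', [1, 2, 3], {'k': (1.0, None)}]

results = {}
for name in CANDIDATES:
    try:
        mod = __import__(name)
        # True only if every sample survives a dumps/loads round trip.
        results[name] = all(mod.loads(mod.dumps(s)) == s for s in SAMPLES)
    except ImportError:
        results[name] = None  # candidate not installed

print(results)
```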

@piskvorky piskvorky added feature Issue described a new feature difficulty easy Easy issue: required small fix labels Dec 6, 2015
@piskvorky piskvorky changed the title Switch to cloudpickle Switch to dill//cloudpickle Dec 6, 2015
@shirish93

Hey @piskvorky, would this be a fair plan of exploration?

  • Install dill/cloudpickle on Python 2.7/3.4 on Linux/Windows (so 2×2×2 = 8 different settings).
  • Throw all possible objects Python will allow / they can take into them.
  • Time them.
  • Publish the results.

I imagine there wouldn't be an explicit need to test with gensim, correct? Though I see that it's a one-line change to swap the modules.

Would there be something specific we're looking at, either performance-wise, or features-wise that would be worth watching out for?

For what it's worth, Stack Overflow seems to have an ok-ish comparison of the two libraries.

Apologies if I just popped in out of nowhere. I'd been lurking/looking for opportunities to contribute, and this seems like a very manageable first contribution.

@piskvorky
Owner Author

Thanks @shirish93 , that sounds like a good plan! And a great first contribution :)

I'd add py2.6 to the test for sure, we need to support that. And try serializing numpy arrays (and objects containing numpy arrays), that's a common use case.

I'm not so much worried about performance, speed is not critical here. If anything, size is more important (compression quality).

But what's really critical is the interoperability and stability: storing a "pickle" in py2.6, loading it back in 2.7 etc. If the serialization worked across py2/py3 boundary, that would be ideal, though I'm not sure if that's possible, given that both dill and cloudpickle support serializing code too.
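For plain data, the interoperability concern above can be addressed by pinning the pickle protocol: protocol 2 is the newest protocol that Python 2.6+ can still read, so pickles written under Python 3 remain loadable under 2.x. A minimal sketch (the `model_state` dict is an illustrative stand-in for a model); note this helps only for data, since pickled code objects, as dill/cloudpickle produce for lambdas, are tied to the interpreter's bytecode and generally do not port across versions.

```python
import pickle

# Illustrative stand-in for a model's picklable state.
model_state = {"vocab": {"word": 1}, "vectors": [0.1, 0.2]}

# Protocol 2 is readable by Python 2.6+ and all Python 3 versions,
# so a py3-written blob can be loaded back under py2 (for data only).
blob = pickle.dumps(model_state, protocol=2)
restored = pickle.loads(blob)
print(restored == model_state)
```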

@gojomo
Collaborator

gojomo commented Dec 8, 2015

What part of cloudpickle's wider functionality is desired?

@shirish93

After a preliminary exploration (Windows, Python 2.7 & Python 3.3, cloudpickle vs dill vs pickle), I would suggest we figure out a justification for the change, as @gojomo suggests, before further action.

On the plus side, they can pickle functions, lambdas, etc. well.

On the negative side, they are consistently 10x slower than plain pickle, across all types of native Python data structures and numpy arrays. Cloudpickle is slightly faster than dill, but they are both consistently 10x slower on my Windows machine. This was without using the compiled C libraries for anything, but I'm not sure how much difference that's going to make. I've started a discussion in their project to see if something can be done about it. (Update: Likely not.)

Since gensim doesn't use pickling extensively, the speed hit might not matter too much (and the task itself doesn't take too long anyway), but this could lead to unnecessary slowdowns if the 'extra' features of those libraries are not being used.

I'll do an extended analysis in all the environments depending on where you guys are leaning on it.

@piskvorky
Owner Author

Thanks @shirish93 !

Can you show your benchmarking code?

10x slowdown sounds a little too much. Perhaps we could let users choose the serialization lib dynamically, but I'm not very excited about introducing the extra API complexity needed for this.

Compiled C would speed up serialization for sure -- do dill/cloudpickle offer such an option?

Do they cross the py2.x/py3.x boundary? Say, can a function object stored in 2.6 be loaded in 3.4?

@piskvorky
Owner Author

@gojomo mostly lambda functions, having to name everything is tedious.
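The lambda use case is exactly where plain pickle falls down: it serializes functions by reference (module plus qualified name), so an anonymous lambda cannot be looked up again at load time, while cloudpickle serializes the function's code object by value. A small sketch (cloudpickle must be installed separately):

```python
import pickle

# Plain pickle stores functions by reference, so lambdas fail to pickle.
try:
    pickle.dumps(lambda x: x * 2)
    plain_pickle_ok = True
except Exception:
    plain_pickle_ok = False

print("plain pickle handles lambdas:", plain_pickle_ok)

# cloudpickle serializes the code object itself, so lambdas round-trip.
try:
    import cloudpickle
    double = cloudpickle.loads(cloudpickle.dumps(lambda x: x * 2))
    print("cloudpickle round-trip:", double(21))
except ImportError:
    print("cloudpickle not installed")
```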

@shirish93

Sorry for disappearing. Long holiday break, haha!

Here's my test code:

import timeit

picklers = ['pickle', 'dill', 'cloudpickle']

# timeit compiles the setup string as code, so it must carry no extra indentation
setup = '''
import pickle, dill, cloudpickle, random
import numpy as np
randArr = np.random.random(500)
randString = ''.join(chr(random.randint(32, 97)) for _ in range(10000))
randNum = random.randint(99, 9999999999)
'''

def timePickler(pickler):
    arrTime = timeit.timeit(pickler + '.dumps(randArr)', setup=setup, number=100000)
    stringTime = timeit.timeit(pickler + '.dumps(randString)', setup=setup, number=100000)
    numTime = timeit.timeit(pickler + '.dumps(randNum)', setup=setup, number=100000)
    print(pickler + " time for pickling single number: " + str(numTime))
    print(pickler + " time for pickling numpy Arr: " + str(arrTime))
    print(pickler + " time for pickling string Arr: " + str(stringTime))
    print('\n')

for pickler in picklers:
    timePickler(pickler)

And here's the results:

pickle time for pickling single number: 0.07997888903719286
pickle time for pickling numpy Arr: 1.7257532820085544
pickle time for pickling string Arr: 0.16252835749355654


dill time for pickling single number: 1.757354197975019
dill time for pickling numpy Arr: 26.846140637378994
dill time for pickling string Arr: 2.273730786280794


cloudpickle time for pickling single number: 1.109232256508676
cloudpickle time for pickling numpy Arr: 23.424985621413725
cloudpickle time for pickling string Arr: 1.4475591140499091

It might make sense to use them for the lambdas though, if there's not a lot of pickling/depickling going on.

>>> timeit.timeit('cloudpickle.dumps(lambda x: x)', setup = setup, number = 1000)
0.25640227041549224
>>> timeit.timeit('dill.dumps(lambda x: x)', setup = setup, number = 1000)
0.13595728227574

Does this help?

@piskvorky piskvorky added difficulty medium Medium issue: required good gensim understanding & python skills and removed difficulty easy Easy issue: required small fix labels Jan 12, 2016
@mmckerns

mmckerns commented May 6, 2016

Hi, I'm the dill author. A typical, and fairly easy, choice that many packages (IPython, mpi4py, Pyro4, …) have made is to abstract the serializer. That way, you keep your performance, but let the user trade performance for robustness.

Both cloudpickle and dill provide drop-in replacements, while dill also has some additional features (e.g. source code extraction) and serialization options (e.g. include dependencies). I have made some recent optimizations to improve the timing, but I'm sure that both dill and cloudpickle will always be generally slower than pickle.

Personally, I don't use dill directly too much -- however, I do use it pretty extensively within my drop-in replacements of multiprocessing and pp (i.e. multiprocess and ppft), as it makes converting code to parallel/distributed pretty trivial, and that in itself is a huge win. If you find that you have feature requests for dill, file an issue.

@pranay360
Contributor

@tmylk Is it required to change utils.pickle to accommodate dill and cloudpickle, or is the benchmarking done by @shirish93 sufficient to infer that, due to speed issues, neither library should be used?

@tmylk
Contributor

tmylk commented Sep 25, 2016

Hi @pranay360. The goal is to create an abstract serializer, with the default being pickle as it is now. A good implementation of this abstraction is in Pyro. Feel free to copy that approach.

@tmylk
Contributor

tmylk commented Oct 5, 2016

This would fix #913

@menshikh-iv
Contributor

Partially fixed in #1039
