
Switch to dill//cloudpickle #558

Closed
piskvorky opened this issue Dec 6, 2015 · 12 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature

Comments

@piskvorky
Owner

Seems like cloudpickle is a new drop-in replacement for pickle, offering a wider range of functionality at no extra cost (no extra dependencies).

A bit like dill, which is older and probably more mature. @ogrisel do you know the difference?

Ticket: Investigate whether it's really the case such replacement is drop-in, especially for cross-platform (unix / win) and cross-python-version (2.x / 3.x) serializations. Document any limitations. If suitable, switch utils.pickle to use dill or cloudpickle internally.
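A minimal "is it really drop-in?" smoke test could look like the sketch below: round-trip a few representative objects through each candidate module and check the result against the input. The sample objects are illustrative, and dill/cloudpickle must be installed separately.

```python
# Sketch of a drop-in compatibility smoke test for the candidate serializers.
CANDIDATES = ['pickle', 'dill', 'cloudpickle']
SAMPLES = [42, 'text', [1, 2, 3], {'k': (1.0, None)}]

results = {}
for name in CANDIDATES:
    try:
        mod = __import__(name)
        # True only if every sample survives a dumps/loads round trip.
        results[name] = all(mod.loads(mod.dumps(s)) == s for s in SAMPLES)
    except ImportError:
        results[name] = None  # candidate not installed

print(results)
```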

@piskvorky piskvorky added feature Issue described a new feature difficulty easy Easy issue: required small fix labels Dec 6, 2015
@piskvorky piskvorky changed the title Switch to cloudpickle Switch to dill//cloudpickle Dec 6, 2015
@shirish93

Hey @piskvorky, would this be a fair plan of exploration?

  • Install dill/cloudpickle on Python 2.7/3.4 on Linux/Windows (so 2×2×2 = 8 different settings).
  • Throw all possible objects Python will allow / they can take into them.
  • Time them.
  • Publish the results.

I imagine there wouldn't be an explicit need to test with gensim, correct? Though I see that it's a one-line change to swap the modules.

Would there be something specific we're looking at, either performance-wise, or features-wise that would be worth watching out for?

For what it's worth, Stack Overflow seems to have an ok-ish comparison of the two libraries.

Apologies if I just popped in out of nowhere. I'd been lurking/looking for opportunities to contribute, and this seems like a very manageable first contribution.

@piskvorky
Owner Author

Thanks @shirish93 , that sounds like a good plan! And a great first contribution :)

I'd add py2.6 to the test for sure, we need to support that. And try serializing numpy arrays (and objects containing numpy arrays), that's a common use case.

I'm not so much worried about performance, speed is not critical here. If anything, size is more important (compression quality).

But what's really critical is the interoperability and stability: storing a "pickle" in py2.6, loading it back in 2.7 etc. If the serialization worked across py2/py3 boundary, that would be ideal, though I'm not sure if that's possible, given that both dill and cloudpickle support serializing code too.
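For plain data, the interoperability concern above can be addressed by pinning the pickle protocol: protocol 2 is the newest protocol that Python 2.6+ can still read, so pickles written under Python 3 remain loadable under 2.x. A minimal sketch (the `model_state` dict is an illustrative stand-in for a model); note this helps only for data, since pickled code objects, as dill/cloudpickle produce for lambdas, are tied to the interpreter's bytecode and generally do not port across versions.

```python
import pickle

# Illustrative stand-in for a model's picklable state.
model_state = {"vocab": {"word": 1}, "vectors": [0.1, 0.2]}

# Protocol 2 is readable by Python 2.6+ and all Python 3 versions,
# so a py3-written blob can be loaded back under py2 (for data only).
blob = pickle.dumps(model_state, protocol=2)
restored = pickle.loads(blob)
print(restored == model_state)
```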

@gojomo
Collaborator

gojomo commented Dec 8, 2015

What part of cloudpickle's wider functionality is desired?

@shirish93

After a preliminary exploration (Windows, Python 2.7 & Python 3.3, cloudpickle vs dill vs pickle), I would suggest we figure out a justification for the change, as @gojomo suggests, before further action.

On the plus side, they can pickle functions, lambdas, etc. well.

On the negative side, they are consistently 10x slower than plain pickle, across all types of native Python data structures and numpy arrays. Cloudpickle is slightly faster than dill, but they are both consistently 10x slower on my Windows machine. This was without using the compiled C libraries for anything, but I'm not sure how much difference that's going to make. I've started a discussion in their project to see if something can be done about it. (Update: Likely not.)

Since gensim doesn't use pickling extensively, the speed hit might not matter too much (and the task itself doesn't take too long anyway), but this could lead to unnecessary slowdowns if the 'extra' features of those libraries are not being used.

I'll do an extended analysis in all the environments depending on where you guys are leaning on it.

@piskvorky
Owner Author

Thanks @shirish93 !

Can you show your benchmarking code?

10x slowdown sounds a little too much. Perhaps we could let users choose the serialization lib dynamically, but I'm not very excited about introducing the extra API complexity needed for this.

Compiled C would speed up serialization for sure -- do dill/cloudpickle offer such an option?

Do they cross the py2.x/py3.x boundary? Say, can a function object stored in 2.6 be loaded in 3.4?

@piskvorky
Owner Author

@gojomo mostly lambda functions, having to name everything is tedious.
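The lambda use case is exactly where plain pickle falls down: it serializes functions by reference (module plus qualified name), so an anonymous lambda cannot be looked up again at load time, while cloudpickle serializes the function's code object by value. A small sketch (cloudpickle must be installed separately):

```python
import pickle

# Plain pickle stores functions by reference, so lambdas fail to pickle.
try:
    pickle.dumps(lambda x: x * 2)
    plain_pickle_ok = True
except Exception:
    plain_pickle_ok = False

print("plain pickle handles lambdas:", plain_pickle_ok)

# cloudpickle serializes the code object itself, so lambdas round-trip.
try:
    import cloudpickle
    double = cloudpickle.loads(cloudpickle.dumps(lambda x: x * 2))
    print("cloudpickle round-trip:", double(21))
except ImportError:
    print("cloudpickle not installed")
```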

@shirish93

Sorry for disappearing. Long holiday break, haha!

Here's my test code:

import timeit

picklers = ['pickle', 'dill', 'cloudpickle']

# timeit compiles the setup string as code, so it must carry no extra indentation
setup = '''
import pickle, dill, cloudpickle, random
import numpy as np
randArr = np.random.random(500)
randString = ''.join(chr(random.randint(32, 97)) for _ in range(10000))
randNum = random.randint(99, 9999999999)
'''

def timePickler(pickler):
    arrTime = timeit.timeit(pickler + '.dumps(randArr)', setup=setup, number=100000)
    stringTime = timeit.timeit(pickler + '.dumps(randString)', setup=setup, number=100000)
    numTime = timeit.timeit(pickler + '.dumps(randNum)', setup=setup, number=100000)
    print(pickler + " time for pickling single number: " + str(numTime))
    print(pickler + " time for pickling numpy Arr: " + str(arrTime))
    print(pickler + " time for pickling string Arr: " + str(stringTime))
    print('\n')

for pickler in picklers:
    timePickler(pickler)

And here's the results:

pickle time for pickling single number: 0.07997888903719286
pickle time for pickling numpy Arr: 1.7257532820085544
pickle time for pickling string Arr: 0.16252835749355654


dill time for pickling single number: 1.757354197975019
dill time for pickling numpy Arr: 26.846140637378994
dill time for pickling string Arr: 2.273730786280794


cloudpickle time for pickling single number: 1.109232256508676
cloudpickle time for pickling numpy Arr: 23.424985621413725
cloudpickle time for pickling string Arr: 1.4475591140499091

It might make sense to use them for the lambdas though, if there's not a lot of pickling/depickling going on.

>>> timeit.timeit('cloudpickle.dumps(lambda x: x)', setup = setup, number = 1000)
0.25640227041549224
>>> timeit.timeit('dill.dumps(lambda x: x)', setup = setup, number = 1000)
0.13595728227574

Does this help?

@piskvorky piskvorky added difficulty medium Medium issue: required good gensim understanding & python skills and removed difficulty easy Easy issue: required small fix labels Jan 12, 2016
@mmckerns

mmckerns commented May 6, 2016

Hi, I'm the dill author. A typical, and fairly easy, choice that many packages (IPython, mpi4py, Pyro4, …) have made is to abstract the serializer. That way, you keep your performance, but let the user trade performance for robustness.

Both cloudpickle and dill provide drop-in replacements, while dill also has some additional features (e.g. source code extraction) and serialization options (e.g. include dependencies). I have made some recent optimizations to improve the timing, but I'm sure that both dill and cloudpickle will always be generally slower than pickle.

Personally, I don't use dill directly too much -- however, I do use it pretty extensively within my drop-in replacements of multiprocessing and pp (i.e. multiprocess and ppft), as it makes converting code to parallel/distributed pretty trivial, and that in itself is a huge win. If you find that you have feature requests for dill, file an issue.

@pranay360
Contributor

@tmylk Is it required to change utils.pickle to accommodate dill and cloudpickle, or is the benchmarking done by @shirish93 sufficient to infer that, due to speed issues, neither library should be used?

@tmylk
Contributor

tmylk commented Sep 25, 2016

Hi @pranay360. The goal is to create an abstract serializer, with the default being pickle as it is now. A good implementation of this abstraction is in Pyro. Feel free to copy that approach.

@tmylk
Contributor

tmylk commented Oct 5, 2016

This would fix #913

@menshikh-iv
Contributor

Partially fixed in #1039
