
Single pass serialization #4699

Closed · wants to merge 25 commits

Conversation

madsbk (Contributor) commented Apr 13, 2021

This PR streamlines serialization in Distributed by relying on msgpack and only falling back to the serialize()/deserialize() infrastructure when encountering objects that msgpack does not support.

  • Tests added / passed
  • Passes black distributed / flake8 distributed / isort distributed

madsbk marked this pull request as ready for review April 16, 2021 15:06
madsbk (Contributor, Author) commented Apr 16, 2021

@jrbourbeau @mrocklin @jakirkham, this is ready for the first round of reviews :)

try:
    with cache_dumps_lock:
        result = cache_dumps[func]
except KeyError:
Member:

I still don't have a good feel for what timescales we're currently optimizing for, or whether this is a particularly performance-critical section, so this comment might be irrelevant.
However, exception handling is relatively expensive, and if we encounter a lot of cache misses an `in` membership check should be faster. That's ns-level optimization; I could imagine the pickling is usually an order of magnitude slower, in which case this doesn't matter at all.
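For reference, a minimal sketch of the two lookup styles under discussion (the plain dict stands in for the LRU cache in distributed.worker; actual timings will vary):

    import threading

    cache_dumps = {}  # stand-in for the LRU cache in distributed.worker
    cache_dumps_lock = threading.Lock()

    def lookup_eafp(func):
        # EAFP: cheapest on cache hits, but each miss pays for a raised KeyError.
        try:
            with cache_dumps_lock:
                return cache_dumps[func]
        except KeyError:
            return None

    def lookup_lbyl(func):
        # LBYL: an `in` check avoids exception overhead when misses dominate.
        with cache_dumps_lock:
            if func in cache_dumps:
                return cache_dumps[func]
        return None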

    with cache_dumps_lock:
        result = cache_dumps[func]
except KeyError:
    result = pickle.dumps(func, protocol=4)
Member:

Any reason why protocol=4 is hard-coded?

Member:

I'm also curious about this

madsbk (Contributor, Author) commented Apr 26, 2021

I am curious too :) This is taken directly from the existing code:

result = pickle.dumps(func, protocol=4)

@jakirkham do you know?

Member:

That line comes from @mrocklin's PR (#4019), which allowed connections to dynamically determine which compression and pickle protocols are supported and then use them in communication. In a few places I think Matt found it easier to simply force pickle protocol 4 than to make it configurable. So if this is coming from that worker code, that is the history.
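For context, a rough sketch of the kind of negotiation #4019 introduced; the remote_max_protocol parameter here is a made-up illustration, not the actual API:

    import pickle

    def dumps_negotiated(obj, remote_max_protocol=None):
        # Protocol 4 is a safe floor: available since Python 3.4 and able to
        # handle objects larger than 4 GB. A newer protocol is used only when
        # a (hypothetical) handshake reports the peer supports it.
        protocol = 4
        if remote_max_protocol is not None:
            protocol = min(pickle.HIGHEST_PROTOCOL, remote_max_protocol)
        return pickle.dumps(obj, protocol=protocol)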

        result = cache_dumps[func]
except KeyError:
    result = pickle.dumps(func, protocol=4)
    if len(result) < 100000:
Member:

I think we can be a bit more generous with the cache size. Currently we're at 100 (LRU maxsize) × 100,000 B (max entry size) ≈ 10 MB. Considering how much stuff we're logging without taking size into account, I would suggest being more generous with this upper limit, since large results are the juicy cache hits.
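One hypothetical way to act on this: bound the cache by total stored bytes rather than entry count, so large entries are admitted but the overall footprint stays fixed (a sketch, not distributed's LRU implementation):

    from collections import OrderedDict

    class ByteBudgetLRU:
        """Hypothetical cache that evicts by total stored bytes, not entry count."""

        def __init__(self, max_bytes=100_000_000):
            self.max_bytes = max_bytes
            self.total = 0
            self.data = OrderedDict()

        def __setitem__(self, key, value):
            if key in self.data:
                self.total -= len(self.data.pop(key))
            self.data[key] = value
            self.total += len(value)
            # Evict least-recently-used entries until we fit the budget,
            # always keeping the entry we just inserted.
            while self.total > self.max_bytes and len(self.data) > 1:
                _, evicted = self.data.popitem(last=False)
                self.total -= len(evicted)

        def __getitem__(self, key):
            value = self.data[key]
            self.data.move_to_end(key)  # mark as recently used
            return value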

Collaborator:

Related: in loads_function, what if we used hash(bytes_object) as the key instead of bytes_object itself? Then we wouldn't have to hang onto references to those large bytestrings that we won't look at again.
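A minimal sketch of that idea, using a hashlib digest rather than hash() since Python's hash() is salted per process and can collide; keying on the 20-byte digest means the cache no longer keeps the large bytestring alive:

    import hashlib
    import pickle

    cache_loads = {}  # stand-in for the LRU in distributed.worker

    def loads_function(bytes_object):
        # Key on a digest so the cache holds 20 bytes per entry, not the
        # original serialized function.
        key = hashlib.sha1(bytes_object).digest()
        try:
            return cache_loads[key]
        except KeyError:
            result = pickle.loads(bytes_object)
            cache_loads[key] = result
            return result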

Member:

It sounds like that was just copied and moved over from here

if len(result) < 100000:

Perhaps we can make a new issue and revisit?

madsbk (Contributor, Author):

> Perhaps we can make a new issue and revisit?

My plan is to remove worker.dumps_function() completely; it shouldn't be necessary to call it explicitly.

Member:

Ah, OK. In that case I don't think the protocol=4 bit above will be needed.


def loads_function(bytes_object):
    """Load a function from bytes, cache bytes"""
    if len(bytes_object) < 100000:
Member:

Personal preference: I would put the size limit of the cache into a constant so that the two functions don't drift apart.
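Something along these lines; the constant name is illustrative and the function body is trimmed down to the relevant check:

    import pickle

    FUNC_CACHE_BYTE_LIMIT = 100_000  # shared so the two limits cannot drift apart

    cache_dumps = {}

    def dumps_function(func):
        result = pickle.dumps(func, protocol=4)
        if len(result) < FUNC_CACHE_BYTE_LIMIT:
            cache_dumps[func] = result
        return result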

madsbk changed the title from "[WIP] Single pass serialization" to "Single pass serialization" on Apr 22, 2021
madsbk (Contributor, Author) commented Apr 22, 2021

@fjetter good points! When we settle on an overall design I will incorporate your suggestions.

Right now I am waiting on @jrbourbeau @mrocklin to review the overall design before continuing :)

gjoseph92 (Collaborator) left a comment:

This does seem a bit cleaner and simpler to me, without being a fundamental change, which is nice. I haven't thought more carefully about the implications yet though.


@@ -20,6 +21,8 @@
dask_deserialize = dask.utils.Dispatch("dask_deserialize")

_cached_allowed_modules = {}
non_list_collection_types = (tuple, set, frozenset)
Collaborator:

Are set and frozenset necessary here, since they can't contain lists or dicts, even recursively within tuples?

>>> {([])}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
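For contrast, tuples do need the recursive treatment, since they can nest lists at any depth:

>>> (1, [2, 3], (4, [5]))
(1, [2, 3], (4, [5]))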

Comment on lines +303 to +306
and (
    "pickle" not in serializers
    or serializers.index("pickle") > serializers.index("msgpack")
)
Collaborator:

Do we still care about whether pickle is used or not, now that we have msgpack_persist_lists?

Related: what happens if a MsgpackList gets pickled? Won't it be passed on (in a task, say) as a MsgpackList, not a plain list? Whereas msgpack_decode_default returns them as plain lists.
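For background on why a wrapper type exists at all: msgpack has a single sequence type, so tuples come back as lists by default. A quick illustration, independent of this PR's MsgpackList class:

    import msgpack

    # msgpack packs lists and tuples identically; unpacking yields lists.
    packed = msgpack.packb((1, 2, (3, 4)))
    print(msgpack.unpackb(packed))  # [1, 2, [3, 4]]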

    return {"__Set__": True, "as-tuple": tuple(obj)}

if typ is MsgpackList:
    return {"__MsgpackList__": True, "as-tuple": tuple(obj.data)}
Collaborator:

Suggested change:
-    return {"__MsgpackList__": True, "as-tuple": tuple(obj.data)}
+    return {"__MsgpackList__": True, "as-tuple": obj.data}

What would happen if we did this instead? obj.data should already be a list, so I'm wondering if the extra copy to a tuple is necessary.
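A rough sketch of the encode/decode pair this suggestion touches; the names follow the PR, but the bodies are illustrative rather than the actual implementation:

    class MsgpackList:
        """Wrapper marking a list that should survive msgpack round-trips."""

        def __init__(self, data):
            self.data = data

    def msgpack_encode_default(obj):
        if isinstance(obj, MsgpackList):
            # obj.data is already a list; msgpack packs lists and tuples
            # the same way, so the tuple(...) copy may be avoidable.
            return {"__MsgpackList__": True, "as-tuple": obj.data}
        return obj

    def msgpack_decode_default(obj):
        if "__MsgpackList__" in obj:
            return list(obj["as-tuple"])
        return obj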

mrocklin (Member) left a comment:

In general things here seem ok. There are issues around passing through the list of serializers. We need to make sure that we can turn pickle off.

Comment on lines +1784 to +1785
# With <https://github.com/dask/distributed/pull/4699>,
# deserialization is done as part of communication.
Member:

@jrbourbeau I think that you might want to be aware of this change

if typ in (Serialized, SerializedCallable):
    sub_header, sub_frames = obj.header, obj.frames
elif callable(obj):
    sub_header, sub_frames = {"callable": dumps_function(obj)}, []
Member:

Functions can be quite large sometimes, for example if users close over large variables out of function scope. Msgpack may not handle this well in some cases

import numpy as np

x = np.arange(1000000000)

def f(y):
    return y + x.sum()

Obviously users shouldn't do this, but they will.

Member:

It looks like we're bypassing the list of serializers here. This lets users get past configurations where pickle has been specifically turned off.

sub_header, sub_frames = serialize_and_split(
    obj, serializers=serializers, on_error=on_error, context=context
)
_inplace_compress_frames(sub_header, sub_frames)
Member:

The in-place stuff always makes me uncomfortable. Thoughts on making new header/frames dict/lists here instead?

For reference, it was these sorts of in-place operations that previously caused us to run into the msgpack tuple-vs-list difference. I think avoiding them when we can is useful, unless there is a large performance boost (which I wouldn't expect here).
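One hypothetical non-mutating shape for this, returning fresh header/frames instead of compressing in place; _maybe_compress here is a toy stand-in, not distributed's actual compression helper:

    import zlib

    def _maybe_compress(frame, min_size=10_000):
        # Toy stand-in: compress only frames large enough to be worth it.
        if len(frame) >= min_size:
            return "zlib", zlib.compress(frame)
        return None, frame

    def compress_frames(header, frames):
        """Return a fresh (header, frames) pair rather than mutating the inputs."""
        compression, new_frames = [], []
        for frame in frames:
            algo, data = _maybe_compress(frame)
            compression.append(algo)
            new_frames.append(data)
        new_header = {**header, "compression": tuple(compression)}
        return new_header, new_frames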

Comment on lines +150 to +151
if deserialize == "delay-exception":
    return DelayedExceptionRaise(e)
Member:

I am confused about when this is necessary and why it wasn't needed before. I'm wary of creating new systems like this if we can avoid it.

Member:

I think I understand this now that I've seen the c.submit(identity, Foo()) test below

# `__MsgpackList__`, we decode it here explicitly. This way
# we can delay the conversion to a regular `list` until it
# gets to a worker.
if "__MsgpackList__" in obj:
Member:

What is the type of obj here? Is `in` the right test here, or does this special value live in a more specific place?

header, frames = serialize([[[x]]])
assert "dask" in str(header)
assert len(frames) == 1
assert x.data == np.frombuffer(frames[0]).data
Member:

I'm curious, why did we drop this test?

@@ -4628,8 +4624,6 @@ async def test_recreate_error_futures(c, s, a, b):

    function, args, kwargs = await c._recreate_error_locally(f)
    assert f.status == "error"
    assert function.__name__ == "div"
    assert args == (1, 0)
Member:

I'm curious, what happened here?

assert results == list(map(inc, range(10)))
assert a.data and b.data
Member:

Hrm, you mentioned this in a meeting a couple of weeks ago. I see now how this is unfortunate.

I would expect this test to now be written as:

with pytest.raises(CancelledError):
    await c.submit(identity, Foo())

I wouldn't expect the other lines here to be indented. In general seeing assert statements under a raises context manager is a sign that something is unclean :)

madsbk (Contributor, Author) commented Apr 27, 2021

I have changed it to make it clearer what is going on:

    # Notice, because serialization is delayed until `distributed.batched`,
    # we don't get an exception immediately. The exception is raised and logged
    # when the ongoing communication between the client and the scheduler
    # encounters the `Foo` class. Before <https://github.com/dask/distributed/pull/4699>
    # the serialization happened immediately in `submit()`, which would raise
    # `MyException`.
    with captured_logger("distributed") as caplog:
        future = c.submit(identity, Foo())
        # We sleep to make sure that a `BatchedSend.interval` has passed.
        await asyncio.sleep(c.scheduler_comm.interval)
    # Check that the serialization error was logged
    assert "Failed to serialize" in caplog.getvalue()

I'm curious, what do you think of the approach? We cannot easily catch the exception because it happens as part of the ongoing communication and not in the submit() call, but at least we log the exception.

with pytest.raises(TypeError):
    await c.run_on_scheduler(lambda: inc)
await c.run(lambda: inc)
await c.run_on_scheduler(lambda: inc)
Member:

If the user has specified that they don't want to allow serialization with pickle, then these should continue to fail. Probably we need to feed the list of serializers down wherever serialize is being called. I expect that this might be awkward to do when going through the msgpack machinery. Maybe there is some global that we can misuse?
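One hypothetical shape for such a "global": a contextvars.ContextVar consulted from inside the msgpack default hook, where no serializers= argument can reach. The names here are made up for illustration:

    import contextvars
    import pickle

    # Hypothetical: serializers permitted for the communication in progress.
    current_serializers = contextvars.ContextVar(
        "serializers", default=("dask", "msgpack", "pickle")
    )

    def encode_callable(obj):
        # Called from deep inside the msgpack default hook.
        if "pickle" not in current_serializers.get():
            raise TypeError(f"Cannot serialize {obj!r}: pickle is disabled")
        return pickle.dumps(obj, protocol=4)

    # A comm that disallows pickle sets the context before encoding:
    token = current_serializers.set(("msgpack",))
    try:
        encode_callable(len)
    except TypeError as e:
        print(e)  # Cannot serialize <built-in function len>: pickle is disabled
    finally:
        current_serializers.reset(token)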

@@ -771,8 +771,7 @@ async def f():
    await server.listen("tcp://")

    async with rpc(server.address, serializers=["msgpack"]) as r:
        with pytest.raises(TypeError):
            await r.echo(x=to_serialize(inc))
        await r.echo(x=to_serialize(inc))
Member:

These sorts of changes are probably not ok. They fundamentally change the intent of the test, which is to ensure that things like this can be disallowed.

madsbk (Contributor, Author):

This has been fixed.


Comment on lines +24 to +27
cache_dumps = LRU(maxsize=100)
cache_loads = LRU(maxsize=100)
cache_dumps_lock = threading.Lock()
cache_loads_lock = threading.Lock()
Member:

Since we are discussing getting rid of dumps_function, will these still be needed or will they go away as well?
