Decouple serialization, frame splitting, and compression in protocol #5150

mrocklin · 2021-08-01T17:56:15Z

Currently compression and frame splitting are tightly interwoven with
traversing through messages. This can be efficient, but results in a
complex system where it's hard to reason about when things get split or
compressed (indeed, this led to a difficult-to-track-down bug with
frame splitting).

This commit separates these processes into three distinct stages:

1. Serialize all objects into frames
2. Split large frames
3. Compress compressible frames
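To make the three stages concrete, here is a minimal sketch using only the standard library. It is not the code in this PR: `pickle` and `zlib` stand in for the real serializers and compressors, and `SPLIT_SIZE`, `serialize`, `split`, and `compress` are hypothetical names chosen for illustration.

```python
import pickle
import zlib

SPLIT_SIZE = 64  # hypothetical shard size in bytes; a real limit would be far larger


def serialize(objs):
    """Stage 1: turn every object into one bytes frame (pickle is a stand-in)."""
    return [pickle.dumps(obj) for obj in objs]


def split(frames, n=SPLIT_SIZE):
    """Stage 2: cut every frame longer than n bytes into pieces of at most n bytes."""
    out = []
    for frame in frames:
        if len(frame) > n:
            out.extend(frame[i:i + n] for i in range(0, len(frame), n))
        else:
            out.append(frame)
    return out


def compress(frames):
    """Stage 3: keep the compressed form only when it is actually smaller."""
    out = []
    for frame in frames:
        packed = zlib.compress(frame)
        if len(packed) < len(frame):
            out.append(("zlib", packed))
        else:
            out.append((None, frame))
    return out


# Each stage sees only a flat list of frames, so the rules apply uniformly.
frames = compress(split(serialize(["x" * 1000, 42])))
```

Because no stage knows about message structure, splitting and compression behave the same way for every frame. The sketch deliberately keeps no record of what was split, so on its own it cannot be reversed.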

This results in a much more uniform application of splitting and
compressing. However, this comes with a couple of undesired effects.

1. We add a new header if either splitting or compressing has occurred
2. We no longer avoid decompression when we don't want to deserialize
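As a rough illustration of why an extra header appears, the sketch below covers only the compression half. The header layout (a dict with a `"compression"` list) and the names `compress_frames` and `decompress_frames` are assumptions for this example, not the format used in the PR, and the sketch always emits the header rather than only when needed.

```python
import zlib


def compress_frames(frames):
    """Compress each frame if that helps, and record what was done in a header."""
    compressions, out = [], []
    for frame in frames:
        packed = zlib.compress(frame)
        if len(packed) < len(frame):
            compressions.append("zlib")
            out.append(packed)
        else:
            compressions.append(None)  # frame is sent unchanged
            out.append(frame)
    # Hypothetical header: one compression name (or None) per frame.
    return {"compression": compressions}, out


def decompress_frames(header, frames):
    """Undo compress_frames; impossible without the header."""
    return [
        zlib.decompress(frame) if name == "zlib" else frame
        for name, frame in zip(header["compression"], frames)
    ]
```

The receiver cannot tell a compressed frame from a raw one by looking at the bytes, so this bookkeeping has to travel with the message.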

There is probably a clean way to achieve most/all of our goals here.
I wanted to push this up to start this conversation.

cc @quasiben @jakirkham @madsbk

Today we try to split up large messages in comms. This is useful in a few situations:

1. Websockets, which often pass frames through middleware that requires small messages
2. TLS, which fails on some OpenSSL versions with frames above the size of an int

We correctly cut up data frames into smaller pieces to address these issues. However, we don't apply this same logic to the header frame, which may still contain very large bytestrings.

This commit adds a workaround in protocol dumps/loads which watches for this event and splits the header frame up if necessary. It works, but it's not very smooth. I would prefer that in the future we think about what a proper header should look like and ensure that it contains no user data. In the meantime this should help.
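One way such a header workaround could look is sketched below. This is an assumed mechanism for illustration only, not the actual change: `MAX_FRAME`, `dump_frames`, `load_frames`, and the 8-byte prelude that counts header pieces are all hypothetical.

```python
MAX_FRAME = 32  # hypothetical per-frame limit in bytes


def dump_frames(header, data_frames, limit=MAX_FRAME):
    """Split an oversized header into pieces and prepend a count of those pieces."""
    parts = [header[i:i + limit] for i in range(0, len(header), limit)] or [b""]
    prelude = len(parts).to_bytes(8, "little")  # tells the reader how many pieces follow
    return [prelude] + parts + list(data_frames)


def load_frames(frames):
    """Rejoin the header pieces and return (header, remaining data frames)."""
    count = int.from_bytes(frames[0], "little")
    header = b"".join(frames[1:1 + count])
    return header, list(frames[1 + count:])
```

The prelude frame is always small, so it never needs splitting itself.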

madsbk · 2021-08-02T08:25:04Z

I think this is a very good idea but it made me wonder: why do we split frames?
One reason is the limitation of compression libraries that might not support arbitrarily sized buffers. Are there any other reasons?

Because if compression libraries are the only reason, I think we should delegate the splitting and un-splitting to the compression libraries. That is, for the compression libraries that don't support arbitrary buffer sizes, we wrap the compression calls in split and un-split functions. I think this will make the design even simpler.
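A minimal sketch of that wrapping idea, assuming a length-prefixed block layout that I made up for the example: `wrap_with_splitting` is a hypothetical helper, and `zlib` only stands in for a library with a buffer-size limit (zlib itself has no such limit here).

```python
import struct
import zlib


def wrap_with_splitting(compress, decompress, max_size):
    """Return (compress, decompress) variants that never pass more than max_size bytes at once."""

    def chunked_compress(data):
        out = []
        for i in range(0, len(data), max_size):
            block = compress(data[i:i + max_size])
            out.append(struct.pack("<Q", len(block)))  # 8-byte length prefix per block
            out.append(block)
        return b"".join(out)

    def chunked_decompress(blob):
        out, pos = [], 0
        while pos < len(blob):
            (size,) = struct.unpack_from("<Q", blob, pos)
            pos += 8
            out.append(decompress(blob[pos:pos + size]))
            pos += size
        return b"".join(out)

    return chunked_compress, chunked_decompress


compress, decompress = wrap_with_splitting(zlib.compress, zlib.decompress, max_size=1024)
```

With this shape the rest of the protocol sees one buffer in and one buffer out, and the size limit stays an internal detail of the wrapped library.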

mrocklin · 2021-08-02T12:27:08Z

Various pieces of network machinery don't like large frames.

Compression (as you mention)
Some versions of OpenSSL
Websockets

I wouldn't be surprised if we found more if we turned it off completely. There is a lot of strange software in the middle of corporate networks :)

madsbk

> Various pieces of network machinery don't like large frames.

In that case, I think this is a good idea :)

The PR looks good to me, but I suggest that you implement the "split large frames" code and the "merge and decompress frames" code as two separate functions. This will make the symmetry between splitting and merging clearer.
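Something along these lines is what I have in mind for the splitting half. It is only a sketch: `split_frames`, `merge_frames`, and the `counts` list are hypothetical names, not what is in the diff.

```python
def split_frames(frames, n):
    """Cut each frame into pieces of at most n bytes; counts[i] is the number of pieces of frame i."""
    pieces, counts = [], []
    for frame in frames:
        # An empty frame still produces one (empty) piece so it survives the round trip.
        parts = [frame[i:i + n] for i in range(0, len(frame), n)] or [frame]
        pieces.extend(parts)
        counts.append(len(parts))
    return pieces, counts


def merge_frames(pieces, counts):
    """Exact inverse of split_frames: rejoin consecutive pieces according to counts."""
    frames, start = [], 0
    for count in counts:
        frames.append(b"".join(pieces[start:start + count]))
        start += count
    return frames
```

With the pair side by side, `merge_frames(*split_frames(frames, n)) == frames` is easy to check in a test.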

madsbk · 2021-08-03T07:40:22Z

distributed/protocol/core.py

+        frames2 = []
+        lengths = []
+        compressions = []
+        from distributed.protocol.utils import frame_split_size as split


Move the import to the top of the file; I don't think there are any cyclic imports?

jakirkham · 2021-08-18T04:04:54Z

cc @gjoseph92 (who has been thinking about this from the merging side recently)

mrocklin added 2 commits August 1, 2021 10:10

mrocklin mentioned this pull request Aug 2, 2021

Split large header in comms #5149

Open

cleanup dead code

f4ca371

madsbk reviewed Aug 3, 2021

Merge branch 'main' of https://github.com/dask/distributed into simpl…

c0ddfb7

…ify-protocol

mrocklin mentioned this pull request Mar 29, 2022

[REVIEW] ToPickle - Unpickle on the Scheduler #5728

Merged

2 tasks

mrocklin requested a review from fjetter as a code owner January 23, 2024 10:57
