Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple serialization, frame splitting, and compression in protocol #5150

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

mrocklin
Copy link
Member

@mrocklin mrocklin commented Aug 1, 2021

Currently compression and frame splitting are tightly interwoven with
traversing through messages. This can be efficient, but results in a
complex system where it's hard to reason about when things get split or
compressed (indeed, this lead to a difficult to track down bug with
frame splitting).

This commit separates these processes into three separate stages:

  1. Serialize all objects into frames
  2. Split large frames
  3. Compress compressible frames

This results in a much more uniform application of splitting and
compressing. However, this comes with a couple of undesired effects.

  1. We add a new header if either splitting or compressing has occurred
  2. We no longer avoid decompression when we don't want to deserialize

There is probably a clean way to achieve most/all of our goals here.
I wanted to push this up to start this conversation.

cc @quasiben @jakirkham @madsbk

Today we try to split up large messages in comms.
This is useful in a few situations:

1.  Websockets, which often pass frames through middleware that requires
    small messages
2.  TLS, which fails on some OpenSSL versions with frames above the size
    of an int

We correctly cut up data frames into smaller pieces to address these
issues.  However we don't apply this same logic to the header frame,
which may still contain very large bytestrings.  This commit adds a
workaround in protocol dumps/loads which watches for this event and
splits the header frame up if necessary.

It works, but it's not very smooth.  I would prefer that in the future
we think about what a proper header should look like and ensure that it
contains no user data.  In the meantime this should help.
Currently compression and frame splitting are tightly interwoven with
traversing through messages.  This can be efficient, but results in a
complex system where it's hard to reason about when things get split or
compressed (indeed, this lead to a difficult to track down bug with
frame splitting).

This commit separates these processes into three separate stages:

1.  Serialize all objects into frames
2.  Split large frames
3.  Compress compressible frames

This results in a much more uniform application of splitting and
compressing.  However, this comes with a couple of undesired effects.

1.  We add a new header if either splitting or compressing has occurred
2.  We no longer avoid decompression when we don't want to deserialize

There is probably a clean way to achieve most/all of our goals here.
I wanted to push this up to start this conversation.
@madsbk
Copy link
Contributor

madsbk commented Aug 2, 2021

I think this is a very good idea but it made me wonder: why do we split frames?
One reason is the limitation of the compression libraries that might not support arbitrary sized buffer. Are there any other reasons?

Because if compression libraries are the only reason, I think we should delegate the splitting and un-splitting to the compression libraries. That is, for the compression libraries that doesn't support arbitrary buffer sizes, we wrap the compression calls spit and un-split function. I think this will make the design even more simple.

@mrocklin
Copy link
Member Author

mrocklin commented Aug 2, 2021

Various pieces of network machinery don't like large frames.

  1. Compression (as you mention)
  2. Some versions of OpenSSL
  3. Websockets

I wouldn't be surprised if would find more if we turned it off completely. There is a lot of strange software in the middle of corporate networks :)

Copy link
Contributor

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Various pieces of network machinery don't like large frames.
In that case, I think this is a good idea :)

The PR looks good to me but I suggest that you implement the Split large frames and the merge and decompress frames code in two functions. This will make the symmetry between the splitting and merging more clear.

frames2 = []
lengths = []
compressions = []
from distributed.protocol.utils import frame_split_size as split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move import to the top of the file, I don't think there is any cyclic imports?

@jakirkham
Copy link
Member

cc @gjoseph92 (who has been thinking about this from the merging side recently)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants