WIP Proper handling of application errors during get_data #4316

Conversation

fjetter (Member) commented on Dec 4, 2020

TL;DR: If we encounter application errors during get_data (e.g. in a custom deserializer), we end up reusing comm objects and swallowing handler calls, which can drive the cluster into a corrupt state (sometimes self-healing).

This is a funny one, again. It is not a fix yet, but an analysis of the problem with a semi-complete test (I'm not sure yet what the exact outcome of the test should be, but it reproduces the issue in a simple way).
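
For reference, a hypothetical sketch of what such a test could look like. This is not the PR's actual test: the Poisoned class, the test name, and the final assertion are made up; it only assumes the usual gen_cluster fixture and the dask_serialize/dask_deserialize dispatchers.

```python
import pytest

from distributed.protocol import dask_deserialize, dask_serialize
from distributed.utils_test import gen_cluster


class Poisoned:
    """Serializes fine, but deserialization always fails."""


@dask_serialize.register(Poisoned)
def _serialize_poisoned(obj):
    return {}, [b"poison"]


@dask_deserialize.register(Poisoned)
def _deserialize_poisoned(header, frames):
    raise ValueError("intentional deserialization failure")


@gen_cluster(client=True)
async def test_get_data_application_error(c, s, a, b):
    # Build the poisoned object on worker `a`, then force worker `b` to
    # fetch it as a dependency, which goes through get_data on `a` and
    # fails while deserializing on `b`.
    x = c.submit(Poisoned, workers=[a.address])
    y = c.submit(lambda v: v, x, workers=[b.address])
    # The open question of this PR is what exactly *should* happen here;
    # at minimum the failure must not leave the comm between the two
    # workers desynchronized for later requests.
    with pytest.raises(Exception):
        await y
```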

A tiny bit of context: we had some more or less deterministic deserialization issues connected to custom (de-)serialization code. We fixed that, so I'm not sure how critical this problem actually is, but from my understanding it could put the cluster into a corrupt state, at least for a while. I've seen retries eventually recover it, so I suspect this is weird rather than critical.

On top of the "Failed to deserialize" messages (which are obviously expected), we got things like:

  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py", line 1283, in get_data
    assert response == "OK", response

AssertionError: {'op': 'get_data', 'keys': ("('shuffle-split-7371cf941c2c38490c6b2db8b38e4e98', 2, 1, (6, 2))", "('shuffle-split-7371cf941c2c38490c6b2db8b38e4e98', 2, 1, (6, 1))"), 'who': 'tls://10.3.136.133:31114', 'max_connections': None, 'reply': True}
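
The shape of that AssertionError payload is telling: the assertion expects the literal string "OK", but the handler read a get_data message, which suggests the serving worker's get_data handler, while waiting for the "OK" acknowledgement after writing its data, picked up a subsequent request off the reused comm instead. Below is a small self-contained sketch of that desync pattern; it uses plain asyncio queues standing in for a comm and none of the names are distributed's.

```python
import asyncio


async def get_data_server(to_client, from_client):
    """Toy stand-in for the serving worker's get_data handler."""
    request = await from_client.get()          # 1. receive the request
    await to_client.put({"data": "poisoned"})  # 2. send the (bad) payload
    response = await from_client.get()         # 3. wait for the "OK" ack
    # Mirrors `assert response == "OK", response` in Worker.get_data.
    assert response == "OK", response


async def get_data_client(to_server, from_server):
    """Toy stand-in for the requesting worker's gather logic."""
    await to_server.put({"op": "get_data", "keys": ["x"]})
    payload = await from_server.get()
    if payload["data"] == "poisoned":
        # Application error (e.g. a custom deserializer): bail out
        # without ever sending the "OK" acknowledgement.
        raise ValueError("failed to deserialize")
    await to_server.put("OK")


async def main():
    c2s, s2c = asyncio.Queue(), asyncio.Queue()  # one reused "comm"
    server = asyncio.create_task(get_data_server(s2c, c2s))
    try:
        await get_data_client(c2s, s2c)
    except ValueError:
        pass  # error swallowed; the comm stays in use
    # The next request over the same comm lands where the server still
    # expects the acknowledgement ...
    await c2s.put({"op": "get_data", "keys": ["y"]})
    try:
        await server
    except AssertionError as exc:
        # ... so the handler sees the request dict instead of "OK".
        print("server saw:", exc)


asyncio.run(main())
```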

If you follow the XXX comments from 1 to 4, they lead through the code sequentially, in the order the events happen, to produce this issue.
