WIP Proper handling of application errors during get_data #4316

Conversation

fjetter (Member) commented on Dec 4, 2020

TL;DR: If we encounter application errors during get_data (e.g. in a custom deserializer), we end up reusing comm objects and swallowing handler calls, which can drive the cluster into a corrupt state (sometimes self-healing).

This is a funny one, again. It is not a fix yet, but an analysis of the problem with a semi-complete test (I'm not sure yet what the exact outcome of the test should be, but it reproduces the issue in a simple way).
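
For reference, a hypothetical sketch of what such a test could look like. This is not the PR's actual test: the Poisoned class, the test name, and the final assertion are made up; it only assumes the usual gen_cluster fixture and the dask_serialize/dask_deserialize dispatchers.

```python
import pytest

from distributed.protocol import dask_deserialize, dask_serialize
from distributed.utils_test import gen_cluster


class Poisoned:
    """Serializes fine, but deserialization always fails."""


@dask_serialize.register(Poisoned)
def _serialize_poisoned(obj):
    return {}, [b"poison"]


@dask_deserialize.register(Poisoned)
def _deserialize_poisoned(header, frames):
    raise ValueError("intentional deserialization failure")


@gen_cluster(client=True)
async def test_get_data_application_error(c, s, a, b):
    # Build the poisoned object on worker `a`, then force worker `b` to
    # fetch it as a dependency, which goes through get_data on `a` and
    # fails while deserializing on `b`.
    x = c.submit(Poisoned, workers=[a.address])
    y = c.submit(lambda v: v, x, workers=[b.address])
    # The open question of this PR is what exactly *should* happen here;
    # at minimum the failure must not leave the comm between the two
    # workers desynchronized for later requests.
    with pytest.raises(Exception):
        await y
```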

A tiny bit of context: we had some more or less deterministic deserialization issues connected to custom (de-)serialization code. We fixed that, so I'm not sure how critical this problem actually is, but from my understanding it could put the cluster into a corrupt state, at least for a while. I've seen retries eventually recover it, so I suspect this is weird rather than critical.

On top of the "Failed to deserialize" messages (which are obviously expected), we got things like:

  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/worker.py", line 1283, in get_data
    assert response == "OK", response

AssertionError: {'op': 'get_data', 'keys': ("('shuffle-split-7371cf941c2c38490c6b2db8b38e4e98', 2, 1, (6, 2))", "('shuffle-split-7371cf941c2c38490c6b2db8b38e4e98', 2, 1, (6, 1))"), 'who': 'tls://10.3.136.133:31114', 'max_connections': None, 'reply': True}
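
The shape of that AssertionError payload is telling: the assertion expects the literal string "OK", but the handler read a get_data message, which suggests the serving worker's get_data handler, while waiting for the "OK" acknowledgement after writing its data, picked up a subsequent request off the reused comm instead. Below is a small self-contained sketch of that desync pattern; it uses plain asyncio queues standing in for a comm and none of the names are distributed's.

```python
import asyncio


async def get_data_server(to_client, from_client):
    """Toy stand-in for the serving worker's get_data handler."""
    request = await from_client.get()          # 1. receive the request
    await to_client.put({"data": "poisoned"})  # 2. send the (bad) payload
    response = await from_client.get()         # 3. wait for the "OK" ack
    # Mirrors `assert response == "OK", response` in Worker.get_data.
    assert response == "OK", response


async def get_data_client(to_server, from_server):
    """Toy stand-in for the requesting worker's gather logic."""
    await to_server.put({"op": "get_data", "keys": ["x"]})
    payload = await from_server.get()
    if payload["data"] == "poisoned":
        # Application error (e.g. a custom deserializer): bail out
        # without ever sending the "OK" acknowledgement.
        raise ValueError("failed to deserialize")
    await to_server.put("OK")


async def main():
    c2s, s2c = asyncio.Queue(), asyncio.Queue()  # one reused "comm"
    server = asyncio.create_task(get_data_server(s2c, c2s))
    try:
        await get_data_client(c2s, s2c)
    except ValueError:
        pass  # error swallowed; the comm stays in use
    # The next request over the same comm lands where the server still
    # expects the acknowledgement ...
    await c2s.put({"op": "get_data", "keys": ["y"]})
    try:
        await server
    except AssertionError as exc:
        # ... so the handler sees the request dict instead of "OK".
        print("server saw:", exc)


asyncio.run(main())
```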

If you follow the XXX comments from 1 to 4, they lead through the code sequentially, in the order the events happen, to produce this issue.
