WIP Proper handling of application errors during get_data #4316
TL;DR: If we encounter application errors during `get_data` (e.g. in a custom deserializer), we somehow reuse comm objects and swallow handler calls, which might drive the cluster into a corrupt state (sometimes self-healing).
This is a funny one, again. It is not a fix yet, but an analysis of the problem with a semi-complete test. I'm not yet sure what the outcome of the test should be exactly, but it reproduces the problem simply.
A tiny bit of context: We had some more or less deterministic deserialization issues connected to custom (de-)serialization code. We fixed this one, so I'm not sure how critical this problem actually is, but from my understanding it could put the cluster in a corrupt state, at least for a while. I've seen retries eventually recover it, so I guess this is just weird but not critical.
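For readers who want a feel for the kind of application error meant here, the following is a minimal sketch in plain Python using only `pickle`. It is not the test in this PR and does not touch any cluster code; `BrokenOnLoad` and `try_roundtrip` are hypothetical names made up for illustration. It only shows an object that serializes fine on the sending side but raises while being deserialized on the receiving side.

```python
import pickle


class BrokenOnLoad:
    """Hypothetical payload that pickles fine but fails when unpickled."""

    def __init__(self, value):
        self.value = value

    def __getstate__(self):
        # Serialization succeeds and produces an ordinary state dict.
        return {"value": self.value}

    def __setstate__(self, state):
        # Stand-in for a bug in custom deserialization code.
        raise RuntimeError("custom deserializer failed")


def try_roundtrip(obj):
    """Serialize obj, then try to deserialize it.

    Returns (True, None) on success, or (False, message) if
    deserialization raised.
    """
    payload = pickle.dumps(obj)  # "sender" side: works
    try:
        pickle.loads(payload)  # "receiver" side: may raise
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, None
```

Calling `try_roundtrip(BrokenOnLoad(1))` reports a failure that only shows up on the receiving end, which is the situation the analysis here is concerned with.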
On top of the "Failed to deserialize" messages (which are obviously expected) we got things like
If you follow the XXX comments from 1 to 4, they lead through the code sequentially, in the order the events happen to produce this issue.