Add self-check pg_notify messages for silently dropped connections #14749
+32
−4
SUMMARY
The problem this addresses is roughly established by this form of testing:
The dispatcher does not detect the dropped connection, which is bad. This leaves the dispatcher in a state where it no longer receives messages. Worse, it doesn't know that it cannot receive messages. I have poked around several options for testing the connection outside of this, and they all seem terrible. I've concluded that I would need threading to test by running SELECT 1, because the very scenario I want to diagnose causes that query to hang, which makes the dispatcher even less functional than if we had not added the check. That leads me to the nuclear option: this. This is how it looks from logs:
Obvious drawback: this is slow to detect problems. I've set this to send every 60 seconds, and receiving has a tolerance of 20 seconds plus whatever task interval we encounter, thus that 26 seconds. These delays are additive, and we add at least 1 second of delay in the retry loop, so call it 87 seconds (60 + 26 + 1) here to recover. To articulate my feelings:
The advantages of this are that it detects this state when nothing else can, and it goes into the error handling loop we want which re-connects but doesn't arbitrarily fail jobs in progress.
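To make the timing concrete, here is a minimal sketch of the bookkeeping, not the code in this PR. The class name `SelfCheckMonitor`, its methods, and the injectable `clock` are all hypothetical; only the 60 second send interval and 20 second tolerance come from the description above. The actual sending (via pg_notify) and receiving are left to the caller.

```python
import time


class SelfCheckMonitor:
    """Hypothetical helper: tracks self-check messages sent to ourselves."""

    def __init__(self, send_interval=60.0, tolerance=20.0, clock=time.monotonic):
        self.send_interval = send_interval  # seconds between self-check sends
        self.tolerance = tolerance          # seconds we wait for the echo
        self.clock = clock                  # injectable so tests can fake time
        self.last_sent = None               # time of the most recent send
        self.pending_since = None           # time of the oldest unanswered send

    def should_send(self):
        """True when it is time to send another self-check message."""
        if self.last_sent is None:
            return True
        return self.clock() - self.last_sent >= self.send_interval

    def mark_sent(self):
        """Call right after sending a self-check message."""
        now = self.clock()
        self.last_sent = now
        if self.pending_since is None:
            self.pending_since = now

    def mark_received(self):
        """Call when our own self-check message arrives on the listener."""
        self.pending_since = None

    def connection_suspect(self):
        """True if a self-check has gone unanswered for longer than the tolerance."""
        if self.pending_since is None:
            return False
        return self.clock() - self.pending_since > self.tolerance
```

In a dispatcher-style loop, the caller would check `should_send()` and `connection_suspect()` on each pass and raise into the existing reconnect handling when the latter returns True. Because the check only runs when the loop wakes up, the effective detection delay is the tolerance plus the task interval, which is where the extra seconds beyond 20 come from.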
ISSUE TYPE
COMPONENT NAME