Add self-check pg_notify messages for silently dropped connections #14749

Draft
wants to merge 5 commits into
base: devel

AlanCoding
Member

SUMMARY

The problem this addresses is roughly established by this form of testing:

awx-manage run_dispatcher --status
strace -f -s 1024 -p 8491
lsof -p 8491 | grep 28
# run as root
iptables -I OUTPUT -p tcp --sport 34586 -j DROP

The dispatcher does not detect the dropped connection, which is bad. This leaves the dispatcher in a state where it no longer receives messages. Worse, it doesn't know that it cannot receive messages. I have poked around several options for testing the connection outside of this, and they all seem terrible. I think I've concluded that I would need threading to test with SELECT 1, because the very scenario I want to diagnose causes that query to hang, which would make the dispatcher even less functional than if we had not added the check.
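For illustration only, here is a minimal sketch of what a thread-guarded probe could look like. This is not what this PR does. `probe_with_timeout` and the `probe` callable are hypothetical names, and `probe` stands in for whatever would run SELECT 1 on the listener connection.

```python
import threading


def probe_with_timeout(probe, timeout):
    """Run probe() in a helper thread and classify the outcome.

    Returns "ok" if probe() returned in time, "error" if it raised,
    and "hung" if it was still running after `timeout` seconds.
    `probe` is a hypothetical stand-in for a SELECT 1 on the connection.
    """
    result = {}

    def runner():
        try:
            probe()
            result["ok"] = True
        except Exception as exc:  # keep the error for the caller to inspect
            result["error"] = exc

    # daemon=True so a permanently hung probe cannot block process exit
    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return "hung"
    return "ok" if result.get("ok") else "error"
```

The catch is visible in the "hung" branch: the helper thread is left blocked on the dead socket, so every failed probe would strand another thread, which is part of why this route looks unattractive.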

That leads me to the nuclear option - this. This is how it looks from logs:

tools_awx_1 | 2024-01-09 15:56:38,547 ERROR    [-] awx.main.dispatch Error consuming new events from postgres, will retry for 40 s
tools_awx_1 | Traceback (most recent call last):
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/base.py", line 267, in run
tools_awx_1 |     conn.select_timeout = self.run_periodic_tasks(conn)
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/base.py", line 237, in run_periodic_tasks
tools_awx_1 |     raise db.DatabaseError(f'pg_notify self-check missing after {delta:.3f}s, did you drop the connection?')
tools_awx_1 | django.db.utils.DatabaseError: pg_notify self-check missing after 26.015s, did you drop the connection?
tools_awx_1 | 2024-01-09 15:56:39,576 INFO     [-] awx.main.dispatch Dispatcher listener connection established

Obvious drawback - this is slow to detect problems. I've set this to send every 60 seconds, and receiving has a tolerance of 20 seconds plus whatever task interval we encounter, thus the 26 seconds in the log above. These are additive, and we add at least 1 second of delay in the retry loop, so call it 60 + 26 + 1 = 87 seconds to recover here. To articulate my feelings:

[image: recover]

The advantages of this are that it detects this state when nothing else can, and it goes into the error handling loop we want which re-connects but doesn't arbitrarily fail jobs in progress.
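The timing logic described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `NotifySelfCheck`, `SelfCheckTimeout`, the `send` callable and the `"self-check"` payload are all hypothetical names. The 60 second interval and 20 second tolerance are the values mentioned in the description.

```python
import time


class SelfCheckTimeout(Exception):
    """Raised when a self-check was sent but never came back."""


class NotifySelfCheck:
    """Hypothetical helper: periodically message ourselves over pg_notify."""

    def __init__(self, send, interval=60.0, tolerance=20.0, clock=time.monotonic):
        self.send = send            # assumed to issue pg_notify on a separate connection
        self.interval = interval    # seconds between self-check messages
        self.tolerance = tolerance  # seconds to wait for the message to come back
        self.clock = clock
        self.last_sent = None       # send time of the outstanding check, if any
        self.next_send = clock() + interval

    def on_message(self, payload):
        """Call for every received notification; True if it was our self-check."""
        if payload == "self-check":
            self.last_sent = None
            return True
        return False

    def run_periodic(self):
        """Call from the listener loop whenever periodic tasks run."""
        now = self.clock()
        if self.last_sent is not None:
            delta = now - self.last_sent
            if delta > self.tolerance:
                self.last_sent = None
                # The caller's error handling is expected to reconnect the listener.
                raise SelfCheckTimeout(f"pg_notify self-check missing after {delta:.3f}s")
        elif now >= self.next_send:
            self.send("self-check")
            self.last_sent = now
            self.next_send = now + self.interval
```

Because `run_periodic` only runs when the loop wakes up, the reported delay is the tolerance plus however long it takes to reach the next periodic task, which matches the 26 seconds (rather than exactly 20) seen in the log.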

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API
